Wednesday, October 24, 2018

Pause the update of the blog

I am currently working on writing a book, Python Programming and Numerical Methods: A Guide for Engineers and Scientists. This is a book will be used for the data science efforts at Berkeley Division of Data Sciences here and will be used as the textbook for a course next year. 

Therefore, I don't have time to update my blog before I can finish the book. My estimation to finish the book is about April 2019. Hopefully after that, I could continue to work on my blog. Thanks. 


Sunday, September 16, 2018

Spell checking in Jupyter notebook markdown cells

These days, I started to write more and more in Jupyter notebook, not only my blog, but also an entire book within it. Therefore, spell checking is very important to me. But the online grammarly check doesn't work with the notebook, so that I need a new way to do it. 
After googling it, it seems adding the spell check for the markdown cell is easy. It is all based on the discussions here. The basic steps are:
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
jupyter nbextension enable spellchecker/main
After successfully running the above commands, you should see the screen like this:
Success install
Now if you launch the Jupyter notebook, and in the dashboard, you will see a new option for the Nbextensions. 
nbextension
Within it, you can see all the extensions and the ones you enabled. 
Enable nbextension
Now, let's go to the notebook, and type something in a markdown cell, we could see that it can highlight the word we misspelled. 
Spell check

Saturday, September 8, 2018

Quickly remove the old kernels for Jupyter

Over time, when we use Jupyter notebook or Jupyter lab, we will have more and more old kernels. As the following figure shows, I have the old Julia kernels that I want to get rid of.
Then I found it is very easy to get rid of these old kernels. The first thing we could do is to run the command to see what kernels do we have: jupyter kernelspec list


Then we could delete the old kernel by deleting the folder showing in the list and then install the new ones. See the following figure after I deleted the kernels and installed the new kernels. 


Thursday, September 6, 2018

Slice datetime with milliseconds in Pandas dataframe

I use pandas a lot for dealing with time series. Especially the function that it could easily slice the time range you want. But recently, I need to slice between two timestamps with milliseconds, then it is not straightforward. It took me some time to figure it out (I didn't find any useful information online). Therefore, I just document it here if you have the same problem. You can find the notebook version of the blog on Qingkai's Github
import pandas as pd

First generate a dataframe with datetime as the index

Let's first generate a dataframe with datatime as the index and a counter as another column. 
## define the start and end of the time range
t0 = '2018-01-01 00:00:00.000'
t1 = '2018-01-02 00:00:00.000'

## generate the sequence with a step of 100 milliseconds
df_times = pd.date_range(t0, t1, freq = '100L', tz= "UTC")

## put this into a dataframe
df = pd.DataFrame()
df['datetime'] = df_times
df['count'] = range(len(df_times))
df = df.set_index('datetime')
df.head()
count
datetime
2018-01-01 00:00:00+00:000
2018-01-01 00:00:00.100000+00:001
2018-01-01 00:00:00.200000+00:002
2018-01-01 00:00:00.300000+00:003
2018-01-01 00:00:00.400000+00:004

Let's slice a time range

We first slice the data between two times. We can see it works well without the milliseconds in the start and end time
df['2018-01-01 00:00:00':'2018-01-01 00:00:01']
count
datetime
2018-01-01 00:00:00+00:000
2018-01-01 00:00:00.100000+00:001
2018-01-01 00:00:00.200000+00:002
2018-01-01 00:00:00.300000+00:003
2018-01-01 00:00:00.400000+00:004
2018-01-01 00:00:00.500000+00:005
2018-01-01 00:00:00.600000+00:006
2018-01-01 00:00:00.700000+00:007
2018-01-01 00:00:00.800000+00:008
2018-01-01 00:00:00.900000+00:009
2018-01-01 00:00:01+00:0010
2018-01-01 00:00:01.100000+00:0011
2018-01-01 00:00:01.200000+00:0012
2018-01-01 00:00:01.300000+00:0013
2018-01-01 00:00:01.400000+00:0014
2018-01-01 00:00:01.500000+00:0015
2018-01-01 00:00:01.600000+00:0016
2018-01-01 00:00:01.700000+00:0017
2018-01-01 00:00:01.800000+00:0018
2018-01-01 00:00:01.900000+00:0019
What if I want to slice two times with milliseconds as the following, we could see that we experience an error that has no information to help us to identify what happened. 
df['2018-01-01 00:00:00.500':'2018-01-01 00:00:01.200']
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

~/miniconda2/envs/python3/lib/python3.6/site-packages/pandas/core/indexes/datetimes.py in slice_indexer(self, start, end, step, kind)
   1527         try:
-> 1528             return Index.slice_indexer(self, start, end, step, kind=kind)
   1529         except KeyError:


~/miniconda2/envs/python3/lib/python3.6/site-packages/pandas/core/indexes/base.py in slice_indexer(self, start, end, step, kind)
   3456         start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 3457                                                  kind=kind)
   3458 


~/miniconda2/envs/python3/lib/python3.6/site-packages/pandas/core/indexes/base.py in slice_locs(self, start, end, step, kind)
   3657         if start is not None:
-> 3658             start_slice = self.get_slice_bound(start, 'left', kind)
   3659         if start_slice is None:


~/miniconda2/envs/python3/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
   3583         # to datetime boundary according to its resolution.
-> 3584         label = self._maybe_cast_slice_bound(label, side, kind)
   3585 


~/miniconda2/envs/python3/lib/python3.6/site-packages/pandas/core/indexes/datetimes.py in _maybe_cast_slice_bound(self, label, side, kind)
   1480             _, parsed, reso = parse_time_string(label, freq)
-> 1481             lower, upper = self._parsed_string_to_bounds(reso, parsed)
   1482             # lower, upper form the half-open interval:


~/miniconda2/envs/python3/lib/python3.6/site-packages/pandas/core/indexes/datetimes.py in _parsed_string_to_bounds(self, reso, parsed)
   1317         else:
-> 1318             raise KeyError
   1319 


KeyError: 


During handling of the above exception, another exception occurred:


KeyError                                  Traceback (most recent call last)

<ipython-input-4-803941334466> in <module>()
----> 1 df['2018-01-01 00:00:00.500':'2018-01-01 00:00:01.200']


~/miniconda2/envs/python3/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2125 
   2126         # see if we can slice the rows
-> 2127         indexer = convert_to_index_sliceable(self, key)
   2128         if indexer is not None:
   2129             return self._getitem_slice(indexer)


~/miniconda2/envs/python3/lib/python3.6/site-packages/pandas/core/indexing.py in convert_to_index_sliceable(obj, key)
   1978     idx = obj.index
   1979     if isinstance(key, slice):
-> 1980         return idx._convert_slice_indexer(key, kind='getitem')
   1981 
   1982     elif isinstance(key, compat.string_types):


~/miniconda2/envs/python3/lib/python3.6/site-packages/pandas/core/indexes/base.py in _convert_slice_indexer(self, key, kind)
   1462         else:
   1463             try:
-> 1464                 indexer = self.slice_indexer(start, stop, step, kind=kind)
   1465             except Exception:
   1466                 if is_index_slice:


~/miniconda2/envs/python3/lib/python3.6/site-packages/pandas/core/indexes/datetimes.py in slice_indexer(self, start, end, step, kind)
   1536                 if start is not None:
   1537                     start_casted = self._maybe_cast_slice_bound(
-> 1538                         start, 'left', kind)
   1539                     mask = start_casted <= self
   1540 


~/miniconda2/envs/python3/lib/python3.6/site-packages/pandas/core/indexes/datetimes.py in _maybe_cast_slice_bound(self, label, side, kind)
   1479                            getattr(self, 'inferred_freq', None))
   1480             _, parsed, reso = parse_time_string(label, freq)
-> 1481             lower, upper = self._parsed_string_to_bounds(reso, parsed)
   1482             # lower, upper form the half-open interval:
   1483             #   [parsed, parsed + 1 freq)


~/miniconda2/envs/python3/lib/python3.6/site-packages/pandas/core/indexes/datetimes.py in _parsed_string_to_bounds(self, reso, parsed)
   1316             return (Timestamp(st, tz=self.tz), Timestamp(st, tz=self.tz))
   1317         else:
-> 1318             raise KeyError
   1319 
   1320     def _partial_date_slice(self, reso, parsed, use_lhs=True, use_rhs=True):


KeyError: 

The solution

I found out an easy solution to the problem, instead of directly slice the data, we first find the index that meets our requirements, and then use the index to find the data as shown below:
ix = (df.index >= '2018-01-01 00:00:00.500') & (df.index <='2018-01-01 00:00:01.200')
df[ix]

Tuesday, September 4, 2018

My Researcher experience for one month

I've been in my new position - Assistant Data Science Researcher for one month now, it is so different than that of a Ph.D. student. I want to use this blog to document my experience with this new position for this one month.

First of all, I am a little surprised by this researcher position. Because at the beginning, I am a little sad that I didn't find a faculty position and think this researcher position is just an enhanced postdoc position. But after one month, I found this position is a light version of a faculty position that I could practice a lot of things I wouldn't have chance as a postdoc (actually, at some other university, this position is called - Research Professor, but at Berkeley, we don't have this title). Why light? (1) We don't have a startup as a faculty to start our research team; (2) We are on a soft-money position, which means that our salary has a certain percentage fixed, all the others that we need to write proposals to get myself funded (it ranges from 6 month - 12 months at different cases, luckily, I do have some portion come as fixed for now). In contrast, a faculty will have 9-month salary fixed, and only 3-month's salary needs to be raised every year, much better and easier than 6 to 12 month; (3) We can not grant degrees to students, at least for assistant researcher (I heard at a certain level, we could serve as the committee member of Ph.D. students, but I forgot when); (4) Researchers don't have the obligation to teach a class or serve administration services, all we need to do is to work on our research. I guess these are the main differences.

On the contrary, there are also many similarities between researcher and faculty position. (1) We could serve as a PI to apply for research grants; (2) We could build our own research team to work on interesting things; (3) We have similar ranks, i.e. assistant, associate, and full researchers.

Anyway, back to my experience of this one month. I was actively thinking about the following things as a researcher:
  • Now I need to think more ideas and write proposals to get myself funded, and then potentially to start my research team by hiring some students/postdocs to work with me (of course, maybe co-advice with some faculties). 
  • I think I should dream big at the beginning about set up this research team, this will determine my future level. Since if I could have a large research team by having some students working with me, then I could get more work done, which in turn may attract more funding. This is kind of like a positive feedback loop. 
  • I should do something creative work, even though it may take longer time to get some publication, as long as it is something fun and could distinguish myself, I think it will worth pursuing. 
  • I also found myself is deviated more and more from the technical work I really like to do, for example, programming, building machine learning models, etc. As a researcher, now I do more thinking, more meetings, more writing on proposals, but less technical work as I did as a student. But I guess this is good because I can really think something fun to work on, and then in the future to find some students to work out the ideas. 
Overall, it is a great experience, even though I do feel higher pressure now. But I am happy to practice and experience my skills on this light version of the faculty position. I hope in the near future, I could get a faculty position after improving my skills on this position. Let's see more things may change on me in the future......

Thursday, August 16, 2018

My business card

I got my business card for my new position at Berkeley, I really like the design, see the following for my contact information. 


Sunday, August 12, 2018

Setup Jupyter notebook for Julia

I think this week, the biggest news for me, maybe is Julia 1.0 released. It is a great language that I was thinking to learn for a long time (I just played it briefly before), therefore, I think I could use this 1.0 release as an excuse to learn it. 
Julia is a great language that has a goal to put the best features from different languages into one:
Using the reasons from Why we created Julia:
We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.
See the key names of the languages: C, Ruby, Lisp, Matlab, Python, R, Perl, Shell, isn't really cool to learn this Julia! I will! 

Setup the environment

Let's see how could we setup the environment using Jupyter notebook + julia (I really like to work in Jupyter notebook, therefore, this is my first step to set it up). I assume you already had Jupyter notebook. 
  1. Download Julia from - Download Julia
  2. Add Julia into your .bashrc file - export PATH=/Applications/Julia-1.0.app/Contents/Resources/julia/bin:$PATH
  3. In the terminal, open Julia
  4. In Julia, install IJulia: png
  5. Open Jupter notebook, and select kernel: Julia 1.0.0 (note that, I had some old kernels when I played before) png
  6. Let's run the famous 'Hello World!'
println("Hello World!")
Hello World!

Sunday, August 5, 2018

My first NSF proposal

In the last few weeks, I was intensively working on my first NSF proposal - Computer and Information Science and Engineering (CISE) Research Initiation Initiative (CRII). Because as a data science researcher here at Berkeley, I need to write my own proposal and get funded to do my research. The CRII grant is aimed to help the early career scientists to start their own research by providing them a two-year up to $175,000 funding. I did participate in some proposal preparation during my Ph.D. study, but mostly prepare one paragraph or some figures for my advisor. Therefore, this proposal is really my very first proposal written solely by me. I am documenting this very first experience for future references. Note that, it is not your guide, since I am not sure whether some the things I write down is correct or not, but hope some parts are useful to you if you are also writing an NSF proposal. 

Proposal idea

It starts when we saw a program that fit the idea we have on NSF website, after a check of all the requirements and the due date, we think it is a good solicitation that I should try to apply for. 

Proposal Preparation

Since this is my first proposal (I am the sole PI on it), therefore, there are so many things I need to prepare. First, not related to the proposal writing: 
  1. I need first sign up on the NSF Fastlane first and ask the campus to associate my account to UC Berkeley.
  2. Get a clear idea of the whole submitting process. For example, I learned that I need to submit to the campus 5 business days earlier than the actual deadline, even though I could still change my technical details in the proposal. To whom on campus I will submit to and get a draft budget for the proposal. And so on ...
  3. Get some sample proposals from my colleagues and friends, all together I got 4 copies of them, 2 in Earth Science and 2 in Computer Science (since the proposal I submitted in the NSF Computer Science Division for a data science related proposal). 
Finally get to the part of the proposals itself. For this proposal, I need to prepare:
  1. BIO sketch
  2. Project one-page summary
  3. Project description
  4. Data Management Plan
  5. Facilities, equipment and other resources
  6. Chair support letter
  7. Collaboration letters
  8. Past collaborators
  9. Budget
  10. Budget justification 
I will mainly talk about the project description (10-15 page), since most of the others can be written using some templates and change them according to my proposed work, and the one-page summary is what your shortened version of the project description by giving only the introduction and the Intellectual Merit and Broader Impacts. 

Project description

The project description is the main content of my proposal, it includes: 
  • Vision statements
  • Backgrounds
  • Research plans 
  • Evaluation plans
  • Intellectual Merit
  • Broader Impacts
  • Results from Prior NSF Support
The vision statement is my one page to describe my vision of the proposed work, why it is important? Why the challenges are interesting and useful are the questions to be addressed here. 
The background section is where I listed the current and related research in the fields to show that I am aware of the whole fields, know the work has been done, know the limitations of the current work and then transit into the next section that I introduce my solutions to some of the current limitations. 
The research plans section is the heavy part of this description, this is the part I list the proposed work which could potentially address the existing challenges. I start with a description of the current challenges and then my proposed methods to address them. I was told that the proposed methods should be detailed enough that the reviewers would see you have a viable working plan, but not too specific, since the work is not been done yet. It is better to mention some alternative ways as well, since if your proposed methods are not working, the reviewers want to see if you have a different plan to fix it. Also, I put some preliminary results I have done for the proposed work.
In the evaluation plans, I listed the metrics to evaluate the work I proposed and the potential deliverables. 
The Intellectual Merit is the section that I try to explain how great my methods are to address the current challenges, and also why they are important. 
In the broader impacts section, I listed how this work will benefit the society and public; how it will facilitate potential collaborations between academia, industry, public and government; how it will train the undergraduate students, graduate students, and postdocs, the next generation workforce. Lastly, how the results will be disseminated by attending conferences, publishing papers, giving public presentations, participating science faires, maker faires and so on; how we could add the problems, results into the courses we are teaching here at Berkeley to give students more chances to get access to real-world problems in class. 
And last, the results from Prior NSF Support section, since I don't have any previous research results using NSF's funding, I didn't list anything here. But I do list my I-Corps award from NSF last year to show that I do interact with NSF before. 

A note on the budget

I was thinking I need to prepare the budget from scratch, but it is not. I only need to send an email to the grant office on campus (different names at different universities, but I'd like to call them grant office), and tell them what do I need. Then they will send a draft excel spreadsheet with all the numbers tunable by you to hit the target number. This saves my efforts, thanks to the colleagues at the grant office. 
NSF favors the budget to spend mostly on the students or postdocs (which is a very good thing), as a PI, the maximum number of months you can request for one proposal is 2 months (I heard you can request more, but you need a justification for that). Besides the salary for students/postdocs and the PI, budget on travel for conferences, and computational services or equipment. But all these needs justification why do you need them. 

Quick reflections on my first experience

  • I found the project description is the key to success, do start early and iterate with your mentors to make improvement on it. 
  • The sample previous proposals from your colleagues are really useful, try to go through them carefully and you will learn a lot from them. 
  • I read online that it is better to have the first proposal written with a senior scientist instead of being a sole PI on it. I think this is a very good advice. 
  • The background section requires you do a literature search, do prepare time for that to show your understanding of the field. 
  • Don't squeeze to much stuff into the proposal, make it simple and clear, and the reviewers will be happy about it. 
  • It is better to have some pictures in the proposal, a good rule of thumb I was told is to have one picture every other page. After all, one picture worth 1000 words. 
  • Leave more time on, I only have about 3 weeks to work on this proposal, I think the time is too short. 
  • After you write the whole package of the proposal, wait for a few days to go through them one more time, and you will find some better ideas (at least for me). 
  • My Earth science colleagues told me in the project description, it is better to write in a way that starts with a hypothesis and end with a potential solution. But my computer science colleagues don't think I need that, nor the sample proposals from CS have that. Therefore, there is clearly a difference between the different divisions. 
  • For the broader impact, I was told that it should be a summary for layman, since NSF will show this paragraph to the legislators and public. 
  • I was too afraid to talk to a program manager, but many people told me I should have, since they are usually very nice and give you a lot of ideas and first round comments. Maybe next time. 
  • The chance for me to get the grant is very slim, but at least, I went through the whole process, and now have a much better view of it. This will definitely help my next proposal. Also, hope that I could get some valuable comments from the reviewers so that I could make improvement. 

Acknowledgment

Of course, I need to thank all the people who helped me during my preparation, especially my friend professors from the University of Colorado Boulder, San Fransisco State University, my advisor and colleagues at Berkeley. They are my mentors to get through this first proposal, without them, it will be much more difficult for me.

Sunday, July 29, 2018

Making a map with clustered markers

This week I was making a map with many points on the map - plot all the GPS stations from UNR's Nevada Geodetic Laboratory, they have a google map that can show all the GPS stations. See the following figure:
png
But I want to plot a nice map that has the clustered markers that I can easily see the number of stations in a region. Folium, the package I usually use could do it, but not so beautiful. Therefore, I changed to use google map to generate the nice clustered markers. Here is the code that I modified slightly from Google developers. You can see the following figure as the final results I have (you can find all the locations of the GPS stations here):
png
The code is in javascript, but it should be simple to use it, note that, you need to get your own API key from google, but it should be easy to get it and replace the 'YOUR_API_KEY'
<!DOCTYPE html>
<html>
  <head>
    <meta name="viewport" content="initial-scale=1.0, user-scalable=no">
    <meta charset="utf-8">
    <title>GPS stations</title>
    <style>
      /* Always set the map height explicitly to define the size of the div
       * element that contains the map. */
      #map {
        height: 100%;
      }
      /* Optional: Makes the sample page fill the window. */
      html, body {
        height: 100%;
        margin: 0;
        padding: 0;
      }
    </style>
  </head>
  <body>
    <div id="map"></div>
    <script>

      function initMap() {

        var map = new google.maps.Map(document.getElementById('map'), {
          zoom: 1,
          center: {lat: 37.5, lng: -123.5}
        });


        // Note: The code uses the JavaScript Array.prototype.map() method to
        // create an array of markers based on a given "locations" array.
        /* The map() method here has nothing to do with the Google Maps API,  
        it creates a new array with the results of calling a provided function  
        on every element in the calling array*/
        var markers = locations.map(function(location) {
          return new google.maps.Marker({
            position: location,
          });
        });

        // Add a marker clusterer to manage the markers.  
        // The imagePath gives the icon of the clusters
        var markerCluster = new MarkerClusterer(map, markers,
            {imagePath: 'https://developers.google.com/maps/documentation/javascript/examples/markerclusterer/m'});
      }
      var locations = [
        {lat: -12.466640, lng: -229.156013},
        {lat: -12.478224, lng: -229.017953},
        {lat: -12.355923, lng: -229.118271},
        {lat: 30.407425, lng: -91.180262},
        {lat: 31.750800, lng: -93.097604},
        {lat: 32.529034, lng: -92.075906},
        {lat: -23.698194, lng: -226.117249},
        {lat: -23.766992, lng: -226.120783},
        {lat: -37.771915, lng: -67.715566},
        {lat: 64.028926, lng: -142.075778},
        {lat: -33.768794, lng: -208.883665},
        {lat: -23.698190, lng: -226.117246},
        {lat: -23.766988, lng: -226.120781},
        {lat: 41.838658, lng: -119.653981},
        {lat: 41.853118, lng: -119.607364},
        {lat: 41.850735, lng: -119.577144},
      ]
    </script>
    <script src="https://developers.google.com/maps/documentation/javascript/examples/markerclusterer/markerclusterer.js">
    </script>
    <script async defer
    src="https://maps.googleapis.com/maps/api/js?key=YOUR_API_KEY&callback=initMap">
    </script>
  </body>
</html>