This week, we will briefly talk about kernel density estimation, which is a useful way to estimate the probability density function of a random variable. The idea is quite simple; let's start by showing an example of density estimation using a Gaussian kernel for the 1D case. You can find all the code at Qingkai's Github.
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-poster')
%matplotlib inline
Generate a random 1D dataset
We first generate a bi-modal distribution using two Gaussians.
np.random.seed(12)
X = np.concatenate((np.random.normal(loc = -5, scale= 2, size = 100), np.random.normal(loc = 10, scale= 3, size = 100)))
# let's shuffle the order of the data (this is mostly for the animation later)
np.random.shuffle(X)
plt.figure(figsize = (10,8))
plt.hist(X, bins = 15, density = True)
plt.plot(X, np.zeros(len(X)), '*k')
plt.ylim((-0.01, 0.1))
plt.xlabel('Values')
plt.ylabel('Density')
plt.show()
Define Gaussian kernels
We define the Gaussian kernel as the following function, then generate a grid and plot an example of a Gaussian kernel centered at 5.
def gaussian_kernel(x, mu, sigma):
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))
# generate grid
min_v = np.min(X)
max_v = np.max(X)
grid = np.linspace(min_v, max_v, 100)
# plot out the example
plt.figure(figsize = (10,8))
plt.plot(grid, gaussian_kernel(grid, mu = 5, sigma = 1.0))
plt.title('Example of a kernel centered at 5')
plt.show()
Estimate the density using the kernels
Kernel density estimation is actually very simple: for each data point, we center a Gaussian kernel at it, and then we sum all the kernels (divided by the number of points) to get the density estimate. The following animation shows this process; we add one kernel at a time (the grey curves), and the red curve is our density estimate, which sums up all the grey curves. Note that the width of the kernel affects the smoothness of the density estimate, as we will see in the example below.
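In formula form (just a restatement of what the code below computes, using the same normalization), the estimate from $n$ data points $x_1, \dots, x_n$ with kernel width $\sigma$ is

$$\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, \exp\!\left(-\frac{(x - x_i)^2}{2\sigma^2}\right)$$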
import matplotlib.animation as manimation

density_estimation = np.zeros(len(grid))
FFMpegWriter = manimation.writers['ffmpeg']
metadata = dict(title='Kernel density estimation', artist='Qingkai Kong',
                comment='This is for gaussian kernels')
writer = FFMpegWriter(fps=10, metadata=metadata)
fig = plt.figure(figsize = (10,8))
line1, = plt.plot(grid, density_estimation, 'r-', label = 'Sum of all kernels')
plt.ylim(-0.01, 0.1)
plt.legend()
plt.xlabel('Values')
plt.ylabel('Density')
with writer.saving(fig, "Kernel_density_estimation.mp4", 100):
    for x in X:
        # mark the data point on the x-axis
        plt.plot(x, 0, '*k')
        # Gaussian kernel centered at this data point
        kernel = gaussian_kernel(grid, mu = x, sigma = 1.0)
        # add this kernel's contribution to the running density estimate
        density_estimation += kernel / len(X)
        # plot the individual kernel in grey (scaled down for display)
        plt.plot(grid, kernel / 8, 'k', alpha = 0.1)
        line1.set_data(grid, density_estimation)
        writer.grab_frame()
Let's see the effect of a wider kernel on the density estimation.
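As a quick static illustration (a sketch rather than the original animation; sigma = 3.0 is just an assumed choice for the "wider" kernel), we can compare the estimates from a narrow and a wide kernel directly:

plt.figure(figsize = (10,8))
# compare a narrow and a wider kernel; only the kernel width changes
for sigma, color in [(1.0, 'r'), (3.0, 'b')]:
    density = np.zeros(len(grid))
    for x in X:
        # same summation as in the animation above
        density += gaussian_kernel(grid, mu = x, sigma = sigma) / len(X)
    plt.plot(grid, density, color, label = 'sigma = %.1f' % sigma)
plt.xlabel('Values')
plt.ylabel('Density')
plt.legend()
plt.show()

The wider kernel gives a smoother, flatter estimate; if the kernel is made too wide, the two peaks will eventually blur into one.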
Kernel density estimation using Scipy
The above code is just to show the basic idea of kernel density estimation. In practice, we don't need to write the code ourselves; we can use existing packages such as scipy or sklearn. Here we show the example using scipy. We can see that the resulting density estimate is very similar to what we got above.
from scipy import stats
# the second argument sets the bandwidth (bw_method); a small value gives a narrow kernel
kernel = stats.gaussian_kde(X, 0.1)
plt.figure(figsize = (10,8))
plt.plot(grid,kernel.evaluate(grid), 'r-')
plt.xlabel('Values')
plt.ylabel('Density')
plt.show()
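Since sklearn was mentioned above as an alternative, here is a minimal sketch of the same 1D estimate with sklearn's KernelDensity (the bandwidth of 1.0 is an assumed choice; sklearn returns log densities, so we exponentiate):

from sklearn.neighbors import KernelDensity
# sklearn expects a 2D array of shape (n_samples, n_features)
kde = KernelDensity(kernel = 'gaussian', bandwidth = 1.0).fit(X[:, np.newaxis])
# score_samples returns the log of the density, so take the exponential
density = np.exp(kde.score_samples(grid[:, np.newaxis]))
plt.figure(figsize = (10,8))
plt.plot(grid, density, 'r-')
plt.xlabel('Values')
plt.ylabel('Density')
plt.show()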
2D density estimation
We can also extend this density estimation to 2 dimensions. The idea is still the same: instead of summing 1D Gaussian kernels, we now sum 2D Gaussian kernels to build the estimate. Let's see the example using scipy for this 2D case.
# generate data
np.random.seed(12)
X = np.concatenate((np.random.normal(loc = -5, scale= 2, size = (100,2)), np.random.normal(loc = 10, scale= 3, size = (100,2))))
# get the mesh
m1, m2 = X[:, 0], X[:, 1]
xmin = m1.min()
xmax = m1.max()
ymin = m2.min()
ymax = m2.max()
# get the density estimation on a 100 x 100 grid (note that X is reused here as the grid)
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)
# plot the result
fig, ax = plt.subplots(figsize = (10,8))
plt.imshow(np.rot90(Z), cmap=plt.cm.jet,
extent=[xmin, xmax, ymin, ymax])
plt.colorbar()
ax.plot(m1, m2, 'k.', markersize=5)
ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
plt.xlabel('X values')
plt.ylabel('Y values')
plt.show()