Wednesday, June 13, 2018

Profile your code in Jupyter notebook/lab

We discussed using profiler to profile your code and find out where it is slow in the previous blog, and but you need to run from command line. Today, we will have a look of the profile code in jupyter notebook. Note that, if you haven’t installed ‘line_profiler’, install it first:
pip install line_profiler
Let’s first define some functions to calculate random things. There are three functions that calling one by one.
def square_the_value(x, y):
    a = add_1000_times(x, y)
    return a**2

def add_1000_times(x, y):
    z = 0
    for i in range(1000):
        z += x
        for j in range(1000):
            z += y
    return z

def calculate_my_value(x, y):
    a = x + y
    b = x - y
    print(square_the_value(a, b))
calculate_my_value(1, 2)
Now we want to have an idea of which part of the code running fast and which part running slow. We could use the line_profiler to do the job. First, we need to load the extension:
%load_ext line_profiler
Let’s profile the top level function that we run. We can see that we use ‘%lprun’, which basically run the line_profiler, the ‘-f’ flag is to tell it which function or method we want to profile, and the calculate_my_value(1, 2) is the real statement that we want to run:
%lprun -f calculate_my_value calculate_my_value(1, 2)

Timer unit: 1e-06 s

Total time: 0.295409 s
File: <ipython-input-1-0c3fada21717>
Function: calculate_my_value at line 16

Line #      Hits         Time  Per Hit   % Time  Line Contents
    16                                           def calculate_my_value(x, y):
    18         1          3.0      3.0      0.0      a = x + y
    19         1          1.0      1.0      0.0      b = x - y
    21         1     295405.0 295405.0    100.0      print(square_the_value(a, b))
Now we could see that the line_profiler give us the time to run each line, and what’s the percentage of this line takes. We could see that the last line used all the time. We can continue to profile the last time by entering into the square_the_value function:
%lprun -f square_the_value calculate_my_value(1, 2)

Timer unit: 1e-06 s

Total time: 0.39605 s
File: <ipython-input-1-0c3fada21717>
Function: square_the_value at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def square_the_value(x, y):
     3         1     396048.0 396048.0    100.0      a = add_1000_times(x, y)
     5         1          2.0      2.0      0.0      return a**2
Similarly, we could profile the add_1000_times function to figure out which line really takes all the time:
%lprun -f add_1000_times calculate_my_value(1, 2)

Timer unit: 1e-06 s

Total time: 0.829793 s
File: <ipython-input-1-0c3fada21717>
Function: add_1000_times at line 7

Line #      Hits         Time  Per Hit   % Time  Line Contents
     7                                           def add_1000_times(x, y):
     8         1          1.0      1.0      0.0      z = 0
     9      1001        382.0      0.4      0.0      for i in range(1000):
    10      1000        423.0      0.4      0.1          z += x
    11   1001000     388197.0      0.4     46.8          for j in range(1000):
    12   1000000     440788.0      0.4     53.1              z += y
    14         1          2.0      2.0      0.0      return z
The profiler is really useful, I use it all the time to optimize my code to remove some of the inefficient code. Hope you will find it useful as well.

Friday, June 8, 2018

Turtle graphics

Turtle graphics is a popular way of introducing programming to kids. It was part of the original Logo programming language developed by Wally Feurzig and Seymour Papert in 1966. You can find more information here. With the following lines, you can create this star from the example. 
from turtle import *
color('red', 'yellow')
while True:
    if abs(pos()) < 1:
With some changes, we could make this nice figure, it is really cool to see how it draws this beautiful pattern. 
colors = ['red', 'orange', 'yellow', 'green', 'cyan', 'blue']
p = Pen()
for i in range(360):

Friday, May 25, 2018

Kernel density estimation (animation)

This week, we will briefly talk about kernel density estimation, which is a useful way to estimate the probability density function of a random variable. The idea is quite simple, let's start by showing you example of density estimation using a Gaussian kernel for 1D case. You can find all the code at Qingkai's Github
import numpy as np
import matplotlib.pyplot as plt'seaborn-poster')

%matplotlib inline

Generate a random 1D dataset

We first generate a bi-modal distribution using two Gaussians. 
X = np.concatenate((np.random.normal(loc = -5, scale= 2, size = 100), np.random.normal(loc = 10, scale= 3, size = 100)))
# let's shuffle the order of the data (this is most for the animation later)

plt.figure(figsize = (10,8))
plt.hist(X, bins = 15, normed=True)
plt.plot(X, np.zeros(len(X)), '*k')
plt.ylim((-0.01, 0.1))

Define gaussian kernels

We define the Gaussian kernels as the following function, and we generate a grid and plot an example of Gaussian kernel centered at 5. 
def gaussian_kernel(x, mu, sigma):
    return 1/(sigma * np.sqrt(2 * np.pi)) * np.exp( - (x - mu)**2 / (2 * sigma**2) )

# generate grid
min_v = np.min(X)
max_v = np.max(X)
grid = np.linspace(min_v, max_v, 100)

# plot out the example
plt.figure(figsize = (10,8))
plt.plot(grid, gaussian_kernel(grid, mu = 5, sigma = 1.0))
plt.title('Example of a kernel centered at 5')

Estimate the density using the kernels

The kernel density estimation is actually very simple, you can think that for each data point, we will center a Gaussian kernel at it, and then we just sum all the kernels, we will have the kernel density estimation. The following is an animation that shows this process, we add one kernel at a time (the grey curves), the red curve is our density estimation that sums up all the grey curves. Note that, the width of the kernel has an effect of the smoothness of the density estimation, see the following example. 
density_estimation = np.zeros(len(grid))

FFMpegWriter = manimation.writers['ffmpeg']
metadata = dict(title='Kernel density estimation', artist='Qingkai Kong',
                comment='This is for gaussian kernels')
writer = FFMpegWriter(fps=10, metadata=metadata)
fig = plt.figure(figsize = (10,8))

line1, = plt.plot(grid, density_estimation, 'r-', label = 'Sum of all kernels')
plt.ylim(-0.01, 0.1)

with writer.saving(fig, "Kernel_density_estimation.mp4", 100):

    for x in X:
        plt.plot(x, 0, '*k')

        kernel = gaussian_kernel(grid, mu = x, sigma = 1.0)
        density_estimation += kernel / len(X)
        plt.plot(grid, kernel / 8, 'k', alpha = 0.1)
        line1.set_data(grid, density_estimation)
Let's see the effect on the density estimation using a wider kernel. 

Kernel density estimation using Scipy

The above code is just to show you the basic idea of the kernel density estimation. In reality, we don't need to write the code by ourselves, we could use the existing packages, either scipy or sklearn. Here we show the example using scipy. We could see that the derived density estimation is very similar to what we got above. 
from scipy import stats
kernel = stats.gaussian_kde(X, 0.1)
plt.figure(figsize = (10,8))
plt.plot(grid,kernel.evaluate(grid), 'r-')

2D density estimation

Besides, we could expand this density estimation into 2 dimensions. The idea is still the same, instead of using the 1D Gaussian kernel to add up to the density estimation, now we use the 2D Gaussian kernels to do the estimation. Let's see the example using scipy for this 2D case. 
# generate data
X = np.concatenate((np.random.normal(loc = -5, scale= 2, size = (100,2)), np.random.normal(loc = 10, scale= 3, size = (100,2))))

# get the mesh
m1, m2 = X[:, 0], X[:, 1]
xmin = m1.min()
xmax = m1.max()
ymin = m2.min()
ymax = m2.max()

# get the density estimation 
X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([X.ravel(), Y.ravel()])
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
Z = np.reshape(kernel(positions).T, X.shape)

# plot the result

fig, ax = plt.subplots(figsize = (10,8))

           extent=[xmin, xmax, ymin, ymax])

ax.plot(m1, m2, 'k.', markersize=5)
ax.set_xlim([xmin, xmax])
ax.set_ylim([ymin, ymax])
plt.xlabel('X values')
plt.ylabel('y values')