Saturday, July 22, 2017

Machine learning 16: Using scikit-learn Part 4 - Unsupervised learning

The material is based on my workshop at Berkeley - Machine learning with scikit-learn. I converted it here so that there is more explanation. Note that the code is written using Python 3.6. It is better to read my slides first, which you can find here. You can find the notebook on Qingkai's Github. 
This week, we will talk about how to use scikit-learn for unsupervised learning, with one example of dimensionality reduction and one of clustering. 

Unsupervised learning

Unsupervised Learning addresses a different sort of problem. Here the data has no labels, and we are interested in finding similarities between the objects in question. In a sense, you can think of unsupervised learning as a means of discovering labels from the data itself. 
Unsupervised learning comprises tasks such as dimensionality reduction, clustering, and density estimation. For example, in the iris data discussed before, we can use unsupervised methods to determine combinations of the measurements which best display the structure of the data. As we'll see below, such a projection of the data can be used to visualize the four-dimensional dataset in two dimensions. 

Dimensionality reduction with PCA

Principal Component Analysis (PCA) is a dimensionality reduction technique that can find the combinations of variables that explain the most variance. Consider the iris dataset: it cannot be visualized in a single 2D plot, as it has 4 features. We are going to extract 2 combinations of sepal and petal dimensions to visualize it:
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('seaborn-poster')
%matplotlib inline
iris = datasets.load_iris()
X, y = iris.data, iris.target
print("The dataset shape:", X.shape)
The dataset shape: (150, 4)
Using PCA, we can reduce the dimensions from 4 to 2 and visualize the data. 
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X)
X_reduced = pca.transform(X)
print("Reduced dataset shape:", X_reduced.shape)
Reduced dataset shape: (150, 2)
plt.figure(figsize=(10,8))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='RdYlBu')
plt.xlabel('First component')
plt.ylabel('Second component')
[Figure: the Iris data projected onto the first two principal components, colored by species]
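To see which combinations of the original sepal and petal measurements the two components correspond to, we can inspect the fitted PCA object (a quick sketch using the pca object fitted above; components_ and explained_variance_ratio_ are standard attributes of scikit-learn's PCA):
# each row of components_ is one principal component, written as
# weights on the 4 original features
print("Components:\n", pca.components_)

# fraction of the total variance captured by each component
print("Explained variance ratio:", pca.explained_variance_ratio_)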

Clustering with K-means

K Means is an algorithm for unsupervised clustering: that is, finding clusters in data based on the data attributes alone (not the labels).
K Means is a relatively easy-to-understand algorithm. It searches for cluster centers which are the mean of the points within them, such that every point is closest to the cluster center it is assigned to.
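As a rough illustration of that idea (not the implementation scikit-learn uses), a bare-bones version of the algorithm simply alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points. A minimal NumPy sketch, assuming no cluster ever ends up empty, might look like this:
import numpy as np

def simple_kmeans(X, k, n_iter=10, seed=0):
    rng = np.random.RandomState(seed)
    # start from k randomly chosen data points as the initial centers
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assignment step: index of the nearest center for every point
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # update step: move each center to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers
scikit-learn's KMeans adds a smarter initialization (k-means++), convergence checks, and multiple restarts on top of this basic loop.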
Let's look at how KMeans operates on the dataset we looked at previously - the Iris dataset. To emphasize that this is unsupervised, we will color the points by the predicted cluster labels rather than the true species labels:

Train K-means

from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=2)
k_means.fit(X)
y_pred = k_means.predict(X)

plt.figure(figsize=(10,8))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_pred, cmap='RdYlBu')
plt.xlabel('First component')
plt.ylabel('Second component')
[Figure: K-means cluster assignments plotted on the first two principal components]
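Although K-means never sees the species labels, we can check afterwards how well the clusters line up with them, for example with the adjusted Rand index (the cluster numbers themselves are arbitrary, so we compare groupings rather than raw label values):
from sklearn.metrics import adjusted_rand_score

# 1.0 means the clusters match the species perfectly; 0 is roughly what
# a random grouping would score
print("Adjusted Rand index:", adjusted_rand_score(y, y_pred))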

Exercise

Visualization is just one use of PCA. Sometimes we have high-dimensional data and want to use PCA to reduce the dimensionality while keeping a certain amount of the information in the PCA-transformed data. 
In this exercise, please use PCA on the Iris data and keep the components that explain 95% of the variance of the original data. 
############################### Solution 1 #################################

# fit a PCA model
pca = PCA().fit(X)

plt.figure(figsize=(10,8))
plt.plot(range(1, X.shape[1]+1), np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

# we can see that the first two components explain over 95% of the variance,
# so we can train PCA with only 2 components
# fit a PCA model
pca = PCA(n_components = 2).fit(X)
X_reduced = pca.transform(X)
[Figure: cumulative explained variance versus number of components]
############################### Solution 2 #################################
## Or we can simply use n_components = 0.95
pca = PCA(n_components = 0.95).fit(X)
X_reduced = pca.transform(X)
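With either solution, we can verify how many components were kept and how much variance they capture (n_components_ and explained_variance_ratio_ are standard attributes of the fitted PCA object):
print("Number of components kept:", pca.n_components_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())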

Saturday, July 15, 2017

Machine learning 15: Using scikit-learn Part 3 - Regression

The material is based on my workshop at Berkeley - Machine learning with scikit-learn. I converted it here so that there is more explanation. Note that the code is written using Python 3.6. It is better to read my slides first, which you can find here. You can find the notebook on Qingkai's Github. 
This week, we will talk about how to use scikit-learn for regression problems. Instead of simple linear regression, we will work on a regression problem with a non-linear dataset that we generate ourselves.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
plt.style.use('seaborn-poster')
%matplotlib inline

Generate data

Let's first generate a toy dataset that we will fit with a Random Forest model. We generate a periodic dataset using two sine waves with different periods and then add some noise to it. It can be visualized in the following figure:
np.random.seed(0)
x = 10 * np.random.rand(100)

def model(x, sigma=0.3):
    fast_oscillation = np.sin(5 * x)
    slow_oscillation = np.sin(0.5 * x)
    noise = sigma * np.random.randn(len(x))

    return slow_oscillation + fast_oscillation + noise

plt.figure(figsize = (12,10))
y = model(x)
plt.errorbar(x, y, 0.3, fmt='o')
[Figure: the generated noisy data plotted with error bars]

Fit a Random Forest Model

We will use a random forest, a method based on decision trees. The idea is actually very simple: looking at the blue line in the figure below, the model just asks questions like "if my data is between 0.5 and 3.3, then my target value will be 0.7". We can think of the regression line as being made up of many flat segments, which is why we see many step-like lines in the graph. 
[Figure: a step-like (piecewise-constant) regression line produced by a decision tree]
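To make the step-like behaviour concrete, here is a small sketch (not part of the original workshop code) that fits a single decision tree to the same data; its prediction is exactly this kind of piecewise-constant curve:
from sklearn.tree import DecisionTreeRegressor

xgrid = np.linspace(0, 10, 1000)

# a single shallow tree: the prediction is constant within each leaf,
# which is what produces the flat segments and steps
tree = DecisionTreeRegressor(max_depth=3, random_state=42)
ytree = tree.fit(x[:, None], y).predict(xgrid[:, None])

plt.figure(figsize=(12, 10))
plt.errorbar(x, y, 0.3, fmt='o')
plt.plot(xgrid, ytree, '-r', label='single tree (max_depth=3)')
plt.legend()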
Below, we fit a random forest model with 100 trees (the more trees we use, the more flexible the model is, so that it can capture the wiggly parts), with all the other parameters left at their defaults. 
xfit = np.linspace(0, 10, 1000)

# fit the model and get the estimation for each data points
yfit = RandomForestRegressor(100, random_state=42).fit(x[:, None], y).predict(xfit[:, None])
ytrue = model(xfit, 0)

plt.figure(figsize = (12,10))
plt.errorbar(x, y, 0.3, fmt='o')
plt.plot(xfit, yfit, '-r', label = 'predicted', zorder = 10)
plt.plot(xfit, ytrue, '-k', alpha=0.5, label = 'true model', zorder = 10)
plt.legend()
[Figure: random forest prediction (red) and true model (black) over the noisy data]
Print out the misfit using the mean squared error.
mse = mean_squared_error(ytrue, yfit)
print(mse)
0.0869576380256
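To see the effect of the number of trees mentioned above, we can refit the forest with different values of n_estimators and compare the misfit against the true model (a quick sketch; the exact numbers depend on the data and the random seed):
for n_trees in [1, 10, 100, 500]:
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=42)
    yfit_n = forest.fit(x[:, None], y).predict(xfit[:, None])
    print(n_trees, 'trees, MSE:', mean_squared_error(ytrue, yfit_n))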

Using ANN

We can also use an ANN for regression; the difference is in the activation function of the output layer. Instead of using functions like tanh or sigmoid to squeeze the results into a bounded range, we use a linear (identity) activation so the network can output any value. In scikit-learn's MLPRegressor the output activation is always the identity; the activation parameter below only affects the hidden layers. 
from sklearn.neural_network import MLPRegressor
mlp = MLPRegressor(hidden_layer_sizes=(200,200,200), max_iter = 2000, solver='lbfgs', \
                   alpha=0.01, activation = 'tanh', random_state = 8)

yfit = mlp.fit(x[:, None], y).predict(xfit[:, None])

plt.figure(figsize = (12,10))
plt.errorbar(x, y, 0.3, fmt='o')
plt.plot(xfit, yfit, '-r', label = 'predicted', zorder = 10)
plt.plot(xfit, ytrue, '-k', alpha=0.5, label = 'true model', zorder = 10)
plt.legend()
[Figure: MLP prediction (red) and true model (black) over the noisy data]
mse = mean_squared_error(ytrue, yfit)
print(mse)
0.161981739823

Using Support Vector Machine

The Support Vector Machine method we talked about in the previous notebook can also be used for regression. Instead of importing the SVC classifier, we import SVR for regression problems. The API is quite similar to what we introduced before; here is a quick regression using SVR. 
from sklearn.svm import SVR

# define your model
svr = SVR(C=1000)

# get the estimation from the model
yfit = svr.fit(x[:, None], y).predict(xfit[:, None])

# plot the results as above
plt.figure(figsize = (12,10))
plt.errorbar(x, y, 0.3, fmt='o')
plt.plot(xfit, yfit, '-r', label = 'predicted', zorder = 10)
plt.plot(xfit, ytrue, '-k', alpha=0.5, label = 'true model', zorder = 10)
plt.legend()
[Figure: SVR prediction (red) and true model (black) over the noisy data]
mse = mean_squared_error(ytrue, yfit)
print(mse)
0.0289468172074

References