Sunday, August 7, 2016

Clustering with DBSCAN

I am currently checking out a clustering algorithm: DBSCAN (Density-Based Spatial Clustering of Application with Noise). As the name suggested, it is a density based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), and marks points as outliers if they lie alone in low-density regions. It has many advantages, including no need specify the number of clusters, can find arbitrary shaped clusters, relatively fast, etc. Of course, there's no single algorithm can do everything, DBSCAN has disadvantage as well.
The algorithm has two parameters: epsilon and min_samples, the advantage of the algorithm is that you don’t have to specify how many clusters you need, it can find all the clusters that satisfy the requirement. For the disadvantage, it is very sensitive to the parameter you choose.
The summary of this algorithm is:
Step 1: For each point in the dataset, we draw a n-dimensional sphere of radius epsilon around the point (if you have n-dimensional data).
Step 2: If the number of points inside the sphere is larger than min_samples, we set the center of the sphere as a cluster, and all the points within the sphere are belong to this cluster.
Step 3: Loop through all the points within the sphere with the above 2 steps, and expand the cluster whenever it satisfy the 2 rules.
Step 4: For the points not belong to any cluster, you can ignore them, or treat them as outliers.
The original paper about DBSCAN was published 10 years ago in 1996, and can be found here. It is currently ranked the 41st place in the most cited publication in data mining. If you find the paper is too heavy on defining different points, you can check this very nice video on youtube shows how this works: Here. Scikit learn already has a very nice example to show the effectiveness of the algorithm. You can find the example here

In the following, we will cluster the Loma Prieta earthquake aftershocks using DBSCAN. You can find the code at Qingkai's Github

Grab the earthquake data

%pylab inline
import pandas as pd
import numpy as np
import folium
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
Earthquake data can be get from the ANSS catalog. For simplicity, I just download aftershock data above Magnitude 2.0 for one week after the Loma Prieta earthquake. All together, there are 1009 earthquakes in this region.
# Read in earthquake data
df = pd.read_csv('./LomaPrieta_aftershocks_1week_above_2.csv', skiprows = 7)

# Get the latitude and logitude of the earthquakes
coords = df.as_matrix(columns=['Latitude', 'Longitude'])
# Plot the location of the earthquakes
plt.figure(figsize = (12, 12))

m = Basemap(projection='merc', resolution='l', epsg = 4269,
llcrnrlon=-122.7,llcrnrlat=36.2, urcrnrlon=-120.8,urcrnrlat=37.5)

# plot the aftershock
x, y = m(coords[:, 1], coords[:, 0])
m.scatter(x,y,5,marker='o',color='b')
m.arcgisimage(service='World_Shaded_Relief', xpixels = 5000, verbose= False)

plt.show()

Cluster and see the results

We are now using the DBSCAN from the sklearn. I followed Geoff Boeing's blog to cluster the geo-spatial data using the metrics haversine. I choose the epsilon roughly 1.5 km, and the min_samples = 5. We can see that DBSCAN detected 9 clusters in different colors (note that the black dots are identified as outliers).
from sklearn.cluster import DBSCAN
from sklearn import metrics

# Run the DBSCAN from sklearn
db = DBSCAN(eps=epsilon, min_samples=5, algorithm='ball_tree', \

cluster_labels = db.labels_
n_clusters = len(set(cluster_labels))

# get the cluster
# cluster_labels = -1 means outliers
clusters = \
pd.Series([coords[cluster_labels == n] for n in range(-1, n_clusters)])
import matplotlib.cm as cmx
import matplotlib.colors as colors

# define a helper function to get the colors for different clusters
def get_cmap(N):
'''
Returns a function that maps each index in 0, 1, ... N-1 to a distinct
RGB color.
'''
color_norm  = colors.Normalize(vmin=0, vmax=N-1)
scalar_map = cmx.ScalarMappable(norm=color_norm, cmap='nipy_spectral')
def map_index_to_rgb_color(index):
return scalar_map.to_rgba(index)
return map_index_to_rgb_color
plt.figure(figsize = (12, 12))
m = Basemap(projection='merc', resolution='l', epsg = 4269,
llcrnrlon=-122.7,llcrnrlat=36.2, urcrnrlon=-120.8,urcrnrlat=37.5)

unique_label = np.unique(cluster_labels)

# get different color for different cluster
cmaps = get_cmap(n_clusters)

# plot different clusters on map, note that the black dots are
# outliers that not belone to any cluster.
for i, cluster in enumerate(clusters):
lons_select = cluster[:, 1]
lats_select = cluster[:, 0]
x, y = m(lons_select, lats_select)
m.scatter(x,y,5,marker='o',color=cmaps(i), zorder = 10)

m.arcgisimage(service='World_Shaded_Relief', xpixels = 5000, verbose= False)

plt.show()