Qingkai's Blog: Clustering with DBSCAN

I am currently checking out a clustering algorithm: DBSCAN (Density-Based Spatial Clustering of Application with Noise). As the name suggested, it is a density based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), and marks points as outliers if they lie alone in low-density regions. It has many advantages, including no need specify the number of clusters, can find arbitrary shaped clusters, relatively fast, etc. Of course, there's no single algorithm can do everything, DBSCAN has disadvantage as well.

The algorithm has two parameters: epsilon and min_samples, the advantage of the algorithm is that you don’t have to specify how many clusters you need, it can find all the clusters that satisfy the requirement. For the disadvantage, it is very sensitive to the parameter you choose.

The summary of this algorithm is:
Step 1: For each point in the dataset, we draw a n-dimensional sphere of radius epsilon around the point (if you have n-dimensional data).
Step 2: If the number of points inside the sphere is larger than min_samples, we set the center of the sphere as a cluster, and all the points within the sphere are belong to this cluster.
Step 3: Loop through all the points within the sphere with the above 2 steps, and expand the cluster whenever it satisfy the 2 rules.
Step 4: For the points not belong to any cluster, you can ignore them, or treat them as outliers.

The original paper about DBSCAN was published 10 years ago in 1996, and can be found here. It is currently ranked the 41st place in the most cited publication in data mining. If you find the paper is too heavy on defining different points, you can check this very nice video on youtube shows how this works: Here. Scikit learn already has a very nice example to show the effectiveness of the algorithm. You can find the example here.

In the following, we will cluster the Loma Prieta earthquake aftershocks using DBSCAN. You can find the code at Qingkai's Github.

Grab the earthquake data

%pylab inline  
import pandas as pd
import numpy as np
import folium
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

Earthquake data can be get from the ANSS catalog. For simplicity, I just download aftershock data above Magnitude 2.0 for one week after the Loma Prieta earthquake. All together, there are 1009 earthquakes in this region.

# Read in earthquake data
df = pd.read_csv('./LomaPrieta_aftershocks_1week_above_2.csv', skiprows = 7)

# Get the latitude and logitude of the earthquakes
coords = df.as_matrix(columns=['Latitude', 'Longitude'])

# Plot the location of the earthquakes
plt.figure(figsize = (12, 12))

m = Basemap(projection='merc', resolution='l', epsg = 4269, 
            llcrnrlon=-122.7,llcrnrlat=36.2, urcrnrlon=-120.8,urcrnrlat=37.5)

# plot the aftershock
x, y = m(coords[:, 1], coords[:, 0])
m.scatter(x,y,5,marker='o',color='b')
m.arcgisimage(service='World_Shaded_Relief', xpixels = 5000, verbose= False)
    
plt.show()

Cluster and see the results

We are now using the DBSCAN from the sklearn. I followed Geoff Boeing's blog to cluster the geo-spatial data using the metrics haversine. I choose the epsilon roughly 1.5 km, and the min_samples = 5. We can see that DBSCAN detected 9 clusters in different colors (note that the black dots are identified as outliers).

from sklearn.cluster import DBSCAN
from sklearn import metrics

kms_per_radian = 6371.0088
epsilon = 1.5 / kms_per_radian

# Run the DBSCAN from sklearn
db = DBSCAN(eps=epsilon, min_samples=5, algorithm='ball_tree', \
            metric='haversine').fit(np.radians(coords))

cluster_labels = db.labels_
n_clusters = len(set(cluster_labels))

# get the cluster
# cluster_labels = -1 means outliers
clusters = \
    pd.Series([coords[cluster_labels == n] for n in range(-1, n_clusters)])

import matplotlib.cm as cmx
import matplotlib.colors as colors

# define a helper function to get the colors for different clusters
def get_cmap(N):
    '''
    Returns a function that maps each index in 0, 1, ... N-1 to a distinct 
    RGB color.
    '''
    color_norm  = colors.Normalize(vmin=0, vmax=N-1)
    scalar_map = cmx.ScalarMappable(norm=color_norm, cmap='nipy_spectral') 
    def map_index_to_rgb_color(index):
        return scalar_map.to_rgba(index)
    return map_index_to_rgb_color

plt.figure(figsize = (12, 12))
m = Basemap(projection='merc', resolution='l', epsg = 4269, 
        llcrnrlon=-122.7,llcrnrlat=36.2, urcrnrlon=-120.8,urcrnrlat=37.5)

unique_label = np.unique(cluster_labels)

# get different color for different cluster
cmaps = get_cmap(n_clusters)

# plot different clusters on map, note that the black dots are 
# outliers that not belone to any cluster. 
for i, cluster in enumerate(clusters):
    lons_select = cluster[:, 1]
    lats_select = cluster[:, 0]
    x, y = m(lons_select, lats_select)
    m.scatter(x,y,5,marker='o',color=cmaps(i), zorder = 10)

m.arcgisimage(service='World_Shaded_Relief', xpixels = 5000, verbose= False)

plt.show()

2 comments:

AnonymousApril 28, 2021 at 7:19 AM
Five weeks ago my boyfriend broke up with me. It all started when i went to summer camp i was trying to contact him but it was not going through. So when I came back from camp I saw him with a young lady kissing in his bed room, I was frustrated and it gave me a sleepless night. I thought he will come back to apologies but he didn't come for almost three week i was really hurt but i thank Dr.Azuka for all he did i met Dr.Azuka during my search at the internet i decided to contact him on his email dr.azukasolutionhome@gmail.com he brought my boyfriend back to me just within 48 hours i am really happy. What’s app contact : +44 7520 636249‬
Vale Co XeniaJuly 19, 2024 at 6:51 AM
DBSCAN is a powerful clustering algorithm that excels at discovering clusters of arbitrary shape and handling noise in data. Unlike distance-based methods like K-Means, DBSCAN groups data points based on their density.

Machine Learning Projects For Final Year

How DBSCAN Works
Core Points: A point is considered a core point if it has at least min_samples points within a radius eps.
Border Points: A point is a border point if it is within the eps neighborhood of a core point but doesn't have enough points within its own eps neighborhood to be a core point.
Noise Points: Points that are neither core nor border points are considered noise or outliers.
Cluster Formation: DBSCAN expands clusters from core points by recursively including their density-connected neighbors.
Key Parameters

Deep Learning Projects for Final Year Students

eps (epsilon): The radius of the neighborhood to consider.
min_samples: The minimum number of points required to form a dense region.
Advantages of DBSCAN
Discovers clusters of arbitrary shape: DBSCAN can handle clusters with complex shapes, unlike K-Means which assumes spherical clusters.
Handles noise effectively: DBSCAN can identify and exclude noise points.
Does not require specifying the number of clusters: DBSCAN automatically determines the number of clusters based on data density.

python projects for final year students

Sunday, August 7, 2016

Clustering with DBSCAN

Grab the earthquake data

Cluster and see the results

2 comments: