DBSCAN
Summarized notes from Introduction to Data Mining (Int'l ed), Chapter 5, section 4
Unsupervised learning > clustering
- density-based clustering locates regions of high density that are separated from one another by regions of low density
- DBSCAN takes a center-based approach to density: count the number of points within a radius \(Eps\) of a selected point
- data points are classified into 3 categories (see the sketch after this list):
  - core points: points in the interior of a density-based cluster; a point is a core point if there are at least \(MinPts\) points within the specified distance \(Eps\) of it
  - border points: non-core points that fall within the neighborhood of a core point; a border point can fall within the neighborhoods of several core points
  - noise points: points that are neither core nor border points
- DBSCAN produces a partial clustering: noise points are not assigned to any cluster
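A minimal sketch of this labeling with a brute-force neighborhood count (NumPy assumed; the function name `label_points` and the exact array handling are illustrative, not from the book):

```python
import numpy as np

def label_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border', or 'noise'.

    eps     -- neighborhood radius (Eps)
    min_pts -- minimum neighborhood size for a core point (MinPts)
    """
    m = len(X)
    # Pairwise Euclidean distances, brute force: O(m^2) time and space.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    in_eps = dists <= eps                  # each point's Eps-neighborhood (includes itself)
    is_core = in_eps.sum(axis=1) >= min_pts

    labels = np.full(m, "noise", dtype=object)
    labels[is_core] = "core"
    for i in range(m):
        # Border points: not core, but inside the Eps-neighborhood of some core point.
        if not is_core[i] and np.any(in_eps[i] & is_core):
            labels[i] = "border"
    return labels
```

Whether the \(Eps\)-neighborhood counts the point itself is a convention; the counts only shift by one either way.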
Figure: categories of points (core, border, and noise)
Pseudo code
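The pseudocode listing itself did not survive the export, so below is a hedged sketch of the usual formulation (find core points, grow a cluster from each unassigned core point, attach reachable border points, leave noise unassigned); it assumes Euclidean distance and brute-force neighbor lookups, and the names `dbscan`, `eps`, `min_pts` are mine rather than the book's.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Return a cluster id per point; -1 marks noise."""
    m = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    in_eps = dists <= eps                          # Eps-neighborhoods, including self
    is_core = in_eps.sum(axis=1) >= min_pts

    labels = np.full(m, -1)                        # -1 = noise / not yet assigned
    cluster_id = 0
    for i in range(m):
        if not is_core[i] or labels[i] != -1:
            continue                               # start clusters only from fresh core points
        labels[i] = cluster_id
        frontier = [i]
        while frontier:                            # expand through density-reachable points
            p = frontier.pop()
            for q in np.flatnonzero(in_eps[p]):
                if labels[q] == -1:
                    labels[q] = cluster_id         # core or border point joins the cluster
                    if is_core[q]:
                        frontier.append(q)         # only core points keep expanding
        cluster_id += 1
    return labels
```

For example, `dbscan(X, eps=0.3, min_pts=4)` on an `(m, 2)` array returns an integer array in which the points labelled -1 are the noise points of the partial clustering mentioned above.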
Choosing parameters
- for choosing \(Eps\) and \(MinPts\), the typical approach is to look at the behavior of the distance from a point to its \(k^{th}\) nearest neighbor (the k-dist)
  - for points that belong to a cluster, this distance is small as long as \(k\) is not larger than the cluster size
- compute the k-dist for all data points for some \(k\), sort the values in increasing order, then plot the sorted values (see the sketch after this list)
  - the distance increases sharply at a suitable value of \(Eps\)
  - use that \(k\) as \(MinPts\)
- if \(k\) is too small, even a small number of noise points can form a cluster
- if \(k\) is too large, small clusters of size < \(k\) are likely to be labelled as noise
- the original DBSCAN algorithm used \(k = 4\), which is appropriate for many 2D data sets
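A sketch of that k-dist procedure, assuming NumPy and matplotlib are available; `k_dist_plot` and its default of `k=4` are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

def k_dist_plot(X, k=4):
    """Plot the sorted distances from every point to its k-th nearest neighbor.

    The distance at which the curve bends sharply upward is a candidate Eps,
    and the chosen k is then used as MinPts.
    """
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dists.sort(axis=1)                    # row-wise ascending; column 0 is the self-distance 0
    k_dist = np.sort(dists[:, k])         # distance to the k-th nearest neighbor, self excluded
    plt.plot(k_dist)
    plt.xlabel("points sorted by k-dist")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.show()
```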
Analysis
Complexity
\(m\) = number of data points; \(t\) = time to find the points in the \(Eps\)-neighborhood of a point

| | Complexity |
| --- | --- |
| Time | \(O(m \times t)\) |
| Time, best case (with a spatial index such as a kd-tree, in low dimensions) | \(O(m \log m)\) |
| Time, worst case | \(O(m^2)\) |
| Space | \(O(m)\) |
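The factor \(t\) depends on how the \(Eps\)-neighborhood queries are answered; as a sketch (assuming SciPy is available), a kd-tree can replace the brute-force scan on low-dimensional data, which is where the \(O(m \log m)\) best case comes from:

```python
import numpy as np
from scipy.spatial import cKDTree

def eps_neighborhoods(X, eps):
    """Return, for every point, the indices of all points within eps of it.

    In low dimensions each kd-tree range query is far cheaper than the
    O(m) brute-force scan, which is what makes the O(m log m) case possible.
    """
    tree = cKDTree(X)
    return tree.query_ball_point(X, r=eps)   # one list of neighbor indices per point
```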
Advantages
- resistant to noise
- can handle clusters of varying shapes and sizes => can find clusters that K-means cannot discover
Limitations
- the algorithm has trouble when cluster densities vary widely
- high dimensionality: it is difficult to define density for such data
- computation can become expensive when the dimensionality is high