Understand the Math Behind DBSCAN

Raghda Al taei
4 min readOct 19, 2024

--

BDSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm that groups data points based on their density. It works by defining clusters as dense regions separated by areas of lower density, which allows it to discover clusters of arbitrary shapes and handle noise effectively. Here, we’ll explore the math behind DBSCAN with a step-by-step manual example.

Parameters Recap:

  • eps: The maximum radius of a neighborhood around a point (i.e., how close points should be to each other to be considered neighbors). Here, eps = 2.0.
  • min_samples: The minimum number of points needed to form a dense region (including the point itself). Here, min_samples = 3.

Dataset:

We’ll work with the following set of points:

By the author
mathematicacodePoint A: (1, 2)
Point B: (2, 2)
Point C: (2, 3)
Point D: (8, 7)
Point E: (8, 8)
Point F: (25, 80)
Point G: (1, 3)
Point H: (0, 2)

Step-by-Step Calculation:

Step 1: Calculate Distances Between Points

To determine the neighbors of each point, we first calculate the distance between every pair of points using the Euclidean distance formula:

Step 2: Compute the Distance Matrix

Using the Euclidean formula, we create a distance matrix that shows the distance between each pair of points:

Step 3: Determine Neighbors Within eps

A point is considered a neighbor if its distance from another point is less than or equal to eps = 2.0. Based on this rule, we determine the neighbors of each point:

Step 4: Determine Core Points (With min_samples=3):

  • A core point must have at least 3 neighbors (including itself):
  • Core points: A, B, C, G (each has at least 3 neighbors).
  • Border points: H (not enough neighbors but close to a core point).
  • Non-core points: D, E, F (too few neighbors).

Step 5: Expand Clusters from Core Points

Start with Point A:

  • Assign Cluster 1.
  • Expand from A: Neighbors are A, B, G, H.
  • Add B and G to Cluster 1.
  • H is a border point, associated with Cluster 1.

Move to Point B:

  • Already in Cluster 1.
  • Expand from B: Add C to Cluster 1.

Move to Point C:

  • Already in Cluster 1, no new neighbors to add.

Move to Point D:

  • D is not a core point (only 2 neighbors), so it remains noise.

Move to Point E:

  • E is also not a core point, so it remains noise.

Move to Point F:

  • F is isolated, so it is considered noise.

Step 6: Final Result

After expanding clusters, the final results are:

  • Cluster 1: A, B, C, G.
  • Border Point: H (associated with Cluster 1).
  • Noise Points: D, E, F.

Summary of Results:

By the author
Cluster 1: {A, B, C, G}
Border Point: {H}
Noise: {D, E, F}

Key Takeaways:

  • Core Points: These form the dense regions of a cluster and have enough neighboring points.
  • Border Points: Close to core points but do not meet the density requirement themselves.
  • Noise: Isolated points that do not belong to any cluster.

This example demonstrates how DBSCAN clusters points based on density and distance, making it effective for discovering clusters of arbitrary shape and handling outliers. With a basic understanding of eps and min_samples, you can apply DBSCAN to real-world datasets, adjusting these parameters to fit the data's distribution.

--

--

Raghda Al taei
Raghda Al taei

Written by Raghda Al taei

Data Scientist/Analyst Specialist | Johns Hopkins University Master’s Degree in Computer Engineering (AI) | Amirkabir University

No responses yet