Hierarchical Clustering Distance Metrics: A Step-by-Step Numerical Example
Hierarchical clustering is a popular unsupervised learning technique that creates a hierarchy of clusters. This method is particularly effective for exploratory data analysis. In this article, we will explore three distance metrics used in hierarchical clustering: Single Linkage, Complete Linkage, and Average Linkage. We’ll go through a detailed numerical example, illustrating each method’s calculations step by step.
The Dataset
Let’s consider a simple dataset containing five points in a 2D space:
Step 1: Calculate Pairwise Distances
We will calculate the Euclidean distance between each pair of points using the formula:
Here are the calculated distances:
Step 2: Hierarchical Clustering
1. Single Linkage
Definition: The distance between two clusters is defined as the minimum distance between points in the two clusters.
Merging Process:
- Merge A and B (distance = 1.41).
- New clusters: {AB}, C, D, E. Update the distance matrix.
- Calculate distances from cluster {AB} to remaining points:
- d(AB,C)=min(d(A,C),d(B,C))=min(2.24,2.24)=2.24
- d(AB,D)=min(d(A,D),d(B,D))=min(4.47,3.16)=3.16
- d(AB,E)=min(d(A,E),d(B,E))=min(5.83,4.47)=4.47
4-Merge {AB} and C (distance = 2.24).
5-New clusters: {ABC}, D, E. Update distances:
- d(ABC,D)=min(d(AB,D),d(C,D))=min(3.16,3.61)=3.16
- d(ABC,E)=min(d(AB,E),d(C,E))=min(4.47,5.00)=4.47
6-Merge D and E (distance = 1.41).
7- New clusters: {ABC}, {DE}.
8- Merge {ABC} and {DE} (distance = max distance between D and E).
Final Cluster Structure: {A, B, C}, {D, E}
2. Complete Linkage
Definition: The distance between two clusters is defined as the maximum distance between points in the two clusters.
Merging Process:
- Merge A and B (distance = 1.41).
- New clusters: {AB}, C, D, E. Update distances.
- Calculate distances from cluster {AB} to remaining points:
- d(AB,C)=max(d(A,C),d(B,C))=max(2.24,2.24)=2.24
- d(AB,D)=max(d(A,D),d(B,D))=max(4.47,3.16)=4.47
- d(AB,E)=max(d(A,E),d(B,E))=max(5.83,4.47)=5.83
4- Merge {AB} and C (distance = 2.24).
5- New clusters: {ABC}, D, E. Update distances:
- d(ABC,D)=max(d(AB,D),d(C,D))=max(4.47,3.61)=4.47
- d(ABC,E)=max(d(AB,E),d(C,E))=max(5.83,5.00)=5.83
6-Merge D and E (distance = 1.41).
7- New clusters: {ABC}, {DE}.
8- Merge {ABC} and {DE} (distance = max distance between D and E).
Final Cluster Structure: {A, B, C}, {D, E}
3. Average Linkage
Definition: The distance between two clusters is defined as the average distance between points in the two clusters.
Merging Process:
- Merge A and B (distance = 1.41).
- New clusters: {AB}, C, D, E. Update distances.
- Calculate distances from cluster {AB} to remaining points:
4- Merge {AB} and C (distance = 2.24).
5-New clusters: {ABC}, D, E. Update distances:
6- Merge D and E (distance = 1.41).
7- New clusters: {ABC}, {DE}.
8- Merge {ABC} and {DE} (distance = average of distances).
Final Cluster Structure: {A, B, C}, {D, E}
Conclusion
In this detailed numerical example, we illustrated how different distance metrics affect the hierarchical clustering process. While the cluster formations may be similar with a simple dataset, the choice of linkage method can significantly impact the results, especially in more complex datasets. Understanding these metrics helps practitioners make informed decisions when selecting hierarchical clustering methods for their specific needs.