Hierarchical Clustering Distance Metrics: A Step-by-Step Numerical Example

Raghda (Merry) Al taei
4 min readOct 17, 2024

--

made by the author

Hierarchical clustering is a popular unsupervised learning technique that creates a hierarchy of clusters. This method is particularly effective for exploratory data analysis. In this article, we will explore three distance metrics used in hierarchical clustering: Single Linkage, Complete Linkage, and Average Linkage. We’ll go through a detailed numerical example, illustrating each method’s calculations step by step.

The Dataset

Let’s consider a simple dataset containing five points in a 2D space:

Made by Author

Step 1: Calculate Pairwise Distances

We will calculate the Euclidean distance between each pair of points using the formula:

Here are the calculated distances:

Step 2: Hierarchical Clustering

1. Single Linkage

Definition: The distance between two clusters is defined as the minimum distance between points in the two clusters.

Merging Process:

  1. Merge A and B (distance = 1.41).
  2. New clusters: {AB}, C, D, E. Update the distance matrix.
  3. Calculate distances from cluster {AB} to remaining points:
  • d(AB,C)=min⁡(d(A,C),d(B,C))=min⁡(2.24,2.24)=2.24
  • d(AB,D)=min⁡(d(A,D),d(B,D))=min⁡(4.47,3.16)=3.16
  • d(AB,E)=min⁡(d(A,E),d(B,E))=min⁡(5.83,4.47)=4.47

4-Merge {AB} and C (distance = 2.24).

5-New clusters: {ABC}, D, E. Update distances:

  • d(ABC,D)=min⁡(d(AB,D),d(C,D))=min⁡(3.16,3.61)=3.16
  • d(ABC,E)=min⁡(d(AB,E),d(C,E))=min⁡(4.47,5.00)=4.47

6-Merge D and E (distance = 1.41).

7- New clusters: {ABC}, {DE}.

8- Merge {ABC} and {DE} (distance = max distance between D and E).

Final Cluster Structure: {A, B, C}, {D, E}

Made by author

2. Complete Linkage

Definition: The distance between two clusters is defined as the maximum distance between points in the two clusters.

Merging Process:

  1. Merge A and B (distance = 1.41).
  2. New clusters: {AB}, C, D, E. Update distances.
  3. Calculate distances from cluster {AB} to remaining points:
  • d(AB,C)=max⁡(d(A,C),d(B,C))=max⁡(2.24,2.24)=2.24
  • d(AB,D)=max⁡(d(A,D),d(B,D))=max⁡(4.47,3.16)=4.47
  • d(AB,E)=max⁡(d(A,E),d(B,E))=max⁡(5.83,4.47)=5.83

4- Merge {AB} and C (distance = 2.24).

5- New clusters: {ABC}, D, E. Update distances:

  • d(ABC,D)=max⁡(d(AB,D),d(C,D))=max⁡(4.47,3.61)=4.47
  • d(ABC,E)=max⁡(d(AB,E),d(C,E))=max⁡(5.83,5.00)=5.83

6-Merge D and E (distance = 1.41).

7- New clusters: {ABC}, {DE}.

8- Merge {ABC} and {DE} (distance = max distance between D and E).

Final Cluster Structure: {A, B, C}, {D, E}

Made by author

3. Average Linkage

Definition: The distance between two clusters is defined as the average distance between points in the two clusters.

Merging Process:

  1. Merge A and B (distance = 1.41).
  2. New clusters: {AB}, C, D, E. Update distances.
  3. Calculate distances from cluster {AB} to remaining points:

4- Merge {AB} and C (distance = 2.24).

5-New clusters: {ABC}, D, E. Update distances:

6- Merge D and E (distance = 1.41).

7- New clusters: {ABC}, {DE}.

8- Merge {ABC} and {DE} (distance = average of distances).

Final Cluster Structure: {A, B, C}, {D, E}

Made by author

Conclusion

In this detailed numerical example, we illustrated how different distance metrics affect the hierarchical clustering process. While the cluster formations may be similar with a simple dataset, the choice of linkage method can significantly impact the results, especially in more complex datasets. Understanding these metrics helps practitioners make informed decisions when selecting hierarchical clustering methods for their specific needs.

--

--

Raghda (Merry) Al taei
Raghda (Merry) Al taei

Written by Raghda (Merry) Al taei

I am a Data Scientist/Analyst with a master's degree in computer engineering (AI) from AmirKabir University.

No responses yet