Understanding Distance Metrics in Hierarchical Clustering: A Comparative Study

Raghda Al taei
3 min readOct 17, 2024

--

Made by the Author

Hierarchical clustering is a powerful method for grouping data based on their similarities. The choice of distance metric significantly influences the clustering results. Among the most common metrics used to calculate the distance between clusters are Single Linkage, Complete Linkage, and Average Linkage. This article will compare these three metrics in terms of time complexity and sensitivity to outliers, along with a brief explanation of each method.

Single Linkage

Definition: Single Linkage (also known as Minimum Linkage) measures the shortest distance between two clusters. Specifically, it considers the minimum distance between any pair of points from each cluster.

Time Complexity

The time complexity of Single Linkage is O(n²)for computing the distance matrix. This is because it requires calculating the distances between all pairs of data points, leading to O(n²) computations. The overall complexity for clustering can reach O(n³) due to the hierarchical merging process.

Sensitivity to Outliers

Single Linkage is sensitive to outliers. Since it uses the minimum distance to merge clusters, the presence of an outlier can significantly affect the resulting cluster structure. An outlier may get linked to a nearby cluster, potentially distorting the true cluster shape.

Complete Linkage

Definition: Complete Linkage (or Maximum Linkage) measures the maximum distance between any two points in the clusters. It uses the farthest pair of points to determine the distance between two clusters.

Time Complexity

The time complexity for Complete Linkage is alsoO(n²)for the initial distance calculation. However, the merging process leads to an overall complexity of O(n³), similar to Single Linkage. The need to compute all pairwise distances for every cluster merge contributes to this complexity.

Sensitivity to Outliers

Complete Linkage is less sensitive to outliers compared to Single Linkage. Since it focuses on the maximum distance between clusters, an outlier’s effect on the overall distance measurement is diminished. This often results in more compact and well-defined clusters, making Complete Linkage more robust in the presence of outliers.

Average Linkage

Definition: Average Linkage computes the average distance between all pairs of points in two clusters. It provides a compromise between Single and Complete Linkage by considering all pairwise distances rather than just the extremes.

Time Complexity

Average Linkage has a time complexity of O(n²) for the initial distance calculation. The merging process also leads to an overall complexity of O(n³). However, due to its averaging nature, it might perform slightly better in practice than Single or Complete Linkage because it can converge to solutions faster under certain conditions.

Sensitivity to Outliers

Average Linkage is moderately sensitive to outliers. While it considers more points than Single Linkage, the averaging can still be skewed by extreme values. However, it tends to be more robust than Single Linkage and offers a better balance between sensitivity and clustering quality.

Summary Comparison

Conclusion

Choosing the appropriate distance metric for hierarchical clustering is critical, especially in datasets that may contain outliers. Single Linkage can create elongated and skewed clusters and is highly influenced by outliers. In contrast, Complete Linkage offers a more robust clustering solution, while Average Linkage strikes a balance between the two. Understanding these differences allows data scientists to select the most suitable approach based on the nature of their data and the specific clustering goals.

--

--

Raghda Al taei
Raghda Al taei

Written by Raghda Al taei

Data Scientist/Analyst Specialist | Johns Hopkins University Master’s Degree in Computer Engineering (AI) | Amirkabir University

No responses yet