Abstract:
|
As the size of datasets become massive, many commonly-used clustering algorithms (e.g. k-means clustering, hierarchical agglomerative clustering (HAC), etc.) require prohibitive computational cost and memory. Threshold clustering (TC) is a recently-developed efficient clustering method designed so that, given a pre-specified threshold t, each cluster contains at least t units and the largest within-cluster dissimilarity is small. We introduce hybridized threshold clustering (HTC), in which TC is used as a pre-processing step to aid in the scalability and robustness of popular clustering algorithms. TC is initially used to partition all n units into small groups. Prototype units are then formed by averaging units within each group, and the desired clustering algorithm can then applied to the prototype dataset. Each individual unit can be assigned to the cluster corresponding to their prototype, thereby obtaining a clustering on all n units. We give simulation results that show that HTC combined with k-means or HAC reduces the run time and memory usage of the original clustering algorithms while still preserving their performance.
|