Online Program Home
My Program

Abstract Details

Activity Number: 528 - Analysis of Big Data
Type: Contributed
Date/Time: Wednesday, August 1, 2018 : 10:30 AM to 12:20 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #329481 Presentation
Title: Hybridized Threshold Clustering for Massive Data
Author(s): Jianmei Luo* and Michael Higgins and William Henry Hsu and ChandraVyas Annakula and Aruna Sai Kannamareddy
Companies: KANSAS STATE UNIVERSITY and KANSAS STATE UNIVERSITY and KANSAS STATE UNIVERSITY and KANSAS STATE UNIVERSITY and KANSAS STATE UNIVERSITY
Keywords: Hybridized clustering; Threshold clustering; Prediction; Massive data
Abstract:

As the size of datasets become massive, many commonly-used clustering algorithms (e.g. k-means clustering, hierarchical agglomerative clustering (HAC), etc.) require prohibitive computational cost and memory. Threshold clustering (TC) is a recently-developed efficient clustering method designed so that, given a pre-specified threshold t, each cluster contains at least t units and the largest within-cluster dissimilarity is small. We introduce hybridized threshold clustering (HTC), in which TC is used as a pre-processing step to aid in the scalability and robustness of popular clustering algorithms. TC is initially used to partition all n units into small groups. Prototype units are then formed by averaging units within each group, and the desired clustering algorithm can then applied to the prototype dataset. Each individual unit can be assigned to the cluster corresponding to their prototype, thereby obtaining a clustering on all n units. We give simulation results that show that HTC combined with k-means or HAC reduces the run time and memory usage of the original clustering algorithms while still preserving their performance.


Authors who are presenting talks have a * after their name.

Back to the full JSM 2018 program