Abstract:
Standard clustering builds models based on distance-based connectivity. We investigate an alternative, so-called pseudo-supervised approach to clustering. It optimizes over all possible partitions of the data, scoring each partition by the accuracy with which a machine learning (ML) algorithm, trained and tested on the data, separates the clusters (now viewed as fixed classes). This optimization, however, is very computationally intensive. We present an algorithm that hybridizes the pseudo-supervised approach with standard clustering, using a graph-based cluster model. We divide a large data set into n small clusters by standard clustering and then aggregate these n clusters into m (m < n) larger clusters. The aggregation uses a variant of the pseudo-supervised method: we compute a confusion matrix (using machine classifiers such as SVM and random forest) among the n classes obtained in the first clustering step and use it as the basis for graph clustering. We discuss this algorithm theoretically and apply it to classifying cancer data sets based on gene expression and spectral biomarker data.
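The two-stage procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation, and every function name in it is hypothetical. Several simplifications are assumed: k-means with a farthest-point initialization stands in for "standard clustering", a leave-one-out 1-nearest-neighbor classifier stands in for SVM or random forest, and greedy merging of the most-confused pair of groups stands in for graph clustering on the confusion matrix.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption): start at points[0], then add the farthest point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=20):
    """Step 1: plain Lloyd's k-means; returns one fine-cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def loo_confusion(points, labels, k):
    """Step 2: confusion counts of a leave-one-out 1-nearest-neighbor classifier
    (a stand-in for the SVM / random forest classifiers named in the abstract)."""
    conf = [[0] * k for _ in range(k)]
    for i, p in enumerate(points):
        nearest = min((j for j in range(len(points)) if j != i),
                      key=lambda j: math.dist(p, points[j]))
        conf[labels[i]][labels[nearest]] += 1  # row: true class, column: predicted
    return conf


def merge_by_confusion(conf, m):
    """Step 3: treat the confusion matrix as a weighted graph on the fine clusters
    and greedily merge the two groups with the highest mutual confusion rate."""
    groups = [{c} for c in range(len(conf))]

    def rate(a, b):
        cross = sum(conf[i][j] + conf[j][i] for i in a for j in b)
        total = sum(sum(conf[i]) for i in a | b)
        return cross / total if total else 0.0

    while len(groups) > m:
        _, a, b = max((rate(groups[a], groups[b]), a, b)
                      for a in range(len(groups)) for b in range(a + 1, len(groups)))
        merged = groups.pop(b)  # b > a, so index a is unaffected by the pop
        groups[a] |= merged
    return groups


def hybrid_cluster(points, n, m):
    """Split into n fine clusters, then aggregate them into m coarse clusters."""
    fine = kmeans(points, n)
    groups = merge_by_confusion(loo_confusion(points, fine, n), m)
    coarse_of = {c: g for g, members in enumerate(groups) for c in members}
    return [coarse_of[lab] for lab in fine]


# Toy data (made up for illustration): two well-separated 1-D groups of 8 points.
group_a = [(i + 0.01 * i * i,) for i in range(8)]
group_b = [(100 + 2 * (i + 0.01 * i * i),) for i in range(8)]
coarse = hybrid_cluster(group_a + group_b, n=4, m=2)
```

On this toy data each group is first split into two fine clusters. Because the classifier confuses only fine clusters that lie within the same group, the merge step should reassemble the two original groups. A real application would replace each stand-in with the corresponding method named in the abstract.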
ASA Meetings Department
732 North Washington Street, Alexandria, VA 22314
(703) 684-1221 • meetings@amstat.org
Copyright © American Statistical Association.