Abstract:
We consider a clustering problem where we observe N feature vectors from K possible classes, each of length P. The class labels are unknown, and the main interest is to estimate them. We are primarily interested in the modern regime P >> N, where classical clustering methods face challenges. We propose Influential Features PCA (IF-PCA) as a new clustering procedure. In IF-PCA, we select a small fraction of features with the largest Kolmogorov-Smirnov (KS) scores, obtain the first (K - 1) left singular vectors of the post-selection normalized data matrix, and then estimate the labels by applying classical k-means to these singular vectors. The feature-selection threshold is set in a data-driven fashion by adapting the recent notion of Higher Criticism; as a result, IF-PCA is a tuning-free clustering method. IF-PCA is applied to 10 gene microarray data sets, where it shows competitive clustering performance.
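The pipeline described above (normalize, score features by KS statistics, threshold, SVD, k-means) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it keeps a fixed top fraction of features (`frac`, a hypothetical parameter) in place of the data-driven Higher Criticism threshold, and uses a standard-normal null for the KS scores.

```python
import numpy as np
from scipy.stats import kstest, zscore
from sklearn.cluster import KMeans

def if_pca_sketch(X, K, frac=0.05):
    """Sketch of IF-PCA. X is an N x P data matrix (rows = samples)."""
    # 1. Normalize each feature (column) to mean 0, variance 1.
    Xn = zscore(X, axis=0)
    # 2. KS score per feature against a standard-normal null
    #    (a simplification of the paper's KS screening).
    ks = np.array([kstest(Xn[:, j], "norm").statistic
                   for j in range(Xn.shape[1])])
    # 3. Keep the top fraction of features by KS score.
    #    (The paper sets this threshold by Higher Criticism instead.)
    keep = ks >= np.quantile(ks, 1 - frac)
    # 4. First (K - 1) left singular vectors of the post-selection matrix.
    U, _, _ = np.linalg.svd(Xn[:, keep], full_matrices=False)
    U = U[:, :K - 1]
    # 5. Classical k-means on the singular vectors gives the label estimates.
    return KMeans(n_clusters=K, n_init=10).fit_predict(U)
```

On synthetic data with two well-separated classes and a few informative features, this sketch recovers the partition; replacing the fixed `frac` with the Higher Criticism threshold is what makes the actual method tuning-free.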
Copyright © American Statistical Association.