Abstract:
Consider two seemingly unrelated but in fact connected problems: clustering with gene microarray data and network community detection. In both problems, we view the data matrix as the sum of a low-rank signal matrix, which contains the desired information about the class labels, and a noise matrix. Classical PCA is a well-known approach, but it faces challenges in both settings. We propose two new PCA approaches, IF-PCA and SCORE, one for each problem. In IF-PCA, we carefully select a small fraction of features and apply PCA using only the selected features. In SCORE, we obtain the first few leading eigenvectors of the data matrix, take entry-wise ratios between each of these vectors and the first one, and cluster by applying classical k-means to the resulting matrix. Both procedures are fast, conceptually simple, easy to implement, and yet provably effective. We have applied IF-PCA to 10 gene microarray data sets, and SCORE to the Coauthorship and Citation networks for statisticians, two data sets we have recently collected and cleaned. Both methods compare favorably with existing approaches. We explain why the procedures work and carefully justify their advantages theoretically.
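The two procedures described above can be sketched roughly as follows. This is an illustration only, written under several assumptions that the abstract does not state: the function names (`kmeans`, `score_cluster`, `if_pca_cluster`) and the parameter `frac` are hypothetical; the network is taken to be undirected and connected, so that the first eigenvector has no zero entries; and the feature-selection step uses a Kolmogorov-Smirnov score with a fixed fraction of retained features as a stand-in, since the abstract says only that a small fraction of features is "carefully" selected.

```python
import math
import numpy as np


def kmeans(X, k, n_init=10, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns labels from the best of n_init restarts."""
    rng = np.random.default_rng(seed)
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            # Keep the old center if a cluster happens to be empty.
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        cost = ((X - centers[labels]) ** 2).sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels


def score_cluster(A, k, seed=0):
    """SCORE-style clustering of a symmetric adjacency matrix A into k groups."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(-np.abs(vals))[:k]      # k leading eigenvectors by |eigenvalue|
    V = vecs[:, top]
    R = V[:, 1:] / V[:, [0]]                 # entry-wise ratios against the first one
    return kmeans(R, k, seed=seed)


_erf = np.vectorize(math.erf)


def if_pca_cluster(X, k, frac=0.05, seed=0):
    """Select a fraction of features, then PCA + k-means on the selected ones.

    X is samples-by-features. The KS score and the fixed fraction `frac`
    are stand-ins chosen for this sketch, not the authors' selection rule.
    """
    n, p = X.shape
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)   # standardize each feature
    Zs = np.sort(Z, axis=0)
    F = 0.5 * (1.0 + _erf(Zs / np.sqrt(2.0)))            # standard normal CDF
    i = np.arange(1, n + 1)[:, None]
    ks = np.maximum(i / n - F, F - (i - 1) / n).max(axis=0)
    n_keep = max(1, int(round(frac * p)))
    keep = np.argsort(-ks)[:n_keep]                      # most non-Gaussian features
    U, _, _ = np.linalg.svd(Z[:, keep], full_matrices=False)
    return kmeans(U[:, :k - 1], k, seed=seed)            # top k-1 left singular vectors
```

Dividing by the first eigenvector is what removes each node's individual degree effect from the remaining eigenvectors, which is why the sketch clusters on the ratios rather than on the eigenvectors themselves. Both functions return one integer label per row, with the labeling defined only up to a permutation of the cluster indices.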
ASA Meetings Department
732 North Washington Street, Alexandria, VA 22314
(703) 684-1221 • meetings@amstat.org
Copyright © American Statistical Association.