Abstract:
|
We propose a novel methodology for variable screening when clustering large-scale datasets that have very large sample sizes and are also high-dimensional. Using a convex clustering criterion based on a fusion penalty, we develop a fast screening procedure that efficiently discards non-informative variables from the data. We establish asymptotic optimality properties of the proposed method. Through extensive simulation experiments, we compare its performance with that of other clustering algorithms and obtain encouraging results. We demonstrate the applicability of our method to cluster analysis of big datasets arising in single-cell proteomic and genomic studies.
|