Abstract:
|
Principal component analysis is a ubiquitous method for discovering latent factors in data. Using it presents an important but challenging task: identifying which components capture signal in the data rather than noise. Parallel analysis via permutations is a popular approach with widespread use, empirical support, and recent theoretical justification from random matrix theory. In this approach, random permutations of the data provide a sort of null distribution for pure-noise eigenvalues; data eigenvalues greater than their “null”, i.e., noise, counterparts are selected. When the noise is heterogeneous, however, permutations can destroy this structure, significantly harming performance. This work proposes a new variant based on random signflips that addresses this shortcoming. Building on recent random matrix theoretic justifications for parallel analysis, we show that parallel analysis via signflips consistently selects perceptible components in certain high-dimensional and heterogeneous factor models; small signal components that do not separate from the noise are imperceptible and are not selected. Finally, we illustrate an application to single-cell RNA sequencing data.
|
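For concreteness, the selection rule described in the abstract might be sketched as follows. This is a minimal illustration and not necessarily the exact procedure analyzed in the paper. The function name `signflip_parallel_analysis`, the number of trials, the 95% quantile threshold, the stop-at-first-failure rule, and the synthetic data are all assumptions made for this example. Each trial multiplies every data entry by an independent random sign, which scrambles low-rank signal while leaving each entry's magnitude, and hence the entrywise noise variances, unchanged.

```python
import numpy as np


def signflip_parallel_analysis(X, num_trials=20, quantile=0.95, seed=0):
    """Estimate the number of components by comparing data singular values
    with those of randomly signflipped copies of the data.

    num_trials, quantile, and the stopping rule are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Singular values of the observed data, in decreasing order.
    data_sv = np.linalg.svd(X, compute_uv=False)

    # "Null" singular values from entrywise random sign flips.
    null_sv = np.empty((num_trials, data_sv.size))
    for t in range(num_trials):
        signs = rng.choice([-1.0, 1.0], size=X.shape)
        null_sv[t] = np.linalg.svd(signs * X, compute_uv=False)

    # Component k is kept if its singular value exceeds the chosen quantile
    # of the k-th null singular value; stop at the first component that fails.
    thresholds = np.quantile(null_sv, quantile, axis=0)
    k = 0
    for exceeds in data_sv > thresholds:
        if not exceeds:
            break
        k += 1
    return k


# Synthetic example (assumed setup): two strong factors plus noise whose
# standard deviation varies across rows, i.e., heterogeneous noise.
rng = np.random.default_rng(1)
n, p = 200, 100
U, _ = np.linalg.qr(rng.standard_normal((n, 2)))  # orthonormal factor scores
V, _ = np.linalg.qr(rng.standard_normal((p, 2)))  # orthonormal loadings
signal = U @ np.diag([400.0, 300.0]) @ V.T
row_std = np.linspace(0.5, 2.0, n)[:, None]
noise = row_std * rng.standard_normal((n, p))

k_hat = signflip_parallel_analysis(signal + noise)
print("selected components:", k_hat)
```

Replacing the sign flips with independent permutations of each column would give the permutation-based variant; as the abstract notes, that mixes entries with different noise levels when the noise is heterogeneous.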