Abstract:
|
A key challenge in the effective analysis of high-dimensional data is finding a low-dimensional, signal-rich subspace of the ambient space defined by the data. For linear subspaces, this is generally done by decomposing the design matrix into orthogonal components and then retaining those components with sufficient variation. The number of components to retain is typically determined using ad hoc approaches, such as plotting the decreasing pattern of the eigenvalues and looking for the "elbow" in the plot. While such approaches can be effective, a poorly calibrated heuristic or a misjudgment in choosing the elbow can leave an overabundance of noise, or an underabundance of predictive information, in the low-dimensional space. Here we propose a procedure that estimates the rank of a matrix by retaining components whose variation exceeds that of a random matrix, whose eigenvalues follow the universal Marchenko-Pastur distribution. We demonstrate the efficiency, scalability, and robustness of this novel dimension-determination procedure on simulated and real data, and compare its performance to previous methods.
|
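The thresholding idea described in the abstract can be illustrated with a minimal sketch: keep the eigenvalues of the sample covariance that exceed the upper edge of the Marchenko-Pastur distribution. This is not the paper's actual procedure, only an assumed baseline; the function name `estimate_rank_mp` and the unit-variance noise assumption (`sigma2=1.0`) are illustrative choices, not taken from the source.

```python
import numpy as np

def estimate_rank_mp(X, sigma2=1.0):
    # X: n x p data matrix with rows as observations.
    # sigma2: assumed noise variance (illustrative default of 1.0).
    n, p = X.shape
    gamma = p / n
    # Upper edge of the Marchenko-Pastur distribution: eigenvalues of a
    # pure-noise sample covariance concentrate below this value.
    lambda_plus = sigma2 * (1.0 + np.sqrt(gamma)) ** 2
    # Eigenvalues of the sample covariance matrix.
    eigvals = np.linalg.eigvalsh(X.T @ X / n)
    # Estimated rank: number of eigenvalues exceeding the noise edge.
    return int(np.sum(eigvals > lambda_plus))

# Example: pure noise yields (near) zero estimated rank, while adding a
# strong rank-1 spike pushes one eigenvalue past the edge.
rng = np.random.default_rng(0)
X_noise = rng.standard_normal((500, 100))
u = rng.standard_normal(500)
v = rng.standard_normal(100)
v = 3.0 * v / np.linalg.norm(v)          # spike strength well above the edge
X_signal = X_noise + np.outer(u, v)
```

In finite samples the largest noise eigenvalue fluctuates around the edge, so practical procedures (including, presumably, the one proposed here) must account for such fluctuations rather than apply the asymptotic edge naively.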