Abstract:
|
Robust principal components are particularly challenging to find for high-dimensional data sets, including genomic data. In such data, conventional principal component analysis is often unduly influenced by a few closely related family members. This phenomenon is explained using ideas from the high-dimension, low-sample-size geometric representation. The same ideas show why the earlier robust method of spherical principal components fails to solve this problem. As a solution, we propose visual L1 principal component analysis (VL1PCA), which is based on a backwards L1-norm best-fit idea. VL1PCA improves upon the best previous version of L1PCA by providing interpretable scores and a scatterplot visualization of the data. Another contribution is a new notion of robust centre, the backwards L1 median. The utility of VL1PCA is illustrated on examples and on a real high-dimensional data set. VL1PCA is not only robust to outliers but also gives a meaningful population stratification even in the presence of special family structure, where other methods fail.
|