Abstract:
|
Statistical analyses of high-throughput sequencing data have reshaped the biological sciences. Despite myriad advances, recovering interpretable biological signal from data corrupted by technical noise remains an open problem. Several classes of procedures, among them classical dimensionality reduction techniques and others that incorporate subject-matter knowledge, have made progress. However, no existing procedure recovers features that are both interpretable and relevant. Inspired by recent proposals that use control data to remove unwanted variation, we propose a variant of principal component analysis that extracts sparse and relevant biological signal. We find that this method produces more informative and interpretable embeddings than the linear (e.g. PCA, contrastive PCA, sparse PCA) and non-linear (e.g. UMAP, t-SNE) dimensionality reduction methods commonly used to explore high-dimensional biological data. We demonstrate this through the re-analysis of publicly available protein expression, microarray gene expression, and single-cell transcriptome sequencing datasets.
|