Abstract:
|
Probabilistic principal component analysis (PPCA) is frequently used to pre-process noisy data. Although the number of principal components (PCs) provides insight into the complexity of sample dependence, cluster assignments based on PCs do not always perform well, because noise in the data can weaken the separation between clusters. We previously proposed a penalized profile log-likelihood criterion to select the effective dimension of high-dimensional data. Here we take advantage of the learned representation and propose training classification models in the projection space. We illustrate via simulations that this approach requires less training data and leads to faster computation for multiple classification algorithms. The proposed method was applied to NCI 60 cell-line data to classify tumor types. Using 30% and 50% of the samples for training, we recorded 85% and 94% prediction accuracy, respectively, with a support vector machine (SVM). In contrast, classification based on the original data yielded 79% and 92% accuracy with the same training fractions. Our approach leverages the molecular variation across tens of thousands of genes simultaneously to produce accurate tumor classifications quickly.
|
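A minimal sketch of the idea summarized in the abstract (fit PPCA on the training samples, then train a classifier on the projected data) is given below. It is an illustration under stated assumptions, not the authors' implementation: the data are synthetic, the dimension q is fixed by hand instead of being chosen by the penalized profile log-likelihood criterion, the PPCA fit uses the standard closed-form maximum-likelihood solution, and a simple nearest-centroid rule stands in for classifiers such as an SVM. The helper names `fit_ppca`, `project`, and `nearest_centroid_predict` are hypothetical.

```python
import numpy as np

def fit_ppca(X, q):
    """Closed-form maximum-likelihood PPCA fit with q components (q < n_features)."""
    n, d = X.shape
    mu = X.mean(axis=0)
    # The SVD of the centered data gives the eigen-decomposition of the sample covariance.
    _, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    eigvals = s**2 / n
    # Noise variance: average of the d - q discarded eigenvalues (missing ones are zero).
    sigma2 = eigvals[q:].sum() / (d - q)
    # Loading matrix W = U_q (Lambda_q - sigma2 I)^(1/2), shape (d, q).
    W = Vt[:q].T * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    return mu, W, sigma2

def project(X, mu, W, sigma2):
    """Posterior mean of the latent variables: M^{-1} W^T (x - mu), M = W^T W + sigma2 I."""
    M = W.T @ W + sigma2 * np.eye(W.shape[1])
    return np.linalg.solve(M, W.T @ (X - mu).T).T

def nearest_centroid_predict(Z_train, y_train, Z_test):
    """Assign each test point to the class with the closest training centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([Z_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((Z_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

# Synthetic data (assumption): 3 classes, 500 features, 3 latent dimensions plus noise.
rng = np.random.default_rng(0)
n_per_class, d, q = 40, 500, 3
labels = np.repeat(np.arange(3), n_per_class)
Z_true = 4.0 * np.eye(3)[labels] + rng.normal(size=(labels.size, q))
A = rng.normal(size=(d, q))
X = Z_true @ A.T + rng.normal(scale=2.0, size=(labels.size, d))

# Stratified split: 30% of each class (12 samples) for training, the rest for testing.
train_idx = np.concatenate(
    [rng.permutation(np.flatnonzero(labels == c))[:12] for c in range(3)]
)
test_mask = np.ones(labels.size, dtype=bool)
test_mask[train_idx] = False

# Fit PPCA on the training samples only, then project both sets.
mu, W, sigma2 = fit_ppca(X[train_idx], q)
Z_train = project(X[train_idx], mu, W, sigma2)
Z_test = project(X[test_mask], mu, W, sigma2)

# Classify in the projection space and, for comparison, in the original space.
pred_proj = nearest_centroid_predict(Z_train, labels[train_idx], Z_test)
pred_orig = nearest_centroid_predict(X[train_idx], labels[train_idx], X[test_mask])
acc_proj = float((pred_proj == labels[test_mask]).mean())
acc_orig = float((pred_orig == labels[test_mask]).mean())
print(f"accuracy in projection space: {acc_proj:.2f}")
print(f"accuracy in original space:   {acc_orig:.2f}")
```

The classifier here sees only q = 3 features per sample instead of 500, which is where any saving in training cost would come from. The numbers printed on this toy data say nothing about the NCI 60 results reported above.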