Online Program

Saturday, May 19
Machine Learning
Machine Learning for Complex Data
Sat, May 19, 10:30 AM - 12:00 PM
Grand Ballroom D

XPCA: Interval-Censored Copula Principal Component Analysis for Discrete and Continuous Features (304574)

*Clifford Anderson-Bergman, Sandia National Laboratories 
Kina Kincher-Winoto, Sandia National Laboratories 
Tamara G. Kolda, Sandia National Laboratories 

Keywords: PCA, machine learning, statistics, dimensionality reduction, imputation

In principal component analysis (PCA), we often face the problem of how to handle disparate feature variables. How can we compare a variable with a bimodal distribution to another with a heavy-tailed distribution? Binary variables such as {0, 1} and ordinal ratings such as {1, 2, 3, 4, 5} often appear in the same data matrix, yet lie on very different scales. The standard approach is to center and scale the data, and we show that this is well justified under certain assumptions, including Gaussian marginal distributions. We seek to relax this Gaussianity assumption, which is clearly violated for discrete data. Thus, we propose a novel method, which we call XPCA, that handles not only discrete data but also a mixture of variable types with unknown distributions. Our method combines the ideas of Copula Component Analysis (COCA), a semiparametric copula PCA model that estimates each column's marginal distribution nonparametrically but treats every column as continuous, with interval-censoring methods that properly account for discrete data types. The output of our method, like that of PCA, can be used to find latent structure in data, build predictive models, and perform dimensionality reduction. We show how to compute XPCA and give an in-depth breakdown of how XPCA's internals compare to those of PCA and COCA. We also show how to compute XPCA even in the midst of incomplete data, as well as methods for estimating missing values. One benefit of the proposed method is that its estimates never fall outside the range of the input data. Additionally, we can provide a probability distribution for missing discrete values, which is particularly useful when imputing binary variables. Lastly, we apply XPCA to several real-world data sets to show its advantages over PCA and COCA. *Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of...
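To make the pipeline concrete, the following is a minimal sketch of the COCA-style continuous step that XPCA builds on: map each column through its empirical CDF, push the result through the standard normal quantile function (a Gaussian copula transform), and run ordinary PCA on the transformed matrix. This is an illustrative reconstruction, not the authors' implementation; in particular, it treats every column as continuous, so ties in discrete columns are ranked arbitrarily — exactly the defect that XPCA's interval censoring is designed to fix. The function name `copula_pca_scores` is ours.

```python
import numpy as np
from statistics import NormalDist  # stdlib inverse normal CDF (Python 3.8+)

def copula_pca_scores(X, n_components=2):
    """COCA-style copula PCA sketch (continuous treatment only).

    1. Replace each column by its empirical CDF values, scaled into (0, 1).
    2. Apply the probit (standard normal quantile) transform.
    3. Run PCA via SVD on the centered, transformed matrix.

    Note: ties in discrete columns get arbitrary ranks here; XPCA instead
    treats discrete values as interval-censored observations.
    """
    n, _ = X.shape
    # Rank each column (1..n); argsort-of-argsort gives the rank order.
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    # Rescale to (0, 1) so the quantile transform stays finite.
    U = ranks / (n + 1.0)
    # Gaussian copula scale: nonparametric marginals -> probit transform.
    Z = np.vectorize(NormalDist().inv_cdf)(U)
    # Ordinary PCA on the transformed data.
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:n_components].T  # (n, n_components) score matrix

# Example: heavy-tailed and ordinal columns end up on a common scale.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.standard_cauchy(100),        # heavy-tailed continuous feature
    rng.integers(1, 6, size=100),    # ordinal 1-5 rating
    rng.integers(0, 2, size=100),    # binary feature
])
scores = copula_pca_scores(X, n_components=2)
```

Because the transform depends only on ranks, the heavy-tailed column's outliers no longer dominate the principal components, which is the motivation for the copula step; handling the discrete columns correctly is then XPCA's contribution.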