Abstract:
|
In modern biomedical research, it is increasingly common to collect data sets from multiple sources for a common set of subjects. However, data from different sources may be heterogeneous, and each data set may be high dimensional, making it impractical to carry out analyses (such as predictive modeling or clustering) directly on the original data. Dimension reduction can reduce the size and complexity of the data and integrate multiple data sets into a common space. In this paper, we introduce Supervised Integrated Principal Component Analysis (SIPCA), a new computational tool for the integration and reduction of multi-source data. The method explicitly captures joint and individual structures across multiple primary data sources. Moreover, when auxiliary data drive the underlying structures, SIPCA accounts for the auxiliary information through a latent variable model. SIPCA substantially improves the interpretability of the reduced data over existing dimension reduction methods. We demonstrate the advantages of SIPCA using a multi-tissue genetic study and a pediatric growth study.
|