Abstract:
|
Canonical correlation analysis (CCA) provides a global test and measure of association between two multivariate sets of variables measured on the same individuals. In large multivariate settings, the proportion of subjects missing data on at least one variable can be high. In practice, missing data have typically been handled before CCA by complete case analysis, unconditional mean imputation, or k-nearest neighbors imputation. For each of these methods, as well as for more sophisticated imputation methods, we examine the bias of the first canonical correlation and the power of a test of association between the two sets of variables. Even when the data are missing completely at random (MCAR), bias is large in complete case analysis, because discarding incomplete cases shrinks the sample and bias in CCA is strongly linked to sample size. Surprisingly, tree-based imputation does not outperform naive single imputation methods. We present advances in performing multiple imputation, which is nontrivial due to the lack of a likelihood function in CCA. We offer recommendations for imputation in CCA based on simulated data of wide-ranging complexity, and we apply these methods to relate dietary variables to blood lipid levels.
|