Friday, February 16  
CS07 Exploring Big Data 
Fri, Feb 16, 11:00 AM – 12:30 PM
Salon D 
Exploratory Data Structure Comparisons by Use of Principal Component Analysis (303545)
Karl Bang Christensen, Section of Biostatistics, Department of Public Health, University of Copenhagen
Bo Markussen, Department of Mathematical Sciences, University of Copenhagen
*Anne Helby Petersen, Biostatistics, University of Copenhagen
Keywords: principal component analysis, data structure comparisons, R, exploratory data analysis, covariance matrix
Many statistical datasets are divided into two distinct subsets, e.g., due to multicenter sampling. Often, one wishes to combine such subsets into a single dataset and conduct only one analysis, but this is meaningful only if the two subsets are similar in structure. However, procedures for assessing whether such pooling is justified are usually ad hoc and require making full model assumptions. This is fundamentally problematic, as it results in a circular process in which the model choice hinges on the structure of the data, while the structure of the data is evaluated by the chosen model. This implies a high risk of overfitting and of obtaining parameter estimates that do not have the distributional properties implied by textbook statistics. In this presentation, a more systematic approach to data structure comparisons is proposed, focusing on the structure of the covariance matrix. By use of principal component analysis, covariance matrices can be visualized for intuitive, exploratory data structure comparisons without any model assumptions. Three visual tools from the R package PCADSC are presented as a candidate method for exploratory data structure comparisons.
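The general idea of comparing two subsets through the principal components of their covariance matrices can be illustrated with a minimal sketch. This is not the PCADSC implementation and does not use its API; it is written in Python with NumPy, the function names (pca_summary, compare_structures) are hypothetical, and the two subsets are simulated from the same assumed covariance matrix purely for illustration. It computes two numerical summaries that one might plot: the cumulative share of variance explained in each subset, and the angle between matching principal components.

```python
import numpy as np


def pca_summary(x):
    """Eigenvalues (descending) and eigenvectors of the correlation matrix of x."""
    # Standardize each column so the comparison is not driven by measurement scale.
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    corr = np.cov(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


def compare_structures(a, b):
    """Hypothetical helper: cumulative explained variance and component angles."""
    vals_a, vecs_a = pca_summary(a)
    vals_b, vecs_b = pca_summary(b)
    cum_a = np.cumsum(vals_a) / vals_a.sum()
    cum_b = np.cumsum(vals_b) / vals_b.sum()
    # The sign of an eigenvector is arbitrary, so compare absolute cosines.
    cosines = np.clip(np.abs(np.sum(vecs_a * vecs_b, axis=0)), 0.0, 1.0)
    angles = np.degrees(np.arccos(cosines))
    return cum_a, cum_b, angles


# Simulated example data (an assumption for illustration, not from the talk).
rng = np.random.default_rng(0)
shared_cov = np.array([[1.0, 0.8, 0.3],
                       [0.8, 1.0, 0.3],
                       [0.3, 0.3, 1.0]])
subset_a = rng.multivariate_normal(np.zeros(3), shared_cov, size=500)
subset_b = rng.multivariate_normal(np.zeros(3), shared_cov, size=500)

cum_a, cum_b, angles = compare_structures(subset_a, subset_b)
print("Cumulative explained variance, subset A:", np.round(cum_a, 3))
print("Cumulative explained variance, subset B:", np.round(cum_b, 3))
print("Angles between matching components (degrees):", np.round(angles, 1))
```

Similar cumulative variance curves and small angles would be consistent with similar covariance structure; large discrepancies would suggest that pooling the subsets deserves closer scrutiny. Components with nearly equal eigenvalues are poorly separated, so their angles should be read with caution.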
