Online Program

Friday, February 16
CS07 Exploring Big Data Fri, Feb 16, 11:00 AM - 12:30 PM
Salon D

Exploratory Data Structure Comparisons by Use of Principal Component Analysis (303545)

Karl Bang Christensen, Section of Biostatistics, Department of Public Health, University of Copenhagen 
Bo Markussen, Department of Mathematical Sciences, University of Copenhagen 
*Anne Helby Petersen, Biostatistics, University of Copenhagen 

Keywords: principal component analysis, data structure comparisons, R, exploratory data analysis, covariance matrix

Many statistical datasets are divided into two distinct subsets, e.g. due to multi-center sampling. Often, one wishes to combine such data and conduct a single analysis, but this is only meaningful if the two subsets are similar in structure. However, procedures for assessing whether combining the data is justified are usually ad hoc and require making full model assumptions. This is fundamentally problematic, as it results in a circular process in which the model choice hinges on the structure of the data, while the structure of the data is evaluated by the chosen model. This implies a high risk of overfitting and of obtaining parameter estimates that do not have the distributional properties implied by textbook statistics. In this presentation, a more systematic approach to data structure comparisons is proposed, focusing on the structure of the covariance matrix. Using principal component analysis, covariance matrices can be visualized for intuitive, exploratory data structure comparisons without any model assumptions. Three visual tools from the R package PCADSC are presented as a candidate method for exploratory data structure comparisons.
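
As a rough illustration of the idea of comparing covariance structure through principal components, the following Python sketch handles a toy case with only two variables. It is not the PCADSC package and does not reproduce its three visual tools. The function names (`covariance_2d`, `pca_2d`, `simulate`) and the simulated data are hypothetical and chosen purely for illustration. For each subset it computes the sample covariance matrix, then reports the share of variance carried by the first principal component and the direction of that component.

```python
import math
import random
from statistics import mean


def covariance_2d(xs, ys):
    """Sample covariance entries (var_x, cov_xy, var_y) for two variables."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return var_x, cov_xy, var_y


def pca_2d(var_x, cov_xy, var_y):
    """Closed-form PCA of a symmetric 2x2 covariance matrix.

    Returns the share of total variance explained by the first principal
    component and the angle (in degrees) of that component's axis.
    Assumes the total variance is positive.
    """
    half_trace = (var_x + var_y) / 2
    radius = math.hypot((var_x - var_y) / 2, cov_xy)
    lam1 = half_trace + radius  # largest eigenvalue
    lam2 = half_trace - radius  # smallest eigenvalue
    share = lam1 / (lam1 + lam2)
    angle = 0.5 * math.degrees(math.atan2(2 * cov_xy, var_x - var_y))
    return share, angle


# Simulated (not real) data: two subsets whose variables are related
# in opposite directions, so their covariance structures differ.
rng = random.Random(0)


def simulate(n, slope):
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [slope * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


for label, slope in [("subset A", 1.0), ("subset B", -1.0)]:
    xs, ys = simulate(200, slope)
    share, angle = pca_2d(*covariance_2d(xs, ys))
    print(f"{label}: PC1 explains {share:.0%} of variance, "
          f"direction {angle:.0f} degrees")
```

In this toy setup the two subsets can show a similar variance share for the first component while its direction differs, which is the kind of structural difference that would argue against combining the subsets. With more than two variables, a numerical eigendecomposition would replace the closed-form step.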