Abstract:
|
Large population studies are now increasingly common in fields such as genetics, neuroscience, and economics. These datasets contain large numbers of samples and features, their size can often exceed the available RAM, and they come with their own subject-specific complexities, e.g., mixed data types and missing data. Consequently, the software available to run the statistical learning procedure of our choice might not be able to handle these challenges. For instance, the R package "glmnet" does not support long vectors, and R packages such as "biglasso" and "bigstatsr", which are designed for large datasets, have limited functionality; e.g., "biglasso" does not support multivariate lasso regression. Alternatively, we could reduce the number of samples and features, but we must be careful, as over-reduction can lead to a loss of information. At this roundtable, we plan to discuss the major methodological and computational challenges as well as the key lessons learnt while analyzing such datasets.
|