Abstract:
|
Penalized regression models such as the Lasso have been extensively applied to the analysis of high-dimensional data sets. However, due to memory limitations, existing R packages such as glmnet cannot fit Lasso models to the ultrahigh-dimensional, multi-gigabyte data sets that are increasingly common in areas such as genetics, biomedical imaging, and high-frequency finance. In this study, we implement an R package called biglasso that addresses this challenge. Built upon existing APIs, biglasso uses memory-mapped files to store massive data on disk and reads them into memory only when necessary during model fitting. Benchmarking experiments demonstrate that biglasso is roughly equivalent to glmnet in computation speed but is much more memory-efficient. This advantage makes it possible to carry out powerful big data analyses on an ordinary laptop. Using real data from large-scale genome-wide association studies of prematurity and its complications, we further demonstrate that our package can analyze massive data sets that existing R packages cannot accommodate.
|