Abstract:
|
In the age of big data, a central concern in statistical analysis is the oversaturated model, in which the number of covariates p exceeds the number of observations n. In this setting, simple statistical models can suffer from a range of accuracy and stability problems. To combat these problems, statisticians often either add constraints (e.g., the LASSO) or inject prior information (e.g., the horseshoe prior).
We propose a new technique called Grouped Covariate Regression (GCR), a strategy that combines data mining and hierarchical modeling in two primary steps: (1) group the variables and summarize each group, and (2) model a response given the group summaries. Each step can be implemented in several ways; in this paper we consider a deterministic approach for grouping the variables and an errors-in-variables model for the response.
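The two steps can be illustrated with a minimal sketch. It is not the implementation used in the paper: the grouping rule (contiguous blocks of columns), the summary statistic (the group mean), and the function names `group_summaries` and `fit_on_summaries` are all assumptions made for illustration, and ordinary least squares stands in for the errors-in-variables model.

```python
import numpy as np

def group_summaries(X, groups):
    """Step 1: summarize each group of columns by its row-wise mean (assumed summary)."""
    labels = np.unique(groups)
    return np.column_stack([X[:, groups == g].mean(axis=1) for g in labels])

def fit_on_summaries(Z, y):
    """Step 2 (simplified): least squares of y on the group summaries plus an intercept.

    This replaces the errors-in-variables model with plain OLS for brevity.
    """
    design = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Simulated oversaturated setting: p = 100 covariates, n = 20 observations.
rng = np.random.default_rng(0)
n, p, k = 20, 100, 5
X = rng.normal(size=(n, p))

# Hypothetical deterministic grouping: k contiguous blocks of p // k columns.
groups = np.repeat(np.arange(k), p // k)

# The response depends on the mean of the first group only.
y = 2.0 * X[:, groups == 0].mean(axis=1) + rng.normal(scale=0.05, size=n)

Z = group_summaries(X, groups)   # n x k matrix, so k < n predictors remain
coef = fit_on_summaries(Z, y)    # intercept followed by one coefficient per group
```

After grouping, the regression has k = 5 predictors rather than p = 100, so the least-squares problem is no longer underdetermined, and each coefficient refers to a named group of the original variables.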
Simulation studies and an application to a biodiversity problem demonstrate several advantages of GCR: (1) overcoming the oversaturated model problem, (2) lowering correlation between predictors in the model, (3) overcoming issues presented by sparse matrices, and (4) retaining interpretability of the coefficients.
|
Copyright © American Statistical Association.