Activity Number: 282 - Sampling and Ensembling in Statistical Computing
Type: Contributed
Date/Time: Tuesday, August 9, 2022: 10:30 AM to 12:20 PM
Sponsor: Section on Statistical Computing
Abstract #323458

Title: Volume Subsampling for Big Data Linear Regression
Author(s): Ethan Davis* and Jonathan Stallrich
Companies: North Carolina State University and North Carolina State University
Keywords: Massive data; Subsampling; Linear regression; Data reduction; D-optimality

Abstract:
The development of systems capable of collecting massive data has led to a concurrent need for methodologies to analyze this information. A common tool for handling large data is subsampling, in which a subset of observations is used for analysis in place of the full data. This work focuses on optimal subsampling for ordinary least squares (OLS) regression. We introduce the sparse Galil-Kiefer method (sGKM) for large-data OLS subsampling. The method leverages a fast approximation to the row space of the data matrix to enable efficient use of a Galil-Kiefer-type subsampling procedure. The sGKM algorithm provides two advantages relative to current methods: (1) a user-specified trade-off between speed and efficiency via the dimension of the low-rank approximation to the data matrix; and (2) effective subsampling for high-dimensional categorical covariates. To our knowledge, this work is the first to consider subsampling in the context of categorical covariates for the large-n OLS problem. We also provide simulation studies demonstrating sGKM's ability to produce higher-quality subsamples than current methods on both real-valued and categorical data.
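To make the idea concrete, here is a minimal sketch of the general approach the abstract describes: approximate the row space of the data matrix with a low-rank randomized sketch, then select rows by a greedy D-optimality criterion in the reduced space. This is not the authors' sGKM implementation; the function name `subsample_dopt`, the randomized range finder, and the greedy exchange step are all illustrative assumptions standing in for the unpublished algorithm.

```python
import numpy as np

def subsample_dopt(X, k, r, seed=0):
    """Illustrative sketch (NOT the authors' sGKM): pick k of n rows of X
    by greedy D-optimality in a rank-r approximation of X's row space.

    The rank r controls the speed/efficiency trade-off mentioned in the
    abstract: smaller r is cheaper but approximates the row space less well.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Randomized range finder: project onto r random directions, then
    # orthonormalize to get an approximate basis for the column span of X.
    Omega = rng.standard_normal((p, r))
    Z, _ = np.linalg.qr(X @ Omega)      # n x r reduced-dimension rows
    # Greedy D-optimal selection: adding row z multiplies det(M) by
    # (1 + z' M^{-1} z), so pick the row maximizing z' M^{-1} z each step.
    Minv = np.linalg.inv(1e-8 * np.eye(r))  # inverse of regularized info matrix
    available = np.ones(n, dtype=bool)
    chosen = []
    for _ in range(k):
        gains = np.einsum('ij,jk,ik->i', Z, Minv, Z)  # diag(Z Minv Z')
        gains[~available] = -np.inf
        i = int(np.argmax(gains))
        chosen.append(i)
        available[i] = False
        # Sherman-Morrison rank-one update of Minv after adding row z.
        z = Z[i]
        Mz = Minv @ z
        Minv -= np.outer(Mz, Mz) / (1.0 + z @ Mz)
    return np.array(chosen)
```

A subsample produced this way would then be fed to an ordinary OLS fit, e.g. `np.linalg.lstsq(X[idx], y[idx], rcond=None)`, in place of the full-data regression.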
Authors who are presenting talks have a * after their name.