All Times EDT

Abstract Details

Activity Number: 282 - Sampling and Ensembling in Statistical Computing
Type: Contributed
Date/Time: Tuesday, August 9, 2022 : 10:30 AM to 12:20 PM
Sponsor: Section on Statistical Computing
Abstract #323458
Title: Volume Subsampling for Big Data Linear Regression
Author(s): Ethan Davis* and Jonathan Stallrich
Companies: North Carolina State University and North Carolina State University
Keywords: Massive data; Subsampling; Linear regression; Data reduction; D-optimality
Abstract:

The development of systems capable of collecting massive data has led to a concurrent need for methodologies to analyze this information. A common tool for handling large data is subsampling, where a subset of observations is used to perform the analysis in place of the full data. This work focuses on optimal subsampling for ordinary least squares (OLS) regression. We introduce the sparse-Galil-Kiefer method (sGKM) for large-data OLS subsampling. The method leverages a fast approximation to the row space of the data matrix to allow efficient use of a Galil-Kiefer-type subsampling procedure. The sGKM algorithm provides the following advantages relative to current methods: (1) a user-specified trade-off between speed and efficiency via the dimension of the low-rank approximation to the data matrix, and (2) effective subsampling for high-dimensional categorical covariates. To our knowledge, this work is the first to consider subsampling in the context of categorical coefficients for the large-n OLS problem. We also provide simulation studies demonstrating sGKM's ability to produce higher-quality subsamples relative to current methods on real-valued as well as categorical data types.
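
For readers unfamiliar with the setting, the following is a minimal sketch of OLS on a subsample, scored by the D-optimality criterion named in the keywords (the log-determinant of the subsample information matrix). It is not the sGKM algorithm, whose details the abstract does not give. It uses a uniform random subsample as a simple baseline, and the function names, simulated data, and sizes (n, p, k) are all assumptions made for illustration.

```python
import numpy as np

def log_det_information(X_sub):
    """Log-determinant of X_sub' X_sub (the D-optimality criterion for OLS)."""
    sign, logdet = np.linalg.slogdet(X_sub.T @ X_sub)
    # A non-positive determinant means the subsample is rank deficient.
    return logdet if sign > 0 else -np.inf

def subsample_ols(X, y, idx):
    """Fit OLS using only the rows listed in idx (hypothetical helper)."""
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return beta

# Simulated full data: these sizes and coefficients are illustrative only.
rng = np.random.default_rng(0)
n, p, k = 100_000, 5, 500
X = rng.normal(size=(n, p))
beta_true = np.arange(1, p + 1, dtype=float)
y = X @ beta_true + rng.normal(size=n)

# Baseline: uniform subsample of k rows without replacement.
idx = rng.choice(n, size=k, replace=False)
beta_sub = subsample_ols(X, y, idx)
d_crit = log_det_information(X[idx])
```

A subsampling method aimed at D-optimality would choose `idx` to make `d_crit` larger than a uniform draw of the same size typically achieves, which in turn shrinks the volume of the confidence region for the coefficients.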


Authors who are presenting talks have a * after their name.

Back to the full JSM 2022 program