
All Times EDT

Abstract Details

Activity Number: 15 - Subsampling: Basic Tool That Facilitates the Identification of Statistical Relationships in Big Data
Type: Topic Contributed
Date/Time: Sunday, August 7, 2022 : 2:00 PM to 3:50 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #322314
Title: Supervised Compression of Big Data
Author(s): Roshan V Joseph* and Simon Mak
Companies: Georgia Institute of Technology and Duke University
Keywords: Data reduction; Clustering; Experimental design; Subsampling
Abstract:

The phenomenon of big data has become ubiquitous in nearly all disciplines, from science to engineering. A key challenge is the use of such data for fitting statistical and machine learning models, which can incur high computational and storage costs. One solution is to perform model fitting on a carefully selected subset of the data. Various data reduction methods have been proposed in the literature, ranging from random subsampling to optimal experimental design-based methods. However, when the goal is to learn the underlying input-output relationship, such reduction methods may not be ideal, since they do not make use of the information contained in the output. To this end, we propose a supervised data compression method called supercompress, which integrates output information by sampling data from regions most important for modeling the desired input-output relationship. An advantage of supercompress is that it is nonparametric: the compression method does not rely on parametric modeling assumptions about the relationship between inputs and output. As a result, the proposed method is robust to a wide range of modeling choices.
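
To make the idea of output-aware data reduction concrete, the following is a minimal illustrative sketch and not the authors' supercompress algorithm, whose details the abstract does not give. It assumes a simple stand-in technique: k-means clustering on the standardized inputs joined with the output, with each cluster replaced by its mean input and mean output. The function name compress_sketch, the parameter output_weight, and the toy data are all hypothetical.

```python
import numpy as np


def compress_sketch(X, y, k, output_weight=1.0, n_iter=25, seed=0):
    """Reduce (X, y) to at most k representative points.

    Hypothetical stand-in for supervised compression: Lloyd's k-means
    on the joint, standardized (inputs, output) space.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)

    # Standardize so inputs and output are on comparable scales.
    Z = np.hstack([X, y])
    Z = (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-12)
    # output_weight = 0 ignores the output (unsupervised clustering).
    Z[:, -1] *= output_weight

    def assign(centers):
        # Squared distance from every point to every center.
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    # Initialize centers at k distinct data points.
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(centers)
        for j in range(k):
            members = Z[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

    labels = assign(centers)
    keep = [j for j in range(k) if np.any(labels == j)]
    # Report representatives on the original (unstandardized) scale.
    X_rep = np.array([X[labels == j].mean(axis=0) for j in keep])
    y_rep = np.array([y[labels == j].mean() for j in keep])
    return X_rep, y_rep


# Toy data (assumed): a steep transition near x = 0, flat elsewhere.
data_rng = np.random.default_rng(1)
X = data_rng.uniform(-1, 1, size=(2000, 1))
y = np.tanh(10 * X[:, 0]) + 0.05 * data_rng.normal(size=2000)

X_rep, y_rep = compress_sketch(X, y, k=20)
print(X_rep.shape, y_rep.shape)
```

A model can then be fitted to (X_rep, y_rep) in place of the full data. In this sketch, setting output_weight to 0 reduces the procedure to clustering on the inputs alone, which gives a baseline for judging what the output information contributes.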


Authors who are presenting talks have a * after their name.

Back to the full JSM 2022 program