Name: 2019 Joint Statistical Meetings
Start: 2019-07-27T07:00:00+00:00
End: 2019-08-01
Location: Colorado Convention Center

Abstract Details

Activity Number:	618 - Machine Learning for Big Data
Type:	Contributed
Date/Time:	Thursday, August 1, 2019 : 8:30 AM to 10:20 AM
Sponsor:	Section on Statistical Learning and Data Science
Abstract #304152	Presentation
Title:	ON SUPPLEMENTING TRAINING DATA BY HALF-SAMPLING
Author(s):	William Heavlin*
Companies:	Google, Inc.
Keywords:	I-optimality; jackknife; neural networks; orthogonal arrays; prediction covariance matrix; prediction uncertainty
Abstract:	Machine learning models typically train on one dataset, then assess performance on another. We consider the case of training on a given dataset, then determining which (large batch of) unlabeled candidates to label in order to improve the model further. We score each candidate by its associated prediction error(s).

We concentrate on the large-batch case for two reasons: (1) While choose-1-then-update (batch of size 1) successfully avoids near-duplicates, a choose-N-then-update scheme (batch of size N) needs additional constraints to avoid overselecting near-duplicates. (2) Just as large data volumes enable ML, updates to these large data volumes also tend to arrive in largish batches.

We estimate model uncertainty from 50-percent samples drawn without replacement. Using a two-level orthogonal array with n columns, the resulting maximally balanced half-samples achieve high efficiency; the result is one model for each column of the orthogonal array. We use the associated n-dimensional representation of prediction uncertainty to choose which N candidates to label.

We illustrate by fitting Keras-based neural networks to the MNIST handwritten digit dataset.
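The half-sampling idea in the abstract can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the talk: the orthogonal array is built by the Sylvester (Hadamard) construction, the training data are split into interleaved blocks, a one-variable least-squares line stands in for the neural network, and the batch is chosen by a greedy "largest remaining norm, then project out" rule as one possible way to avoid near-duplicates. The function names are hypothetical.

```python
import random

def hadamard(k):
    """Sylvester construction: 2^k x 2^k matrix of +1/-1 with orthogonal columns."""
    H = [[1]]
    for _ in range(k):
        H = [r + r for r in H] + [r + [-v for v in r] for r in H]
    return H

def half_sample_blocks(k):
    """For each non-constant column, list the blocks (rows) marked +1."""
    H = hadamard(k)
    m = len(H)
    return [[i for i in range(m) if H[i][j] == 1] for j in range(1, m)]

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (illustrative stand-in for a real model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def uncertainty_vectors(train, candidates, k=3):
    """Centered predictions per candidate: one coordinate per half-sample model."""
    m = 2 ** k
    blocks = [train[i::m] for i in range(m)]  # assumed split into m interleaved blocks
    models = []
    for cols in half_sample_blocks(k):
        pts = [p for i in cols for p in blocks[i]]
        models.append(fit_line([p[0] for p in pts], [p[1] for p in pts]))
    vecs = []
    for x in candidates:
        preds = [a + b * x for a, b in models]
        mean = sum(preds) / len(preds)
        vecs.append([p - mean for p in preds])
    return vecs

def choose_batch(vecs, n_pick):
    """Greedy rule (an assumption): take the largest vector, then remove its direction."""
    vecs = [v[:] for v in vecs]
    chosen = []
    for _ in range(n_pick):
        norms = [sum(c * c for c in v) for v in vecs]
        best = max(range(len(vecs)), key=lambda i: norms[i])
        if norms[best] == 0.0:
            break
        chosen.append(best)
        u = [c / norms[best] ** 0.5 for c in vecs[best]]
        for v in vecs:
            dot = sum(a * b for a, b in zip(u, v))
            for t in range(len(v)):
                v[t] -= dot * u[t]
    return chosen

# Synthetic data, for illustration only.
rng = random.Random(0)
train = []
for _ in range(64):
    x = rng.random()
    train.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))

candidates = [5.0, 5.0, -4.0, 0.5]  # the first two are exact duplicates
vecs = uncertainty_vectors(train, candidates, k=3)
chosen = choose_batch(vecs, 2)
print(chosen)
```

With k = 3 the array yields 7 balanced half-samples, so each candidate gets a 7-dimensional uncertainty vector. Once a candidate is chosen, the projection step shrinks the vector of any near-duplicate toward zero, so the duplicate is not chosen alongside it. Because the stand-in model is a straight line, the vectors span at most two dimensions; a richer model would support larger batches.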

Authors who are presenting talks have a * after their name.

American Statistical Association