Abstract:
|
Machine learning models typically train on one dataset, then assess performance on another. We consider the case of training on a given dataset, then determining which (large batch of) unlabeled candidates to label in order to improve the model further. We score each candidate by its associated prediction error(s). ¶We concentrate on the large batch case for two reasons: (1) While choose-1-then-update (batch of size 1) successfully avoids near-duplicates, choose-N-then-update (batch of size N) needs additional constraints to avoid overselecting near-duplicates. (2) Just as large data volumes enable ML, updates to these large data volumes also tend to come in largish batches. ¶We estimate model uncertainty by fitting models to 50-percent samples drawn without replacement. Using a two-level orthogonal array with n columns, the resulting maximally balanced half-samples achieve high efficiency; the result is one model for each column of the orthogonal array. We use the associated n-dimensional representation of prediction uncertainty to choose which N candidates to label. ¶We illustrate by fitting Keras-based neural networks to the MNIST handwritten digit dataset.
|
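
The balanced half-sample construction described in the abstract can be illustrated with a minimal sketch. It assumes a Sylvester-type Hadamard matrix (constant column dropped) as the two-level orthogonal array, and a random, even assignment of data points to the rows of the array. The abstract specifies neither choice, and the function names `sylvester_hadamard` and `balanced_half_samples` are hypothetical.

```python
import numpy as np


def sylvester_hadamard(k):
    """Return the 2**k x 2**k Sylvester Hadamard matrix with entries +1/-1."""
    h = np.array([[1]])
    for _ in range(k):
        h = np.block([[h, h], [h, -h]])
    return h


def balanced_half_samples(n_points, k, seed=0):
    """Return a boolean mask of shape (n_points, 2**k - 1).

    Column j marks the training points used by model j. Assumption: each
    data point is assigned to one row (run) of the array, as evenly as
    possible and in random order.
    """
    # Drop the all-ones first column; the remaining columns are balanced
    # (half +1, half -1) and mutually orthogonal.
    oa = sylvester_hadamard(k)[:, 1:]
    rng = np.random.default_rng(seed)
    runs = rng.permutation(np.arange(n_points) % oa.shape[0])
    return oa[runs] == 1


# 64 points, 8-run array, 7 half-samples (one model per column).
masks = balanced_half_samples(n_points=64, k=3)
```

When the number of points is a multiple of the number of rows, each column selects exactly half of the points and any two columns share exactly a quarter of them. Fitting one model per column and stacking the predictions for a candidate would give the n-dimensional uncertainty representation mentioned above.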