Online Program

Friday, May 18
Computational Statistics
Survey Science
Fri, May 18, 10:30 AM - 12:00 PM
Lake Fairfax B
 

Systematic Sampling Design with Application to Data Splitting (304493)

*Redouane Betrouni, George Mason University 
Edward Wegman, George Mason University 

Keywords: systematic sampling; data splitting; prediction

In this study we propose a new scheme that uses sampling designs such as stratified systematic sampling to optimally split data into training and testing subsets. This procedure will help machine learning algorithms avoid the classical mistake of overfitting. While it might be slightly more computationally expensive, it makes up for this apparent weakness by yielding a better estimate of test error and improved prediction performance. We provide computational evidence for the benefits of the proposed sampling designs over the traditional approach of simple random splitting of the data into training and testing subsets. We also present an example showing how using simple random sampling to partition data can distort the relationship between important covariates and the variable of interest in the test dataset.
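
The abstract does not spell out the design, so the following is only a minimal sketch of one way a stratified systematic split could look, not the authors' procedure. It assumes the strata are formed by ranking observations on the response and cutting the ranking into equal-sized blocks, and that every k-th ranked observation in each block goes to the test set after a random start. The function name `stratified_systematic_split` and its parameters (`test_fraction`, `n_strata`, `seed`) are hypothetical.

```python
import numpy as np


def stratified_systematic_split(y, test_fraction=0.2, n_strata=4, seed=0):
    """Return (train_idx, test_idx) as sorted integer index arrays.

    Assumption for illustration: strata are contiguous blocks of the data
    ranked by the response y; this choice is not taken from the abstract.
    """
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    step = int(round(1.0 / test_fraction))  # sampling interval k
    order = np.argsort(y, kind="stable")    # rank observations by the response

    test_parts = []
    for stratum in np.array_split(order, n_strata):
        if len(stratum) == 0:
            continue
        start = rng.integers(step)               # random start in [0, step)
        test_parts.append(stratum[start::step])  # every k-th ranked unit

    test_idx = np.sort(np.concatenate(test_parts))
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx


# Usage on synthetic data: the test set covers the whole range of y
# instead of depending on the luck of a simple random draw.
y = np.random.default_rng(1).normal(size=100)
train_idx, test_idx = stratified_systematic_split(y, test_fraction=0.2)
```

Because each stratum contributes test units at a fixed interval along the ranking, the test subset tracks the distribution of the response more closely than a single simple random split would. Whether this matches the designs evaluated in the talk is an assumption.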