Keywords: systematic sampling; data splitting; prediction
In this study we propose a new scheme that uses sampling designs, such as stratified systematic sampling, to optimally split data into training and testing subsets. This procedure helps machine learning algorithms avoid the classical mistake of overfitting. Although it may be slightly more computationally expensive than simple random splitting, it compensates for this apparent weakness by providing a better estimate of test error and improving prediction performance. We provide computational evidence supporting the benefits of the proposed sampling designs over the traditional approach of simple random splitting of the data into training and testing subsets. We also present an example showing how using simple random sampling to partition data can distort the relationship between important covariates and the variable of interest in the test dataset.
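As a rough illustration of the idea, the following is a minimal sketch of one way a stratified systematic split could be carried out. It is not the exact procedure of this study: it assumes that strata are formed as equal-sized blocks of the observations ordered by the response, and that within each stratum every `test_every`-th observation, counted from a random start, goes to the test set. The function name `stratified_systematic_split` and its parameters are hypothetical.

```python
import random


def stratified_systematic_split(y, n_strata=4, test_every=5, seed=0):
    """Split indices of y into train/test by stratified systematic sampling.

    Assumptions (illustrative only): strata are contiguous blocks of the
    observations sorted by y, and the sampling interval is fixed.
    """
    rng = random.Random(seed)
    n = len(y)
    # Order observations by the response so each stratum covers a range of y.
    order = sorted(range(n), key=lambda i: y[i])

    test_idx = []
    for s in range(n_strata):
        stratum = order[s * n // n_strata:(s + 1) * n // n_strata]
        if not stratum:
            continue
        # Random start, then every test_every-th unit within the stratum.
        start = rng.randrange(min(test_every, len(stratum)))
        test_idx.extend(stratum[start::test_every])

    test_set = set(test_idx)
    train_idx = [i for i in range(n) if i not in test_set]
    return train_idx, sorted(test_idx)


# Toy response values; any numeric sequence would do.
y = [float(i) ** 0.5 for i in range(100)]
train_idx, test_idx = stratified_systematic_split(
    y, n_strata=4, test_every=5, seed=1
)
```

Because the test units are spread evenly across the ordered response within every stratum, the test set in this sketch covers the full range of the variable of interest, which a single simple random draw does not guarantee.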