Abstract:
An important question in constructing cross-validation estimators of the generalization error is whether rules can be established that allow the optimal selection of the training-set size for a fixed sample size n. We define the resampling effectiveness of random CV estimators of the generalization error as the ratio of the limiting value of the variance of the CV estimator to the variance estimated from the data. Because the variance and the covariance of the different average test-set errors do not depend on their indices, the resampling effectiveness depends only on the correlation and on the number of repetitions used in the random CV estimator. We discuss statistical rules for defining optimality and obtain the "optimal" training sample size as the solution of an appropriately formulated optimization problem. We show that for a broad class of smooth loss functions, and in particular for the q-class of loss functions, when the decision rule is the sample mean, the problem of obtaining the optimal training sample size has a general solution that is independent of the data distribution. The analysis offered for the case where the decision rule is regression illustrates the complexity of the problem.
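The abstract's notion of resampling effectiveness can be illustrated with a minimal sketch. The code below assumes (this is an illustrative assumption, not the paper's full derivation) that the J average test-set errors produced by repeated random train/test splits are equicorrelated with common correlation rho; under that assumption the variance of their average is sigma^2 * (rho + (1 - rho)/J), its limit as J grows is rho * sigma^2, and the effectiveness ratio reduces to rho / (rho + (1 - rho)/J). The function names `split_errors` and `effectiveness` are hypothetical helpers introduced here for illustration; the decision rule is the sample mean with squared-error loss, as in the abstract.

```python
import random
import statistics

def split_errors(data, n_train, n_splits, rng):
    """Average test-set squared error of the sample-mean decision rule
    over n_splits random train/test partitions of the data."""
    errs = []
    for _ in range(n_splits):
        idx = list(range(len(data)))
        rng.shuffle(idx)
        train = [data[i] for i in idx[:n_train]]
        test = [data[i] for i in idx[n_train:]]
        mu_hat = statistics.fmean(train)  # decision rule: sample mean
        errs.append(statistics.fmean((x - mu_hat) ** 2 for x in test))
    return errs

def effectiveness(rho, j):
    """Resampling effectiveness under the equicorrelation assumption:
    ratio of the limiting variance (j -> infinity) of the averaged CV
    estimator to its variance at j repetitions.  Depends only on the
    correlation rho and the number of repetitions j."""
    return rho / (rho + (1.0 - rho) / j)

if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(0.0, 1.0) for _ in range(100)]
    errs = split_errors(data, n_train=70, n_splits=20, rng=rng)
    print("mean split error:", statistics.fmean(errs))
    print("effectiveness at rho=0.5, J=10:", effectiveness(0.5, 10))
```

As the sketch shows, effectiveness increases monotonically in J and approaches 1 as J grows, so extra repetitions buy less and less variance reduction once J * rho is large relative to 1 - rho.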
Copyright © American Statistical Association.