Abstract:
|
Cross-validation (CV) methods are widely used to estimate out-of-sample prediction error. In big data problems, CV is computationally expensive, which makes analytic formulas attractive alternatives. If parameter estimation takes T seconds for a data set with N records, leave-one-out CV takes roughly TN seconds because the model must be refit N times. Linhart and Volkers (1984; see also Linhart and Zucchini, 1986) showed that, for a large class of smooth empirical risk functions, a particular large-sample analytic estimator of out-of-sample prediction error is an unbiased estimator of the CV estimation error, which reduces the estimation time from TN seconds to T seconds. This theoretical result extends the Takeuchi Information Criterion (Takeuchi, 1976) and the Akaike Information Criterion (Akaike, 1973). We provide easily verifiable assumptions under which this theoretical result holds. In addition, we report empirical results for logistic regression modeling showing that the mean relative deviation between a nonparametric bootstrap CV estimator and the analytic out-of-sample prediction error estimator was less than 0.3% for three data sets with respective sample sizes of n = 583, n = 1728, and n = 4898 records.
|
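For orientation, the two estimators contrasted in the abstract can be sketched as follows. This is the generic Takeuchi-type form, given only as an illustration: the symbols $c$, $\hat{\ell}_n$, $\hat{\theta}_n$, $\hat{\theta}_{-i}$, $\hat{A}_n$, $\hat{B}_n$, and $q$ are notation assumed here, not taken from the abstract, and the exact estimator and regularity conditions studied in the paper may differ in detail.

```latex
% Empirical risk for records x_1, ..., x_n with per-record loss c (assumed notation)
\hat{\ell}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} c(x_i, \theta),
\qquad
\hat{\theta}_n = \arg\min_{\theta} \hat{\ell}_n(\theta)

% Leave-one-out CV: n separate refits, each with record i held out
\mathrm{CV}_{\mathrm{loo}} = \frac{1}{n} \sum_{i=1}^{n} c\bigl(x_i, \hat{\theta}_{-i}\bigr)

% Analytic (Takeuchi-type) estimator: a single fit plus a trace correction
\hat{\ell}_n(\hat{\theta}_n) + \frac{1}{n} \operatorname{tr}\bigl(\hat{A}_n^{-1} \hat{B}_n\bigr),
\qquad
\hat{A}_n = \nabla_{\theta}^{2} \hat{\ell}_n(\hat{\theta}_n),
\quad
\hat{B}_n = \frac{1}{n} \sum_{i=1}^{n}
  \nabla_{\theta} c(x_i, \hat{\theta}_n) \, \nabla_{\theta} c(x_i, \hat{\theta}_n)^{\top}
```

In this form, a correctly specified model with $q$ free parameters has $\hat{A}_n \approx \hat{B}_n$, so the correction term is approximately $q/n$, which is the Akaike-type penalty on the per-record scale.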