Abstract:
|
Data mining methods were used to develop models that predict freshman cohort retention from the initial fall term to the fall of year two. The models were developed using first-time, full-time fall freshman cohorts from fall 2011 through fall 2018. Three methods were employed: chi-squared automatic interaction detection (CHAID), classification and regression trees (CART), and gradient boosting. Unlike CART, which uses binary splits evaluated by misclassification measures, the CHAID algorithm uses the chi-squared test (or the F test for interval targets) to determine significant splits and to find the independent variables most strongly associated with the outcome. Each method was also repeated using random under-sampling, for a total of six models. Because the percentage of students leaving is on the order of 10%, the group of returning students was under-sampled so that the number of students who left and the number of students retained are equal in the modeled data. The results are compared to determine which method predicts retention most successfully and whether under-sampling improves the results.
|
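The random under-sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `undersample_majority`, the `retained` field, and the synthetic 1,000-student cohort are all assumptions introduced for illustration, and the roughly 10% attrition rate is simulated rather than taken from the study data.

```python
import random


def undersample_majority(records, label_key="retained", seed=0):
    """Randomly drop rows from the larger class until both classes are equal in size.

    `records` is a list of dicts; `label_key` names a boolean outcome field.
    Both names are hypothetical and chosen only for this sketch.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    left = [r for r in records if not r[label_key]]
    stayed = [r for r in records if r[label_key]]

    # Identify the smaller (minority) and larger (majority) class.
    minority, majority = sorted((left, stayed), key=len)

    # Keep every minority row and an equally sized random subset of the majority.
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


# Synthetic cohort (assumption): 1,000 students, 10% of whom left.
cohort = [{"student_id": i, "retained": i % 10 != 0} for i in range(1000)]

balanced = undersample_majority(cohort)
n_left = sum(not r["retained"] for r in balanced)
n_retained = sum(r["retained"] for r in balanced)
print(len(balanced), n_left, n_retained)
```

In this sketch every student who left is kept and only the retained group is thinned, so the balanced set is much smaller than the original cohort. That loss of majority-class rows is the usual trade-off of under-sampling.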