All Times EDT

Abstract Details

Activity Number: 216 - Contributed Poster Presentations: Section on Statistics and Data Science Education
Type: Contributed
Date/Time: Tuesday, August 4, 2020 : 10:00 AM to 2:00 PM
Sponsor: Section on Statistics and Data Science Education
Abstract #310922
Title: Using Data Mining Methods to Model the One-Year Retention of First-Time Full-Time Freshmen Cohorts
Author(s): Nora Galambos*
Keywords: CART; CHAID; gradient boosting; undersampling

Data mining methods were used to develop models that predict freshman cohort retention from the initial fall term to the fall of year two. The models were developed on first-time full-time fall freshmen cohorts from fall 2011 through fall 2018. Three methods were employed: chi-squared automatic interaction detection (CHAID), classification and regression trees (CART), and gradient boosting. Unlike CART, which uses binary splits evaluated by misclassification measures, the CHAID algorithm uses the chi-squared test (or the F test for interval targets) to determine significant splits and to find the independent variables with the strongest association with the outcome. Each method was also repeated using random under-sampling, for a total of six models. Since the percentage of students leaving is on the order of 10%, the group of returning students was under-sampled so that the number of students who left and the number of students retained are equal in the modeled data. The results are compared to determine which method is most successful in predicting retention and whether under-sampling improves the results.
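
The under-sampling step described above can be illustrated with a minimal sketch. It is not the author's code: the function name `undersample_majority`, the `retained` field, and the synthetic cohort with a 10% leaving rate are all assumptions made for illustration. It shows only the balancing idea, in which returning students are randomly dropped until their count equals the number of students who left.

```python
import random


def undersample_majority(records, label_key="retained", seed=0):
    """Randomly drop retained students until both classes are the same size.

    Hypothetical helper: `records` is a list of dicts, and `label_key`
    names a boolean field that is True when the student returned.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    left = [r for r in records if not r[label_key]]
    stayed = [r for r in records if r[label_key]]

    # Assumes leavers are the minority class, as in the abstract (about 10%).
    if len(left) > len(stayed):
        raise ValueError("expected students who left to be the minority class")

    # Sample without replacement so no retained student appears twice.
    sampled_stayed = rng.sample(stayed, k=len(left))

    balanced = left + sampled_stayed
    rng.shuffle(balanced)  # avoid ordering the modeled data by class
    return balanced


# Synthetic cohort (illustrative only): every tenth student leaves.
cohort = [{"student_id": i, "retained": i % 10 != 0} for i in range(1000)]

balanced = undersample_majority(cohort)
n_left = sum(not r["retained"] for r in balanced)
n_retained = sum(r["retained"] for r in balanced)
print(n_left, n_retained)
```

In a setup like this, the balanced set would typically be used only for fitting, with evaluation done on data that keeps the original class proportions; the abstract does not state how the six models were evaluated.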

Authors who are presenting talks have a * after their name.

Back to the full JSM 2020 program