Online Program

All Times EDT

Friday, September 25
Fri, Sep 25, 11:45 AM - 12:45 PM
Virtual
Poster Session

PS27-Jitter Trees: TF-IDF Similarity-Weighted Estimates for Unseen Categories and Missing Data (301137)

*Handong David Bang, University of North Carolina at Chapel Hill 

Keywords: missing data, imputation, machine learning, prediction

The problem of generalizability is a central issue in machine learning and statistical modeling. It is standard practice to draw the training and test sets from the same distribution so that models are robust in their estimates and predictions. However, there are several instances in which unseen data will deviate from the training data. Current methods for handling this issue, such as kNN imputation, are industry standards; however, these imputation methods fill in a single observation rather than an entire category or cluster. Our method addresses this issue with a novel technique for deriving estimates for unseen categories, which allows for demonstrable improvements in both prediction and causal inference. The method subsets the data into its categorical partitions and then uses information across and between categories, through a customized TF-IDF encoding, as similarity weights for the unseen category's estimate. In addition, we propose a new prediction model, Jitter Trees, in which a static decision tree can make predictions for an unseen category by incorporating the similarity weights in its decision path. This allows for flexibility in handling missing data and unseen categories when those observations are run through a prediction model.
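The similarity-weighting idea can be illustrated with a minimal sketch. The abstract does not specify the customized TF-IDF encoding, so everything below is an assumption made for illustration: each category is treated as a "document" whose "terms" are the values of the other features that co-occur with it, similarity is plain cosine similarity between TF-IDF vectors, and the unseen category's estimate is a similarity-weighted average of the known categories' target means. The function names, the toy rows, and the smoothed IDF formula are hypothetical and are not taken from the poster.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document name to a sparse {token: tf-idf weight} dict.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1; the poster's
    customized encoding is not described and may differ.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        vectors[name] = {
            tok: (cnt / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, cnt in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def estimate_unseen(rows, cat_col, target_col, unseen_tokens):
    """Similarity-weighted estimate of the target for an unseen category.

    Returns (estimate, weights), where weights maps each known category
    to its normalized similarity with the unseen category.
    """
    docs, targets = {}, {}
    for row in rows:
        cat = row[cat_col]
        tokens = [f"{k}={v}" for k, v in row.items()
                  if k not in (cat_col, target_col)]
        docs.setdefault(cat, []).extend(tokens)
        targets.setdefault(cat, []).append(row[target_col])

    unseen_key = "__unseen__"  # hypothetical placeholder name
    docs[unseen_key] = list(unseen_tokens)
    vecs = tfidf_vectors(docs)

    sims = {cat: cosine(vecs[unseen_key], vecs[cat]) for cat in targets}
    total = sum(sims.values())
    if total == 0:
        # No overlap with any known category: fall back to equal weights.
        weights = {cat: 1.0 / len(targets) for cat in targets}
    else:
        weights = {cat: s / total for cat, s in sims.items()}

    means = {cat: sum(ys) / len(ys) for cat, ys in targets.items()}
    estimate = sum(weights[cat] * means[cat] for cat in means)
    return estimate, weights


# Toy, made-up rows in the spirit of the Titanic data (not real records).
rows = [
    {"port": "S", "sex": "male", "pclass": "3", "survived": 0},
    {"port": "S", "sex": "male", "pclass": "3", "survived": 0},
    {"port": "S", "sex": "female", "pclass": "3", "survived": 1},
    {"port": "C", "sex": "female", "pclass": "1", "survived": 1},
    {"port": "C", "sex": "female", "pclass": "1", "survived": 1},
    {"port": "C", "sex": "male", "pclass": "1", "survived": 0},
]
# Feature values observed alongside a port that never appeared in training.
unseen_tokens = ["sex=female", "pclass=1", "sex=female", "pclass=1"]
estimate, weights = estimate_unseen(rows, "port", "survived", unseen_tokens)
print(weights, estimate)
```

In this toy setup the unseen port shares most of its co-occurring feature values with port "C", so "C" receives the larger weight and the estimate falls closer to that category's mean. The estimate applies to the category as a whole, not to a single imputed observation, which is the distinction the abstract draws against kNN imputation.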

Results were compared against common imputation techniques such as kNN, Bayesian ridge, and simple imputation across models such as XGBoost, Random Forest, and LightGBM. In preliminary results, the method performed similarly to kNN, which we regard as an improvement because of the method's computational efficiency, aggregate inference, and nonparametric modeling. Data used to derive the preliminary results include the Titanic and Wake County Sudden Death data sets.