Online Program

All Times EDT

Thursday, June 4
Machine Learning
Software & Data Science Technologies
Machine Learning and Software and Data Science Technologies Posters
Thu, Jun 4, 2:00 PM - 5:00 PM
TBD
 

TF-IDF-Weighted Similarity Estimates for Unseen Categories (308477)

*Handong David Bang, UNC Chapel Hill Department of Biostatistics 
Feng-Chang Lin, UNC Chapel Hill 

Keywords: Machine Learning, Imputation, Prediction

Generalizability is a persistent problem in machine learning and statistical modeling. It is standard practice to draw the training and test sets from the same distribution so that a model's estimates and predictions are robust. In many settings, however, unseen data deviate from the training data, for example through a new category or a distribution shift in a feature. Current methods for handling this issue, such as kNN imputation as a preprocessing step, are industry standards. However, these imputation methods fill in values for a single observation rather than for an entire category or cluster. Our method addresses this gap with a novel technique for deriving estimates for unseen categories of a categorical variable, which allows for demonstrable improvements in both prediction and causal inference. The method partitions the data by category and then uses information across and between categories, through a customized TF-IDF encoding, as similarity weights for estimating the target statistic of the unseen category. This approach provides flexibility in handling missing data and unseen categories when those observations are run through a prediction model.
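The abstract does not spell out the customized TF-IDF encoding, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that each category is treated as a "document" whose tokens are the feature=value pairs of its observations, that IDF is smoothed, that similarity is cosine similarity, and that the target statistic is the category mean. The function names (tfidf_vectors, cosine, estimate_unseen) and the toy column names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """docs maps category -> list of tokens. Returns (tfidf vectors, idf table)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    # Assumption: smoothed IDF, so tokens seen in every category keep nonzero weight.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    vectors = {}
    for cat, tokens in docs.items():
        counts = Counter(tokens)
        vectors[cat] = {tok: (c / len(tokens)) * idf[tok] for tok, c in counts.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def estimate_unseen(train_rows, unseen_rows, cat_key, target_key):
    """Similarity-weighted mean of the target for a category absent from training."""
    docs, targets = defaultdict(list), defaultdict(list)
    for row in train_rows:
        cat = row[cat_key]
        # Partition by category; each feature=value pair becomes a token.
        docs[cat].extend(f"{k}={v}" for k, v in row.items() if k not in (cat_key, target_key))
        targets[cat].append(row[target_key])
    vectors, idf = tfidf_vectors(docs)
    means = {cat: sum(vals) / len(vals) for cat, vals in targets.items()}
    overall = sum(r[target_key] for r in train_rows) / len(train_rows)

    # Encode the unseen category with the IDF learned from the seen categories;
    # tokens never seen in training carry no information and are dropped.
    tokens = [f"{k}={v}" for row in unseen_rows for k, v in row.items()
              if k not in (cat_key, target_key)]
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return overall  # Assumption: fall back to the overall mean.
    unseen_vec = {tok: (c / total) * idf[tok] for tok, c in counts.items()}

    sims = {cat: cosine(unseen_vec, vec) for cat, vec in vectors.items()}
    sim_sum = sum(sims.values())
    if sim_sum == 0.0:
        return overall
    return sum((sims[cat] / sim_sum) * means[cat] for cat in sims)


# Toy data (made up for illustration; not from the studies named in the abstract).
train = [
    {"group": "A", "sex": "f", "cls": 1, "y": 1},
    {"group": "A", "sex": "f", "cls": 1, "y": 1},
    {"group": "B", "sex": "m", "cls": 3, "y": 0},
    {"group": "B", "sex": "m", "cls": 3, "y": 0},
]
new_rows = [{"group": "C", "sex": "f", "cls": 3}]
estimate = estimate_unseen(train, new_rows, cat_key="group", target_key="y")
```

In this sketch the estimate is produced once for the whole unseen category rather than row by row, which is the contrast with per-observation imputation drawn above. Other statistics (a median, a model coefficient) could replace the category mean in the final weighted sum.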

The method was compared against common imputation techniques such as kNN and Bayesian ridge [11] in simulation and case studies. In preliminary results, it performed as well as, if not better than, kNN, which represents an improvement given its computational efficiency, aggregated inference, and nonparametric modeling. The preliminary results were derived from data including the Titanic and Wake County Sudden Death data.