Keywords: Machine Learning, Imputation, Prediction
Generalizability is a central problem in machine learning and statistical modeling. It is standard practice for the training and testing sets to come from the same distribution so that models are robust in their estimates and predictions. However, there are many instances in which unseen data deviate from the training data, for example, a new category or a distribution shift in a feature. Current methods for handling this issue, such as kNN imputation as a preprocessing step, are industry standards. However, these imputation methods impute a single observation at a time rather than an entire category or cluster. Our method addresses this gap with a novel technique for deriving estimates for unseen categories of a categorical variable, yielding demonstrable improvements in both prediction and causal inference. The method subsets the data into its categorical partitions and then pools information across categories through a customized TF-IDF encoding, whose similarities serve as weights for the target statistic estimate of the unseen category. This approach allows for flexibility in handling missing data and unseen categories when those observations are passed through a prediction model.
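The similarity-weighting step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the cosine-similarity choice, and the use of a precomputed per-category encoding are all assumptions made for the sake of a runnable example; the paper's actual TF-IDF encoding and estimator may differ.

```python
import numpy as np

def estimate_unseen_target(cat_features, cat_targets, unseen_features):
    """Estimate the target statistic for an unseen category as a
    similarity-weighted average over the known categories.

    cat_features: dict mapping category name -> encoding vector
                  (e.g. a TF-IDF-style representation of that category)
    cat_targets:  dict mapping category name -> target statistic (e.g. mean)
    unseen_features: encoding vector for the unseen category
    """
    names = list(cat_features)
    X = np.array([cat_features[n] for n in names], dtype=float)
    y = np.array([cat_targets[n] for n in names], dtype=float)
    q = np.asarray(unseen_features, dtype=float)

    # Cosine similarity between the unseen category and each known category.
    sims = X @ q / (np.linalg.norm(X, axis=1) * np.linalg.norm(q) + 1e-12)
    weights = np.clip(sims, 0.0, None)  # ignore negative similarities

    if weights.sum() == 0:
        return float(y.mean())  # fall back to the global target mean
    return float(weights @ y / weights.sum())
```

For instance, an unseen category equally similar to two known categories with target means 10 and 20 receives the estimate 15, while one identical to a single known category inherits that category's target statistic.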
Results were compared against common imputation techniques such as kNN and Bayesian ridge [11] across simulation and case studies. In preliminary results, the method performed at least as well as kNN, and its computational efficiency, aggregated inference, and nonparametric modeling make it preferable. Data used to derive these preliminary results include the Titanic dataset and the Wake County Sudden Death data.