Abstract:
|
Missing data can be problematic because they may reduce the accuracy and reliability of statistics. Imputation creates values and/or units that fill in the missingness, in an effort to produce a dataset that is more representative of the population and concept of interest. Ideally, imputation methods would be informed by the nature of the missingness and developed using the data available. Unfortunately, imputation models are not always empirically tested, owing to large data volumes or timeliness constraints. Methodology, in collaboration with the Data Science Campus, investigated the use of supervised Machine Learning (ML) to carry out imputation using an automated, data-driven approach that would be faster than the current manual/multi-stage approach. The project used an ML library called XGBoost to impute missing values directly and compared the results with the standard approach. The presentation will cover the key concepts behind XGBoost and the findings from this program of work.
|