Abstract:
|
Much of the enterprise data collected today suffers from gaps in the quality of the information captured. Poor quality in the collection and governance of enterprise data limits the impact that statistical and machine learning models can have on improving business processes. A methodological toolkit that quantifies the amount of ‘data-contamination’ can guide specific improvements and recommendations for data collection, storage, and governance. It can also improve the relevance of the insights and predictions produced by models built with such data. This research begins by reviewing previous methodology that attempts to address this issue, along with example data that demonstrate different aspects of data contamination. A simulation study and examples from real-world data then demonstrate the proposed methodology and illustrate the improvements it can bring to predictive modeling, primarily the estimation of less biased relationships between predictors and model outcomes. Challenges and future directions are also discussed.
|