Keywords: Predictive Analytics, Electronic Health Record, Machine Learning, Imputation, Feature Selection, Artificial Intelligence, Precision Medicine
The growing use of electronic health records (EHRs) has generated massive amounts of patient data. Predictive analytics built on machine learning techniques is essential for extracting insights into clinically relevant outcomes from historical EHR data. In this abstract, we present a predictive analytics pipeline that processes EHR data and is then implemented for risk assessment in new patients. Step 1: Data preprocessing. The input features from phenotypic and genomic resources are standardized. Features with substantial missing rates are flagged with a binary indicator to represent missingness. Features with low missing rates are imputed using multiple imputation by chained equations (MICE). Step 2: Feature selection. Filter, wrapper, and embedded methods are used to determine the optimal feature subset, which replaces the original feature set in the subsequent modeling steps. Step 3: Model evaluation. For the classification algorithms (Logistic Regression, Support Vector Machine, Random Forest, and eXtreme Gradient Boosting), the hyper-parameters are tuned by cross-validated search over a predefined number of parameter settings. Model performance is evaluated using the area under the receiver operating characteristic curve (AUC) and confusion matrix statistics. The optimal probability cut-off is determined from the trade-off between false positives and false negatives. Step 4: Model implementation. To risk-stratify patients in real time using the optimal classification algorithm, the selected features are monitored and assessed for early identification of clinical outcomes. The pipeline we developed can stratify patients into different tiers of risk, facilitate early identification of high-risk patients, and help optimize care delivery in large health care systems. Future work should validate the prediction algorithms externally, so that they are portable and generalizable enough to be widely adopted in diverse clinical settings.
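Two of the steps above lend themselves to a small illustration: flagging missingness in Step 1 and choosing the probability cut-off in Step 3. The sketch below is a minimal, standard-library-only example and not the implementation used in this work. The function names (`missing_indicator`, `confusion_counts`, `best_cutoff`), the 0.3 missing-rate threshold, and the cost weights are all hypothetical; the abstract does not state what counts as a "substantial" missing rate or how false positives and false negatives are weighted.

```python
from typing import List, Optional, Sequence, Tuple


def missing_indicator(
    column: Sequence[Optional[float]], threshold: float = 0.3
) -> Optional[List[int]]:
    """Return a 0/1 missingness flag per value if the column's missing
    rate is at least `threshold` (an assumed value); otherwise return
    None, meaning the column would go to imputation instead."""
    flags = [1 if value is None else 0 for value in column]
    rate = sum(flags) / len(flags)
    return flags if rate >= threshold else None


def confusion_counts(
    y_true: Sequence[int], scores: Sequence[float], cutoff: float
) -> Tuple[int, int, int, int]:
    """Count (TP, FP, TN, FN) when predicting positive for score >= cutoff."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= cutoff else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def best_cutoff(
    y_true: Sequence[int],
    scores: Sequence[float],
    fp_cost: float = 1.0,
    fn_cost: float = 1.0,
) -> float:
    """Pick the observed score that minimizes fp_cost*FP + fn_cost*FN.
    Ties resolve to the lowest candidate cut-off."""

    def total_cost(cutoff: float) -> float:
        _, fp, _, fn = confusion_counts(y_true, scores, cutoff)
        return fp_cost * fp + fn_cost * fn

    return min(sorted(set(scores)), key=total_cost)


if __name__ == "__main__":
    # Toy data for illustration only; these are not study results.
    labels = [0, 0, 1, 1]
    predicted_risk = [0.1, 0.4, 0.35, 0.8]
    print(missing_indicator([1.0, None, None, 4.0]))
    print(best_cutoff(labels, predicted_risk, fn_cost=5.0))
    print(best_cutoff(labels, predicted_risk, fp_cost=5.0))
```

Raising `fn_cost` relative to `fp_cost` pushes the cut-off lower, so fewer high-risk patients are missed at the price of more false alarms; raising `fp_cost` does the opposite. Which weighting is appropriate depends on the clinical outcome and is not specified here.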