Abstract:
|
Prediction problems involving data with complex structure arise in a wide range of fields, e.g., disease prediction using electronic health record (EHR) data, hospital readmission prediction using administrative data, and marketing research using social media data. Machine learning (ML) tools, such as random forests or neural networks, are widely used to address these types of problems. However, data collection, data processing, domain knowledge, and interpretability are just as important as the ML tools themselves. We present a framework that combines ML and regression modeling techniques to address classification and prediction problems. First, we construct content-interpretable features (or variables) using raw data from multiple sources and domain knowledge. Second, we classify participants into clusters based on the estimated probabilities from the ML tools. Lastly, we perform regression modeling to predict the outcome of interest and to understand important factors or patterns in the combination of constructed features and classification results. Mental health data from HIV-impacted families were used to demonstrate the proposed approach.
|