
Abstract Details

Activity Number: 433 - SPEED: Applications of Advanced Statistical Techniques in Complex Survey Data Analysis: Small Area Estimation, Propensity Scores, Multilevel Models, and More
Type: Contributed
Date/Time: Tuesday, July 31, 2018, 2:00 PM to 2:45 PM
Sponsor: Survey Research Methods Section
Abstract #332916
Title: Machine Learning to Evaluate the Quality of Patient Reported Epidemiological Data
Author(s): Robert L. Wood* and Futoshi Yumoto and Rochelle Tractenberg
Companies: Resonate and Wichita State University; Resonate; Georgetown University
Keywords: data quality; machine learning; epidemiologic data set; decision making; fraud detection score; FDS
Abstract:

Patient-reported epidemiological data are becoming more widely available. One such new dataset comes from the Fox Insight (FI) project, launched in 2017 to encourage the study of Parkinson's disease, with public release planned for 2019. Early analyses of responses from the earliest participants suggest significant fatigue effects on items that occur later in the surveys. These trends point to potential violations of the missing at random (MAR) and missing completely at random (MCAR) assumptions, which can limit the inferences that might otherwise be drawn from these data. Here we discuss a machine learning approach for evaluating the likelihood that an individual respondent is "doing their best" vs. not. Bayesian network structural learning was used to identify the network structure, and data quality scores (DQS) were estimated and analyzed within and across each section of a set of seven patient-reported instruments. The proportion of respondents whose DQS fell below a cutoff (threshold) for data unacceptably or unexpectedly similar to random responses ranged from a low of 13% to a high of 66%. Our results suggest that the method is not unduly influenced by instrument length or internal consistency. The method can be used to detect and quantify nonresponse bias, if it exists, and to plan or choose a method of addressing it in any dataset an investigator may choose, including the FI dataset once it is made available. It can also be used to diagnose challenges in one's own dataset, possibly arising from a misalignment of patient and investigator perspectives on the relevance or resonance of the data being collected.
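
The sketch below is a minimal, hypothetical stand-in for the scoring idea described above, not the authors' implementation: in place of full Bayesian network structural learning, it scores each respondent by the mean log-likelihood of their answers under item-wise empirical response distributions and flags respondents whose scores fall at or below the expected score of a uniform random responder. The function names (dqs, random_responder_baseline), the toy data, and the threshold choice are illustrative assumptions; the abstract does not specify the DQS formula.

    # Simplified, illustrative data-quality scoring. The abstract's method
    # learns a Bayesian network structure; here we use item-wise empirical
    # marginals as a stand-in. All names and thresholds are hypothetical.
    import numpy as np
    import pandas as pd

    def dqs(responses: pd.DataFrame) -> pd.Series:
        """Per-respondent mean log-likelihood under item-wise empirical
        response distributions (higher = more typical of the sample)."""
        loglik = pd.Series(0.0, index=responses.index)
        for col in responses.columns:
            probs = responses[col].value_counts(normalize=True)
            loglik += responses[col].map(np.log(probs))
        return loglik / len(responses.columns)

    def random_responder_baseline(responses: pd.DataFrame) -> float:
        """Expected per-item log-likelihood for a respondent who picks
        each observed option uniformly at random."""
        per_item = []
        for col in responses.columns:
            probs = responses[col].value_counts(normalize=True)
            per_item.append(np.mean(np.log(probs)))
        return float(np.mean(per_item))

    # Toy data: 200 "engaged" respondents with a skewed response pattern,
    # plus 50 uniform random responders, across 7 items.
    rng = np.random.default_rng(0)
    engaged = pd.DataFrame(rng.choice([0, 1, 2], p=[0.7, 0.2, 0.1], size=(200, 7)))
    random_like = pd.DataFrame(rng.integers(0, 3, size=(50, 7)))
    data = pd.concat([engaged, random_like], ignore_index=True)

    scores = dqs(data)
    cutoff = random_responder_baseline(data)  # one possible threshold choice
    flagged = scores <= cutoff
    print(f"Flagged as random-like: {flagged.mean():.0%} of respondents")

Respondents at or below the baseline score no better under this model than uniform guessing. With a learned network structure, the item marginals would be replaced by each node's conditional distribution given its parents, which is what would make the score sensitive to internally inconsistent response patterns rather than merely rare answers.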


Authors who are presenting talks have a * after their name.
