Regency EF
De-Duplication Strategies in Mobile Health Clinical Studies (304051)
*Ariadna Garcia, Stanford University
Justin Lee, Stanford University Quantitative Sciences Unit
*Vidhya Balasubramanian, Stanford University Quantitative Sciences Unit
Santosh E Gummidipundi, Stanford University
Ken W Mahaffey, Stanford University
Marco Perez, Stanford University
Mintu Turakhia, Stanford University
*Haley Hedlin, Stanford University Quantitative Sciences Unit
Manisha Desai, Stanford University
Keywords: mobile health, de-duplication, probabilistic matching
Data collected at onboarding in a large, pragmatic digital health study can provide sufficient information to implement a probabilistic matching algorithm that determines whether multiple app-generated IDs belong to the same person. Using our motivating digital health study (MDHS), we developed a matching approach coupled with manual assessment for validation. We calculated similarity scores (SC) that reflected the string distance across 7 identifiers; a value of 0 denoted identical identifiers, and higher values denoted more dissimilar pairs. First, pairs in a randomly sampled subset (RSS) were manually classified as similar or dissimilar to determine the optimal SC cutpoint for distinguishing true from false matches. Second, using ROC-based methods, a validation process was performed on a different RSS to assess the accuracy of the matching. We illustrated our approach in our MDHS, where we de-duplicated over 500,000 records corresponding to over 400,000 participants. Based on our algorithm, a true unique participant ID was generated, allowing us to link participant-specific study data across multiple data sources with over 96% accuracy.
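
The scoring and cutpoint steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the string distance, the 7 identifiers, or the cutpoint criterion, so the sketch assumes Levenshtein edit distance summed over whatever identifier fields are supplied, and it assumes Youden's J as the rule for picking the cutpoint. All function names are hypothetical.

```python
from typing import Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity_score(rec_a: Sequence[str], rec_b: Sequence[str]) -> int:
    """Sum of per-field distances (assumed aggregation); 0 means identical."""
    return sum(
        levenshtein(x.strip().lower(), y.strip().lower())
        for x, y in zip(rec_a, rec_b)
    )


def best_cutpoint(scores: Sequence[int], labels: Sequence[int]) -> Optional[int]:
    """Cutpoint maximizing Youden's J (assumed criterion).

    A pair is called a match when score <= cutpoint. `labels` holds the
    manual classification (1 = same person, 0 = different people) and must
    contain at least one of each class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_c, best_j = None, -1.0
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s <= c)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s <= c)
        j = tp / n_pos - fp / n_neg  # sensitivity + specificity - 1
        if j > best_j:
            best_c, best_j = c, j
    return best_c


def accuracy(scores: Sequence[int], labels: Sequence[int], cutpoint: int) -> float:
    """Share of pairs whose thresholded score agrees with the manual label."""
    correct = sum(
        1 for s, y in zip(scores, labels) if (s <= cutpoint) == bool(y)
    )
    return correct / len(labels)
```

In this sketch, `best_cutpoint` would be run on the manually classified pairs from the first random subset, and `accuracy` would then be evaluated with that fixed cutpoint on the second, independent subset, mirroring the two-stage design in the abstract.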