Online Program

Friday, May 31
Machine Learning
Machine Learning E-Posters, I
Fri, May 31, 9:45 AM - 10:45 AM
Grand Ballroom Foyer

Overcoming Big Data: Linking the 2014 National Hospital Care Survey to the 2014/2015 Medicare CMS Master Beneficiary Summary File (306352)

*Scott Robert Campbell, National Opinion Research Center at University of Chicago 
Lisa B Mirel, National Center for Health Statistics 
Dean Resnick, National Opinion Research Center at University of Chicago 

Keywords: Record Linkage, Machine Learning, Parallel Processing, Type I Error, Type II Error

Record linkage enables survey data to be linked to other data sources, expanding the analytic potential of both the survey and the administrative data. However, depending on the number of records being linked, the processing time can be prohibitive. A recent project at the National Center for Health Statistics linked patient records from the 2014 National Hospital Care Survey to the 2014/2015 Centers for Medicare & Medicaid Services’ Master Beneficiary Summary File; because of the size of the two sources, a new linkage method was needed. A record linkage algorithm based on the Fellegi-Sunter paradigm and incorporating machine learning techniques was developed.

The algorithm followed a highly structured flow and called upon several techniques to improve efficiency while maintaining the integrity of the linkage. One such technique is parallel processing built on a flexible, modular coding scheme. Additional efficiency was gained by optimizing the work flow required by the record linkage blocking scheme using a machine learning approach known as the sequential coverage algorithm (SCA). Using a “truth source” created from a deterministic linkage on exact Social Security Number (SSN) agreement, the SCA reduced the number of candidate pairs requiring evaluation while retaining a high percentage of true positive matches.
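The idea of choosing blocking passes against a truth source can be illustrated with a greedy covering loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function names (`candidate_pairs`, `select_blocking_passes`), the scoring rule (share of a pass's pairs that are still-uncovered true matches), the 95% coverage target, and all records are hypothetical and invented for illustration.

```python
from collections import defaultdict


def candidate_pairs(recs_a, recs_b, key_fields):
    """Return the (id_a, id_b) pairs that agree exactly on every field in key_fields."""
    index = defaultdict(list)
    for rec in recs_b:
        index[tuple(rec[f] for f in key_fields)].append(rec["id"])
    pairs = set()
    for rec in recs_a:
        for id_b in index.get(tuple(rec[f] for f in key_fields), []):
            pairs.add((rec["id"], id_b))
    return pairs


def select_blocking_passes(recs_a, recs_b, passes, truth_pairs, target_coverage=0.95):
    """Greedily add blocking passes until the target share of truth pairs is covered.

    Each round picks the pass whose generated pairs contain the highest share
    of still-uncovered truth pairs (ties broken by the absolute gain).
    """
    generated = {name: candidate_pairs(recs_a, recs_b, fields)
                 for name, fields in passes.items()}
    uncovered = set(truth_pairs)
    chosen = []
    needed = target_coverage * len(truth_pairs)
    while len(truth_pairs) - len(uncovered) < needed:
        best, best_score = None, (0.0, 0)
        for name, pairs in generated.items():
            gain = len(pairs & uncovered)
            if name in chosen or gain == 0:
                continue
            score = (gain / len(pairs), gain)
            if score > best_score:
                best, best_score = name, score
        if best is None:  # no remaining pass adds coverage
            break
        chosen.append(best)
        uncovered -= generated[best]
    coverage = 1 - len(uncovered) / len(truth_pairs) if truth_pairs else 0.0
    return chosen, coverage


# Toy records, invented purely for illustration.
recs_a = [
    {"id": "a1", "ssn": "111", "dob": "1950-01-01", "zip": "60601", "last": "SMITH"},
    {"id": "a2", "ssn": "222", "dob": "1948-05-12", "zip": "60602", "last": "JONES"},
    {"id": "a3", "ssn": "333", "dob": "1939-09-30", "zip": "60603", "last": "BROWN"},
]
recs_b = [
    {"id": "b1", "ssn": "111", "dob": "1950-01-01", "zip": "60601", "last": "SMITH"},
    {"id": "b2", "ssn": "222", "dob": "1948-05-12", "zip": "60699", "last": "JONES"},
    {"id": "b3", "ssn": "333", "dob": "1939-09-30", "zip": "60603", "last": "BRWON"},
    {"id": "b4", "ssn": "444", "dob": "1950-01-01", "zip": "60601", "last": "SMYTHE"},
]

# Truth source: deterministic linkage on exact SSN agreement.
truth = candidate_pairs(recs_a, recs_b, ["ssn"])
passes = {"dob": ["dob"], "zip+last": ["zip", "last"], "last": ["last"]}
chosen, coverage = select_blocking_passes(recs_a, recs_b, passes, truth)
print(chosen, coverage)
```

In this toy setup the loop stops as soon as the selected passes recover the target share of SSN-confirmed pairs, so passes that add no new true matches are never run.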

Pairs generated by the optimized work flow were then evaluated by summing agreement pattern weights, which were computed as a function of agreement/non-agreement probabilities. A logistic regression model, using SSN agreement as a proxy for match validity, was used to estimate probabilities of linkage validity according to the summed agreement (pair) weights. Finally, pairs were selected as links when they met a probability cutoff that was optimized to minimize the estimated type I and type II error. This presentation will outline the steps used in the algorithm and show results of the linkage process.
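The scoring and cutoff steps described above can be sketched as follows. This is a hedged illustration, not the authors' code: it uses the standard Fellegi-Sunter log-ratio weights, assumes the validity probabilities have already been estimated (for example by the logistic model), and picks the cutoff that minimizes the count of false links plus missed links against the SSN proxy. The function names, the m- and u-probabilities, the candidate cutoffs, and the scored pairs are all hypothetical.

```python
import math


def agreement_weight(m, u, agrees):
    """Fellegi-Sunter weight for one field.

    m = P(field agrees | pair is a match), u = P(field agrees | pair is a non-match).
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def pair_weight(pattern, m_probs, u_probs):
    """Sum the field weights over an agreement pattern {field: True/False}."""
    return sum(agreement_weight(m_probs[f], u_probs[f], agrees)
               for f, agrees in pattern.items())


def best_cutoff(scored_pairs, cutoffs):
    """Return (cutoff, errors) minimizing false links + missed links.

    scored_pairs holds (estimated_probability, ssn_agrees) tuples, with SSN
    agreement standing in for true match status.
    """
    best = None
    for c in cutoffs:
        false_links = sum(1 for p, ok in scored_pairs if p >= c and not ok)
        missed_links = sum(1 for p, ok in scored_pairs if p < c and ok)
        total = false_links + missed_links
        if best is None or total < best[1]:
            best = (c, total)
    return best


# Hypothetical m/u probabilities and one agreement pattern.
m_probs = {"last": 0.9, "dob": 0.95}
u_probs = {"last": 0.1, "dob": 0.05}
w = pair_weight({"last": True, "dob": False}, m_probs, u_probs)

# Hypothetical (probability, SSN agrees) pairs.
scored = [(0.95, True), (0.80, True), (0.60, False),
          (0.55, True), (0.52, True), (0.30, False), (0.10, False)]
cutoff, errors = best_cutoff(scored, [0.25, 0.5, 0.75])
print(round(w, 3), cutoff, errors)
```

Weighting false links and missed links equally is only one possible criterion; a different relative cost for the two error types would shift the selected cutoff.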