Online Program

Saturday, February 21
PS3 Poster Session 3 & Continental Breakfast Sat, Feb 21, 8:00 AM - 9:15 AM
Napoleon AB

Are You Really Who We Think You Are? Recognizing and Controlling Biases in Statistical Analyses of Linked Data (303027)

*Sigurd Wilson Hermansen, Westat 

Keywords: linkage, identifiers, probabilistic, deterministic, fuzzy, bias, selection, misclassification, duplication

In a new age of web scrapers and devices streaming Big Data, applied statisticians are looking more closely at the quality of linked data and at how person or entity linkage errors may bias the results of statistical analyses. As a consequence, we are finding data linkage biases. When analyzing linked data drawn from different databases, we now have to assess whether, for instance, an educational level or credit rating covariate that comes from a web database actually belongs to the same person or entity as the health or payment history outcome in our subject database. Similar concerns worry applied statisticians and data analysts across the whole spectrum of observational research and predictive modeling. Examples of data linkage biases, and of useful statistics for measuring them, lead into a quick review of best-practice methods for data linkage and integration, tracing, and “deduplication”. Guidelines for practice touch on software licensing questions and on ethical and legal obligations for disclosure control.