681 – Smoothing Splines and Applications
Testing Record Linkage Production Data Quality
K. Bradley Paxton
ADI, LLC
Record linkage is used to find common entities (e.g., persons, households, or businesses) between pairs of data records in disparate data files. Once these links are found, an improved data set may be obtained by merging the matched entity data. This improved data set can then be used for the appropriate business purpose or examined further by "data mining." If, however, the record linkage is done poorly, the "improved" data set might actually be worse than before. Testing the production output data quality of record linkage systems is very difficult; most practitioners find it so difficult that they barely do it at all. As a result, many practitioners of record linkage do not know precisely how well their system actually works, much less how to make it better. In this paper, we outline a way to use automation to enable the efficient measurement of record linkage data quality, in production or in development testing, using "real" data. We call our automated testing approach RLPDQ, which stands for Record Linkage Production Data Quality; it is an extension of the PDQ system that was used successfully in the 2010 Census to measure data capture quality in forms processing.