Abstract:
Matching data records that correspond to the same real-world entity is a critical phase of many data analysis pipelines, particularly when datasets are assembled from multiple sources. Most existing record linkage techniques are too complicated for non-experts to implement on their own, while publicly available software tools work with only relatively small datasets, are restricted to specific domains like medical record deduplication, or are not actively maintained.
We present a framework that unifies the data matching tasks of deduplication, record linkage, and auto-tagging in a way that is intuitive and useful for novices, yet fully expressive and powerful for experts. We also describe a suite of algorithms that solve each task efficiently while leaving users free to specify and update domain-specific distance functions. Our data matching framework is implemented in the GraphLab Create machine learning library, whose out-of-core data structures and parallel computation allow very large data matching problems to be solved quickly on a single commodity machine.
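To make the idea of a user-specified distance function concrete, the following is a minimal sketch in plain Python, not the GraphLab Create API and not the algorithms described in the talk. It assumes a composite distance defined as a weighted sum of per-field distances and treats two records as duplicates when that distance falls at or below a threshold. All names here (`jaccard`, `absolute`, `composite_distance`, `deduplicate`, the sample records, the weights, and the threshold) are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard distance between the lowercase word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def absolute(a, b):
    """Absolute difference between two numbers."""
    return abs(a - b)


def composite_distance(r1, r2, components):
    """Weighted sum of per-field distances.

    components is a list of (field, distance_function, weight) triples,
    which is the part a user would tailor to their domain.
    """
    return sum(w * fn(r1[f], r2[f]) for f, fn, w in components)


def deduplicate(records, components, threshold):
    """Group record indices whose composite distance is <= threshold.

    Brute-force pairwise comparison with union-find (single linkage);
    fine for a toy example, far too slow for large datasets.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if composite_distance(records[i], records[j], components) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical sample data: two spellings of one business and one other.
records = [
    {"name": "Acme Coffee Shop", "zip": 98101},
    {"name": "acme coffee shop", "zip": 98101},
    {"name": "Blue Moon Diner", "zip": 98115},
]
components = [("name", jaccard, 1.0), ("zip", absolute, 0.01)]
print(deduplicate(records, components, threshold=0.5))
```

Changing the `components` list is all that is needed to update the distance for a new domain. The pairwise loop is quadratic in the number of records, so a scalable implementation would need something like blocking or nearest-neighbor search in its place.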
Copyright © American Statistical Association.