JSM 2015 Preliminary Program


Abstract Details

Activity Number: 258
Type: Contributed
Date/Time: Monday, August 10, 2015, 2:00 PM to 3:50 PM
Sponsor: Section on Statistical Learning and Data Mining
Abstract #317295
Title: A New Framework for Scalable, Accurate, and Intuitive Data Matching
Author(s): Brian P. Kent* and Robert Voyer
Companies: Dato and Dato
Keywords: record linkage; data matching; data fusion; data quality; machine learning; scalable computation
Abstract:

Matching data records that correspond to the same real-world entity is a critical phase of many data analysis pipelines, particularly when datasets are assembled from multiple sources. Most existing record linkage techniques are too complicated for non-experts to implement on their own, while publicly available software tools work with only relatively small datasets, are restricted to specific domains like medical record deduplication, or are not actively maintained.

We present a framework that unifies the data matching tasks of deduplication, record linkage, and auto-tagging in a way that is intuitive for novices yet fully expressive for experts. We also describe a suite of algorithms that solve each task efficiently while leaving users free to specify and update domain-specific distance functions. The framework is implemented in the GraphLab Create machine learning library, whose out-of-core data structures and parallel computation allow very large data matching problems to be solved quickly on a single commodity machine.
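
To make the role of a user-specified distance function concrete, the following is a minimal sketch in plain Python. It is not the GraphLab Create API and not the authors' algorithm: it assumes, purely for illustration, that both record linkage and deduplication can be phrased as finding records within a fixed radius under one composite distance, and it uses a brute-force search that would not scale to large datasets. The field names, weights, radius, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def name_distance(a, b):
    """String distance in [0, 1] derived from difflib's similarity ratio."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_distance(r, s):
    """Hypothetical composite distance: weighted name distance plus a
    penalty when the zip codes differ. The weights are arbitrary."""
    return 0.8 * name_distance(r["name"], s["name"]) + 0.2 * (r["zip"] != s["zip"])


def link(queries, reference, distance, radius):
    """Record linkage sketch: for each query, return the index of the
    nearest reference record, or None if nothing lies within the radius."""
    matches = []
    for q in queries:
        best = min(range(len(reference)), key=lambda i: distance(q, reference[i]))
        matches.append(best if distance(q, reference[best]) <= radius else None)
    return matches


def deduplicate(records, distance, radius):
    """Deduplication sketch: merge records that are within the radius of
    each other (transitively), using a small union-find structure."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if distance(records[i], records[j]) <= radius:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


records = [
    {"name": "Jon Smith", "zip": "98101"},
    {"name": "John Smith", "zip": "98101"},
    {"name": "Maria Garcia", "zip": "10001"},
]
queries = [
    {"name": "J. Smith", "zip": "98101"},
    {"name": "Wei Chen", "zip": "60601"},
]

duplicate_groups = deduplicate(records, record_distance, radius=0.3)
linked = link(queries, records, record_distance, radius=0.3)
print(duplicate_groups)
print(linked)
```

Swapping in a different `record_distance` changes the matching behavior without touching `link` or `deduplicate`, which is the kind of separation between task and distance function that the abstract describes.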


Authors who are presenting talks have a * after their name.
