Abstract Details

Activity Number: 157
Type: Topic Contributed
Date/Time: Monday, August 1, 2016 : 10:30 AM to 12:20 PM
Sponsor: Section on Bayesian Statistical Science
Abstract #319780
Title: Why Popular Bayesian Nonparametric Methods Fail for Sparse Clustering Tasks
Author(s): Rebecca Steorts and Jeffrey Miller and Brenda Betancourt* and Abbas Zaidi and Hanna Wallach
Companies: Duke University and Duke University and and Duke University and University of Massachusetts - Amherst/Microsoft Research
Keywords: clustering ; Bayesian nonparametrics ; mixture models ; record linkage ; community detection ; Dirichlet process

Record linkage merges large, potentially noisy databases to remove duplicate entities. Community detection is the process of placing entities into groups of similar entities, or "communities." Both tasks are important in author disambiguation, genetics, official statistics, human rights conflict, and other areas. It is common to treat record linkage and community detection as clustering tasks. However, generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models all make this assumption. For some tasks, this assumption is inappropriate. For example, when performing record linkage, the size of each cluster is often unrelated to the size of the data set. Consequently, each cluster contains a negligible fraction of the total number of data points. Such tasks require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and discussing a new model that exhibits this property. We illustrate this model on real and simulated data.
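
The linear-growth assumption can be illustrated with a small simulation. The sketch below is not the authors' code and does not implement their proposed model. It draws partitions from the Chinese restaurant process, the partition distribution underlying Dirichlet process mixture models, and reports the fraction of data points that fall in the largest cluster. The function names, the concentration parameter alpha = 1, the data-set sizes, and the number of repetitions are all assumptions chosen for illustration.

```python
import random


def sample_crp_cluster_sizes(n, alpha=1.0, seed=0):
    """Draw one partition of n points from a Chinese restaurant process.

    Returns the list of cluster sizes. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    sizes = []
    for i in range(n):
        # With i points already seated, the next point joins existing
        # cluster k with probability sizes[k] / (i + alpha) and opens a
        # new cluster with probability alpha / (i + alpha).
        u = rng.random() * (i + alpha)
        acc = 0.0
        for k, size in enumerate(sizes):
            acc += size
            if u < acc:
                sizes[k] += 1
                break
        else:
            # No existing cluster was chosen: open a new one.
            sizes.append(1)
    return sizes


def largest_cluster_fraction(n, alpha=1.0, seed=0):
    """Fraction of the n points that belong to the largest cluster."""
    sizes = sample_crp_cluster_sizes(n, alpha, seed)
    return max(sizes) / n


if __name__ == "__main__":
    # Average the largest-cluster fraction over a few independent draws
    # for increasing data-set sizes (sizes and repetitions are arbitrary).
    for n in (100, 1000, 10000):
        fractions = [largest_cluster_fraction(n, seed=s) for s in range(20)]
        print(n, sum(fractions) / len(fractions))
```

Under these assumptions, the average fraction should stay roughly constant as n grows rather than shrink toward zero. A model with the microclustering property is meant to behave differently: its largest cluster should occupy a vanishing fraction of the data as the data set grows.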

Authors who are presenting talks have a * after their name.

Back to the full JSM 2016 program

Copyright © American Statistical Association