Abstract: |
Computational social science is the study of social phenomena using digitized information and streaming data together with computational and statistical methods. Most computational social scientists begin with the underlying structure of the data before building scalable models and algorithms. Two applications with challenging structure are record linkage and community detection. Record linkage is the process of merging large, potentially noisy databases to remove duplicate entities. Community detection is the process of partitioning entities into similar groups, or "communities." Both tasks are fundamental to applications in author disambiguation, genetics, official statistics, and the documentation of human rights conflicts. It is common to treat record linkage and community detection as clustering tasks. However, most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models all make this assumption, as do all other infinitely exchangeable clustering models. For record linkage and community detection, this assumption is undesirable: when merging databases, for example, the size of each cluster is often unrelated to the size of the data set, so each cluster contains a negligible fraction of the total number of data points. Such tasks therefore require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the \emph{microclustering property} and discussing a new model that exhibits this property. We describe successes of this new approach in applications to official statistics and the Syrian conflict.
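One natural formalization of the sublinear-growth requirement, sketched here in LaTeX (the symbols $c_N$ and $M_N$ are illustrative notation, not taken from the abstract itself): let $c_N$ be a random partition of the first $N$ data points and let $M_N$ denote the size of its largest cluster. A sequence of clustering models satisfies the microclustering property when the largest cluster occupies a vanishing fraction of the data:

% Sketch of the microclustering property; notation c_N, M_N is
% illustrative and not drawn from the abstract.
\[
  \frac{M_N}{N} \;\xrightarrow{\;p\;}\; 0
  \quad \text{as } N \to \infty,
  \qquad \text{i.e.,} \quad
  \lim_{N \to \infty} \Pr\!\left(\frac{M_N}{N} > \epsilon\right) = 0
  \ \text{for every } \epsilon > 0.
\]

Under this reading, the infinitely exchangeable models named above fail the property (aside from degenerate cases such as the all-singletons partition), since their largest cluster grows linearly in $N$.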