Abstract Details

Activity Number: 591
Type: Invited
Date/Time: Wednesday, August 3, 2016 : 2:00 PM to 3:50 PM
Sponsor: JASA, Applications and Case Studies
Abstract #318066
Title: A Regularization Scheme on Word Occurrence Rates That Improves Estimation and Interpretation of Topical Content
Author(s): Edoardo M. Airoldi*
Companies: Harvard
Keywords: High-dimensional Data ; Categorical Data ; Hamiltonian Monte Carlo ; Parallel Inference ; Text Analysis

An ongoing challenge in the analysis of document collections is how to summarize content in terms of a set of inferred themes that can be interpreted substantively as topics. The current practice of parameterizing themes in terms of their most frequent words limits interpretability by ignoring the differential use of words across topics. In this paper, we develop methodology to identify words that are both frequent in and exclusive to a theme, and we propose a regularization scheme that leads to better estimates of these quantities. We illustrate the efficacy of word frequency and exclusivity in characterizing topical content on two very large collections of documents, from Reuters and the New York Times. We then carry out two randomized experiments on Amazon Mechanical Turk to demonstrate that topic summaries based on frequency and exclusivity, estimated using the proposed regularization scheme, are more interpretable than currently established frequency-based summaries, and that the proposed model produces more efficient estimates of exclusivity than current models.
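
The abstract does not state how frequency and exclusivity are combined, so the following is only an illustrative sketch under stated assumptions, not the authors' estimator. It assumes a matrix of per-topic word occurrence rates is already available, defines exclusivity as a word's rate in one topic divided by its summed rate across all topics, and combines the within-topic ranks of the two quantities with a weighted harmonic mean. The function names, the input layout, the `weight` parameter, and the toy rates are all hypothetical.

```python
def ecdf_ranks(values):
    """Map each value to the fraction of values less than or equal to it."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


def frequency_exclusivity_scores(rates, topic, weight=0.5):
    """Score each word in `topic` by a weighted harmonic mean of its
    frequency rank and its exclusivity rank within that topic.

    rates:  rates[k][v] is the occurrence rate of word v in topic k
            (hypothetical layout; rates are assumed strictly positive).
    weight: share of the score given to exclusivity (assumed default).
    """
    freq = rates[topic]
    # Exclusivity: this topic's share of the word's total rate over topics.
    excl = [freq[v] / sum(row[v] for row in rates) for v in range(len(freq))]
    freq_rank = ecdf_ranks(freq)
    excl_rank = ecdf_ranks(excl)
    return [
        1.0 / (weight / e + (1.0 - weight) / f)
        for f, e in zip(freq_rank, excl_rank)
    ]


# Toy data (made up for illustration): two topics, four words.
vocab = ["the", "market", "stock", "goal"]
rates = [
    [0.40, 0.35, 0.20, 0.05],  # topic 0
    [0.40, 0.05, 0.05, 0.50],  # topic 1
]

scores = frequency_exclusivity_scores(rates, topic=0)
most_frequent = max(range(len(vocab)), key=lambda v: rates[0][v])
top_scored = max(range(len(vocab)), key=lambda v: scores[v])
print("most frequent word in topic 0:", vocab[most_frequent])
print("top word by combined score:   ", vocab[top_scored])
```

In this toy example the most frequent word in topic 0 is "the", which is equally common in both topics, while the combined score ranks "market" first because it is both common in topic 0 and rare elsewhere. The regularization scheme and the Hamiltonian Monte Carlo inference mentioned in the abstract are not represented here.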

Authors who are presenting talks have a * after their name.