Abstract:
|
An ongoing challenge in the analysis of document collections is how to summarize content in terms of a set of inferred themes that can be interpreted substantively as topics. The current practice of parameterizing the themes by their most frequent words limits interpretability by ignoring the differential use of words across topics. In this paper, we develop methodology to identify words that are both frequent and exclusive to a theme, and we propose a regularization scheme that leads to better estimates of these quantities. We illustrate the efficacy of word frequency and exclusivity at characterizing topical content on two very large collections of documents, from Reuters and the New York Times. We then carry out two randomized experiments on Amazon Mechanical Turk to demonstrate that topic summaries based on frequency and exclusivity, estimated using the proposed regularization scheme, are more interpretable than the currently established frequency-based summaries, and that the proposed model yields more efficient estimates of exclusivity than currently established models.
|