Abstract:
|
Topic modeling is a useful tool for examining latent structure in a corpus of documents. Latent Dirichlet Allocation (LDA) is a popular topic modeling method that assumes a Bayesian generative model for collections of exchangeable discrete observations, such as the words within a document. The degree to which an LDA model is useful for modeling a corpus depends, in part, on the number of topics selected. Too few topics can result in an LDA model that does not provide sufficient separation of topics, while too many topics can result in a model that is overly complex and difficult to interpret. Several ad hoc, heuristic methods for selecting an appropriate number of topics have been proposed. These typically require fitting the LDA model over a range of topic counts and measuring the performance of each resulting model by some criterion, such as perplexity, the rate of perplexity change, or goodness-of-fit statistics.
|