Abstract:
|
Topic modeling is a useful tool for examining latent structure in a corpus of documents. Latent Dirichlet Allocation (LDA) is a popular topic modeling method that assumes a Bayesian generative model for collections of exchangeable discrete observations, such as the words within a document. The degree to which an LDA model is useful for modeling a corpus depends, in part, on the number of topics selected. Too few topics can result in an LDA model that does not provide sufficient separation of topics, while too many topics can result in a model that is overly complex and difficult to interpret. Several ad hoc, heuristic methods for selecting an appropriate number of topics have been proposed. These typically require fitting the LDA model over a range of topic counts and measuring the performance of each resulting model by some criterion, such as perplexity, the rate of perplexity change, or goodness-of-fit statistics.
|