Abstract Details

Activity Number: 256 - Contributed Poster Presentations: Section on Statistical Learning and Data Science
Type: Contributed
Date/Time: Monday, July 29, 2019 : 2:00 PM to 3:50 PM
Sponsor: Section on Statistical Learning and Data Science
Abstract #306672
Title: Performance of Latent Dirichlet Allocation with Different Topic and Document Structures
Author(s): Haotian Feng*
Companies: Clemson University
Keywords: Topic Modeling; LDA; Simulation; Evaluation

Topic modeling has been widely used to extract the structures (topics) in a collection (corpus) of documents. One popular method is Latent Dirichlet Allocation (LDA). The result of an LDA model (i.e., the number and types of topics) depends on tuning parameters. Several methods have been proposed and analyzed for selecting these parameters, but all of them have been developed using real corpora. With a real corpus, the true number and types of topics are unknown, and it is difficult to assess how well the data follow the assumptions of LDA. To address this issue, we developed a factorial simulation design to create corpora with known structure that varied on the following four factors: number of topics, proportion of topics in a document, size of the vocabulary in a topic, and proportion of the vocabulary that is contained in a document. Results suggest that the quality of the LDA fit depends on the document-topic distribution, and that the fit is best when document lengths are at least four times the vocabulary size. We also propose a pre-processing method that may be used to increase the quality of the LDA result in some of the worst-case scenarios from the factorial simulation study.
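
As a rough illustration of what a corpus with known structure can look like, the sketch below draws synthetic documents from the standard LDA generative process using NumPy. It is not the simulation design used in this study: the function name `simulate_lda_corpus`, the Dirichlet parameters `alpha` and `beta`, and the grid values are all assumptions chosen for illustration, and the four factors named in the abstract are only loosely represented here by the number of topics and the ratio of document length to vocabulary size.

```python
import itertools

import numpy as np


def simulate_lda_corpus(n_docs, n_topics, vocab_size, doc_length,
                        alpha=0.5, beta=0.1, seed=0):
    """Draw a synthetic corpus from the standard LDA generative process.

    All parameter names and defaults are illustrative assumptions,
    not the settings used in the study.
    """
    rng = np.random.default_rng(seed)
    # Topic-word distributions: one Dirichlet draw per topic.
    topic_word = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # Document-topic proportions: one Dirichlet draw per document.
    doc_topic = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)

    counts = np.zeros((n_docs, vocab_size), dtype=np.int64)
    for d in range(n_docs):
        # Split the document's tokens across topics, then draw the
        # words for each topic from that topic's word distribution.
        tokens_per_topic = rng.multinomial(doc_length, doc_topic[d])
        for k, n_k in enumerate(tokens_per_topic):
            counts[d] += rng.multinomial(n_k, topic_word[k])
    return counts, doc_topic, topic_word


# A small, hypothetical factorial grid: every combination of the number
# of topics and the document-length-to-vocabulary ratio.
vocab_size = 50
for n_topics, length_ratio in itertools.product([2, 5], [1, 4]):
    counts, doc_topic, topic_word = simulate_lda_corpus(
        n_docs=20, n_topics=n_topics, vocab_size=vocab_size,
        doc_length=length_ratio * vocab_size,
    )
    print(n_topics, length_ratio, counts.shape)
```

Because the true `doc_topic` and `topic_word` matrices are returned alongside the word counts, a fitted model can be compared directly against the structure that generated the data, which is not possible with a real corpus.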

Authors who are presenting talks have a * after their name.

Back to the full JSM 2019 program