Abstract:
|
Topic modeling is widely used to extract the structures (topics) in a collection (corpus) of documents. One popular method is Latent Dirichlet Allocation (LDA). The result of an LDA model (i.e., the number and types of topics) depends on its tuning parameters. Several methods have been proposed and analyzed for selecting these parameters, but all of them have been developed using real corpora. With real corpora, the true number and types of topics are unknown, and it is difficult to assess how well the data follow the assumptions of LDA. To address this issue, we developed a factorial simulation design to create corpora with known structure that varied on the following four factors: number of topics, proportion of topics in a document, size of the vocabulary in a topic, and proportion of the vocabulary that is contained in a document. Results suggest that the quality of the LDA fit depends on the document-topic distribution and that the fit performs best when document lengths are at least four times the vocabulary size. We also propose a pre-processing method that may be used to increase the quality of the LDA result in some of the worst-case scenarios from the factorial simulation study.
|