Abstract:
|
In statistics, measuring uncertainty is equally important as getting the point estimate. For text datasets, latent Dirichlet allocation (LDA) is one of the most commonly used topic modeling algorithms. I discovered that keeping special phrases in text cleaning improves the topic distinctivity at the word level. In addition, I also used a synthetic dataset with known proportions to test how LDA performs under different settings. No matter what the number of topics is pre-set to, LDA tends to "spread out" the topic assignments, making it difficult to remove excessive topics.
|