Activity Number: 347 - Recent Advances in Clustering and Mixture Models Analysis
Type: Topic-Contributed
Date/Time: Thursday, August 12, 2021 : 10:00 AM to 11:50 AM
Sponsor: Section for Statistical Programmers and Analysts
Abstract #317070
Title: Sparse Topic Modeling: Computational Efficiency and Near-Optimal Algorithms
Author(s): Ruijia Wu* and Linjun Zhang and Tony Cai
Companies: Department of Statistics, University of Pennsylvania and Rutgers University and University of Pennsylvania
Keywords: Topic modeling; Matrix factorization; High-dimensional statistics; Estimation
Abstract:
Sparse topic modeling under the probabilistic latent semantic indexing (pLSI) model is studied. Novel and computationally fast algorithms for estimating both the word-topic matrix and the topic-document matrix are proposed, and their theoretical properties are investigated. Our algorithm for the word-topic matrix first finds anchor words and then solves for the matrix. We treat the recovery of the topic-document matrix as a multinomial regression problem with non-negativity and column-sum constraints. Both minimax upper and lower bounds are established, and the results show that the proposed algorithms are rate-optimal up to a logarithmic factor. Simulation results show that the proposed algorithms perform well numerically and, across a range of simulation settings, are more accurate than existing methods.
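
As an illustration of the constrained multinomial formulation mentioned above, the following is a minimal sketch and not the authors' algorithm: it assumes the word-topic matrix is already known and fits the topic weights of a single document by a standard EM (multiplicative) update, which keeps the weights non-negative and summing to one. The function name `estimate_topic_weights`, the toy matrix `A`, and the counts are hypothetical.

```python
def estimate_topic_weights(A, counts, n_iter=1000, tol=1e-12):
    """Maximize the multinomial log-likelihood sum_j counts[j] * log((A w)[j])
    over weights w with w >= 0 and sum(w) == 1, using EM updates.

    A      : p rows of K topic probabilities (each column sums to 1); assumed known.
    counts : p observed word counts for one document.
    Assumes every observed word has positive probability under some topic.
    """
    p, K = len(A), len(A[0])
    total = float(sum(counts))
    freq = [c / total for c in counts]          # empirical word frequencies
    w = [1.0 / K] * K                           # start at the simplex center
    for _ in range(n_iter):
        # Word probabilities implied by the current topic weights.
        mix = [sum(A[j][k] * w[k] for k in range(K)) for j in range(p)]
        # EM step: reweight each topic by how well it explains the observed words.
        new_w = [
            w[k] * sum(freq[j] * A[j][k] / mix[j] for j in range(p) if freq[j] > 0)
            for k in range(K)
        ]
        s = sum(new_w)
        new_w = [v / s for v in new_w]          # guard against rounding drift
        change = max(abs(a - b) for a, b in zip(new_w, w))
        w = new_w
        if change < tol:
            break
    return w


# Hypothetical toy example: 3 words, 2 topics.
A = [[0.7, 0.1],
     [0.2, 0.3],
     [0.1, 0.6]]
counts = [250, 275, 475]                        # one document's word counts
weights = estimate_topic_weights(A, counts)
print(weights)
```

The multiplicative form is chosen here only because it satisfies both constraints automatically at every iteration; the estimator studied in the talk may differ.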
Authors who are presenting talks have a * after their name.