
All Times EDT

Abstract Details

Activity Number: 184 - Recent Advances in Statistical Machine Learning
Type: Invited
Date/Time: Tuesday, August 10, 2021, 1:30 PM to 3:20 PM
Sponsor: IMS
Abstract #315507
Title: Inhomogeneous-Word PCA for Estimating the Weights in a Topic Model
Author(s): Tracy Ke* and Minzhe Wang
Companies: Harvard University and University of Chicago
Keywords: Topic model; Sparsity; Count data; PCA; Minimax; Screening
Abstract:

We introduce the notion of “homogeneous words” in a topic model: a homogeneous word is one whose corresponding entry is the same in every topic vector. As a result, it has the same population frequency in all documents, regardless of the topic weights. In real applications, examples of homogeneous words include stop words and sentiment words that carry no topic information. When homogeneous words exist, a natural strategy for topic modeling is to first identify the set of homogeneous words and then restrict the dictionary to the inhomogeneous words only. However, a naive implementation of this strategy fails to yield the desired improvement. We discover that the key to improving the signal-to-noise ratio is to properly normalize the word counts after word screening. This gives rise to a new spectral method for estimating the weights in a topic model, called inhomogeneous-word PCA. We derive the minimax error rate as a function of the probability mass occupied by inhomogeneous words (which defines “sparsity” in a topic model). We also show that inhomogeneous-word PCA attains the minimax error rate under mild regularity conditions.
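The abstract describes a two-step pipeline: screen out (near-)homogeneous words, then normalize the remaining counts before running PCA. A minimal sketch of that idea is below; the variance-based screening statistic, the inverse-square-root normalization, and the `var_threshold` tuning parameter are all illustrative assumptions, not the authors' actual procedure.

```python
import numpy as np

def inhomogeneous_word_pca_sketch(counts, n_topics, var_threshold=1e-6):
    """Hedged sketch of the screen-then-normalize idea from the abstract.

    counts: (n_words, n_docs) array of word counts.
    Returns an (n_docs, n_topics - 1) document embedding from PCA.
    """
    # Per-document empirical word frequencies.
    freqs = counts / counts.sum(axis=0, keepdims=True)
    # Screening (illustrative): a homogeneous word has the same population
    # frequency in every document, so its empirical frequency should have
    # low variance across documents.
    keep = freqs.var(axis=1) > var_threshold
    sub = freqs[keep]
    # Normalization (illustrative guess): scale each retained word's row by
    # the inverse square root of its mean frequency, a rough
    # variance-stabilizing rescaling of the counts after screening.
    mean_f = sub.mean(axis=1, keepdims=True)
    normalized = sub / np.sqrt(mean_f)
    # PCA via SVD of the centered matrix; the leading right singular
    # vectors give a low-dimensional embedding of the documents.
    centered = normalized - normalized.mean(axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[: n_topics - 1].T
```

The naive version the abstract warns against would skip the normalization step and run PCA directly on the screened frequency matrix; the point of the paper is that the normalization is what recovers the signal-to-noise ratio.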


Authors who are presenting talks have a * after their name.
