Abstract:
|
We introduce the notion of “homogeneous words” in a topic model: a homogeneous word is one whose corresponding entry is the same in every topic vector. As a result, such a word has the same population frequency in all documents, regardless of the topic weights. In real applications, examples of homogeneous words include stop words and sentiment words that carry no topic information. Given the existence of homogeneous words, a natural strategy for topic modeling is to first identify the set of homogeneous words and then restrict attention to the dictionary consisting of only inhomogeneous words. However, a naive implementation of this strategy fails to yield the desired improvement. We discover that the key to improving the signal-to-noise ratio is to properly normalize the word counts after word screening. This gives rise to a new spectral method for estimating the topic weights, called inhomogeneous-word PCA. We derive the minimax error rate as a function of the probability mass occupied by inhomogeneous words (which defines “sparsity” in a topic model). We also show that inhomogeneous-word PCA attains the minimax error rate under mild regularity conditions.
|