Abstract:
|
We study parameter estimation from data that consists of n independent observations from p-dimensional multinomial distributions with respective parameters N_i and P_i, for each observation i, under the popular topic model assumption. This model links all cell probabilities, collected in a p x n matrix P, via the assumption that P can be factorized as the product of two non-negative matrices A, of dimensions p x K, and W, of dimensions K x n. All three matrices are unknown, have non-negative entries that can be interpreted as conditional probabilities, and sum up to 1 column-wise. The factors are unique, up to column permutations, under slight variations of commonly used separability assumptions. In this work, K, the number of topics, is unknown, in contrast with all previous studies that assume it to be known a priori. We derive finite sample minimax lower bounds for the estimation of A, and develop a new and fast minimax adaptive algorithm. Our finite sample analysis is valid for any n, N_i, p and K. In particular, we allow the number of possible topics, K, to increase with n, a situation not handled well theoretically or practically in previous work.
|