All Times EDT

Abstract Details

Activity Number: 264 - Frontiers of High-Dimensional Statistics
Type: Invited
Date/Time: Wednesday, August 11, 2021 : 1:30 PM to 3:20 PM
Sponsor: IMS
Abstract #316771
Title: New Estimates of the Wasserstein Distance Between Document-Generating Distributions in Topic Models
Author(s): Florentina Bunea*
Companies: Cornell University
Keywords: Wasserstein distance; topic models; finite sample rates; text analysis; word distribution; topic distribution
Abstract:

We treat the problem of quantifying the pairwise similarity between documents in a corpus by identifying each document, alternatively, with (i) a discrete distribution on words from a dictionary common to the corpus and (ii) a discrete distribution on topics covered in the corpus, under a topic model assumption. A measure of similarity between a pair of documents is then provided by estimates of the Wasserstein distance between either the two document-specific word distributions or the two document-specific topic distributions. We provide computationally feasible estimates of the topic distributions, and also new estimates of the word distributions, in each document, for topic models. We establish sharp finite-sample bounds on the estimated Wasserstein distance between pairs of either topic distributions or word distributions. The former distance is typically faster to compute, as the number of topics is much smaller than the dictionary size, whereas the latter is shown to outperform the commonly used Wasserstein distance between empirical-frequency word estimates. We use our theoretical results and semi-synthetic data simulations to make practical recommendations.
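
To make the central quantity concrete, the following is a minimal sketch of the Wasserstein distance between two discrete topic distributions, computed as a linear program over couplings. It is not the estimator proposed in the talk. The function name wasserstein_discrete, the three-topic proportions p and q, and the ground metric D between topics are all hypothetical values chosen for illustration, and the sketch assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy.optimize import linprog


def wasserstein_discrete(p, q, cost):
    """Wasserstein-1 distance between discrete distributions p and q.

    cost[i, j] is the ground distance between support points i and j.
    Solves: minimize sum_ij cost[i, j] * pi[i, j] over couplings pi of (p, q).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    cost = np.asarray(cost, dtype=float)
    m, n = cost.shape
    # The coupling pi is flattened row-major: pi[i, j] -> x[i * n + j].
    row_sums = np.kron(np.eye(m), np.ones((1, n)))  # sum_j pi[i, j] = p[i]
    col_sums = np.kron(np.ones((1, m)), np.eye(n))  # sum_i pi[i, j] = q[j]
    res = linprog(
        cost.ravel(),
        A_eq=np.vstack([row_sums, col_sums]),
        b_eq=np.concatenate([p, q]),
        bounds=(0, None),
        method="highs",
    )
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun


# Hypothetical topic proportions for two documents over three topics.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
# Hypothetical ground metric between topics (assumed, not from the abstract).
D = [[0.0, 0.5, 1.0],
     [0.5, 0.0, 0.5],
     [1.0, 0.5, 0.0]]

print(wasserstein_discrete(p, q, D))
```

With these illustrative inputs the optimal plan moves 0.6 of the mass from the first topic to the third at unit cost, so the printed distance is about 0.6. The linear program has one variable per pair of support points, which is why the topic-level distance is cheaper to compute than the word-level one when the number of topics is much smaller than the dictionary size.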


Authors who are presenting talks have a * after their name.

Back to the full JSM 2021 program