
Keywords: Hierarchical Dirichlet process, document clustering, Bayesian models
Government surveys of households and establishments typically include interviewer notes, which provide a rich source of context and information. We propose to extract themes from the collection of interviewer notes (our documents) using a scalable optimization method based on non-parametric mixtures of hierarchical Dirichlet processes, which allows discovery of multiple local (document-specific) themes linked to a set of global themes. Survey data are typically acquired under an informative sampling design, in which the probability of inclusion depends on the surveyed response, so that the distribution of the observed sample differs from that of the population. We use a pseudo-posterior in which sampling weights differentially weight the contributions of the document likelihoods to “undo” the informative design, so that we estimate the distribution of themes with respect to the population of establishments or households from which our sample was drawn. The method is applied to the Consumer Expenditure Survey conducted by the Bureau of Labor Statistics.
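
As a rough illustration of the weighting idea, a sampling-weighted pseudo-posterior is commonly written in the form below. The notation is an assumption introduced here for illustration and does not come from the text above: $y_i$ is the $i$-th sampled document, $\theta$ collects the model parameters, $\pi_i$ is the inclusion probability of unit $i$, and the weights are normalized to sum to the sample size $n$, which is one common convention rather than a stated choice.

```latex
% Sketch under assumed notation: each document likelihood is
% exponentiated by its normalized sampling weight.
\[
  p^{\pi}(\theta \mid y_1,\dots,y_n, \tilde{w})
  \;\propto\;
  \left[ \prod_{i=1}^{n} p(y_i \mid \theta)^{\tilde{w}_i} \right] p(\theta),
  \qquad
  \tilde{w}_i = \frac{n\, w_i}{\sum_{j=1}^{n} w_j},
  \quad
  w_i = \frac{1}{\pi_i}.
\]
```

Under this form, a document from a unit with a low inclusion probability receives a larger exponent, so it counts for more in the likelihood, roughly in proportion to the number of population units it represents. When all inclusion probabilities are equal, every $\tilde{w}_i = 1$ and the expression reduces to the ordinary posterior.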