Online Program

Thursday, January 11
Thu, Jan 11, 9:00 AM - 10:45 AM
Crystal Ballroom E
Data Integration

Latent Dirichlet Allocation Topic Models Applied to the Centers for Disease Control and Prevention’s Grant Portfolio (304144)

*Matthew Keith Eblen, Centers for Disease Control and Prevention 
Robin Wagner, Centers for Disease Control and Prevention 

Keywords: Public Health, Natural Language Processing, Machine Learning, Topic Models, Latent Dirichlet Allocation

In fiscal year 2016, the Centers for Disease Control and Prevention (CDC) administered over $5 billion in grants to institutions across the United States and the world. Each grant was administered by one of CDC’s 13 Centers, Institutes or Offices (CIOs), each of which is responsible for different areas of public health. The scope and content of those grants varied widely: some were cooperative agreements with state health departments to conduct surveillance for a specific disease or condition, while others were awards to universities to conduct research related to public health. This paper explores the use of natural language processing and machine learning to uncover common themes, or topics, in the content of those grants. Specifically, a Latent Dirichlet Allocation (LDA) topic model was applied to a corpus of CDC grant abstracts, resulting in topical word clusters that categorized CDC’s recent investments in public health. The results both agreed well with expectations and provided interesting insights into the nature of CDC’s public health grant portfolio.
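For readers unfamiliar with LDA, the following is a minimal sketch of how a topic model yields topical word clusters from a set of documents. It is an illustration only: the abstract does not state which software, preprocessing, number of topics, or inference method the authors used. The collapsed Gibbs sampler, the function names `fit_lda` and `top_words`, the hyperparameter values, and the toy documents are all assumptions made for this example and are not CDC data.

```python
import random

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents.

    alpha and beta are symmetric Dirichlet priors (illustrative values).
    Returns the vocabulary, document-topic counts, and topic-word counts.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token
                # (up to a normalizing constant).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    return vocab, doc_topic, topic_word

def top_words(vocab, topic_word, n=3):
    """Return the n highest-count words for each topic."""
    clusters = []
    for row in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda i: -row[i])
        clusters.append([vocab[i] for i in ranked[:n]])
    return clusters

# Made-up toy "abstracts", already tokenized; not real grant text.
toy_docs = [
    "influenza surveillance state health department".split(),
    "state surveillance influenza vaccine".split(),
    "university research injury prevention".split(),
    "injury prevention research university study".split(),
]

vocab, doc_topic, topic_word = fit_lda(toy_docs, num_topics=2)
clusters = top_words(vocab, topic_word, n=3)
for topic_index, words in enumerate(clusters):
    print(topic_index, words)
```

On a real corpus the documents would first be tokenized and stripped of stop words, and the number of topics would be chosen by inspection or a model-selection criterion; a production analysis would more likely use an established library than a hand-written sampler.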