Keywords: Public Health, Natural Language Processing, Machine Learning, Topic Models, Latent Dirichlet Allocation
In fiscal year 2016, the Centers for Disease Control and Prevention (CDC) administered over $5 billion in grants to institutions across the United States and the world. Each grant was administered by one of CDC’s 13 Centers, Institutes or Offices (CIOs), each of which is responsible for different areas of public health. The scope and content of those grants varied widely – e.g., some were cooperative agreements with state health departments to conduct surveillance for a specific disease or condition, while others were awards to universities to conduct research related to public health. This paper explores the use of natural language processing and machine learning to uncover common themes, or topics, in the content of those grants. Specifically, a Latent Dirichlet Allocation (LDA) topic model was applied to a corpus of CDC grant abstracts, resulting in topical word clusters that categorized CDC’s recent investments in public health. The results both agreed well with expectations and provided interesting insights into the nature of CDC’s public health grant portfolio.
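As a rough illustration of the kind of model described above, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling, one standard inference method for this model. It is not the implementation used in this paper: the four toy "abstracts", the stopword list, the number of topics, and the hyperparameters `alpha` and `beta` are all assumptions made up for the example.

```python
import random

# Toy stand-ins for grant abstracts (invented for illustration, not CDC data).
raw_docs = [
    "state health department influenza surveillance program",
    "influenza surveillance laboratory capacity state",
    "university research injury prevention violence",
    "injury prevention research center university",
]
STOPWORDS = {"the", "and", "of", "for", "to"}  # hypothetical minimal list
docs = [[w for w in d.split() if w not in STOPWORDS] for d in raw_docs]


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA with collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token
                # (unnormalized; rng.choices normalizes the weights).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return vocab, topic_word, doc_topic


def top_words(vocab, topic_word, n=3):
    """Return the n highest-count words for each topic."""
    result = []
    for row in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda v: -row[v])
        result.append([vocab[v] for v in ranked[:n]])
    return result


vocab, topic_word, doc_topic = fit_lda(docs, num_topics=2)
topics = top_words(vocab, topic_word)
for k, words in enumerate(topics):
    print(f"topic {k}: {', '.join(words)}")
```

Each topic is summarized by its highest-count words, which is the sense in which a topic model yields "topical word clusters"; the rows of `doc_topic` give the per-document topic counts that could be used to relate topics back to individual grants.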