Online Program

Friday, February 16
CS03 Text Analytics Applications Fri, Feb 16, 9:15 AM - 10:45 AM
Salon D

Approachable, Interpretable Tools for Mining and Summarizing Large Text Corpora in R (303494)


*Luke W. Miratrix, Harvard University 

Keywords: text analytics, sparse regression, data mining, R, summarization, text exploration

We present Concise Comparative Summarization (CCS), a general framework for topic-specific summarization that can be used to explore rich text corpora in a variety of contexts. This framework, built on sparse classification methods, is a compromise between the simple word-frequency-based methods currently in wide use and more model-intensive methods such as Latent Dirichlet Allocation (LDA). CCS is, essentially, text regression: we regress labelings of documents onto the high-dimensional counts of all their words and phrases. The small set of phrases found to be predictive is then harvested as the summary. CCS allows for significance testing and also provides a tuning parameter that lets the user select more general or more specific phrases as desired. We illustrate CCS by showcasing textreg, an easy-to-use R package that also offers a variety of exploratory and visualization utility functions for text, such as methods for extracting all sentence fragments containing a specified phrase, stemming documents in a manner that helps preserve human readability, and generating graphical displays contrasting different summaries.
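
As a rough illustration of the "text regression" idea described above, the following Python sketch regresses +1/-1 document labels onto counts of words and two-word phrases using L1-penalized least squares (a lasso fit by coordinate descent), then keeps the phrases with positive coefficients as the summary. It is a minimal sketch under stated assumptions, not the textreg implementation: the function names (`phrases`, `soft_threshold`, `summarize`), the toy corpus, and the choice of a plain lasso are all hypothetical, and the penalty `lam` here only controls how many phrases survive. It is not the generality tuning parameter mentioned in the abstract.

```python
from collections import Counter


def phrases(text, max_len=2):
    """Return all word n-grams of length 1..max_len in a document."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(tokens) - n + 1)
    ]


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def summarize(docs, labels, lam, sweeps=100):
    """Lasso of labels on phrase counts; returns (phrase, coef) with coef > 0.

    Minimizes 0.5 * ||y - b0 - X b||^2 + lam * ||b||_1 by cyclic
    coordinate descent, with an unpenalized intercept b0.
    """
    counts = [Counter(phrases(d)) for d in docs]
    vocab = sorted({p for c in counts for p in c})
    # One column of counts per phrase (every phrase occurs at least once).
    cols = [[c[p] for c in counts] for p in vocab]
    n = len(docs)
    beta = [0.0] * len(vocab)
    mean_y = sum(labels) / n
    resid = [y - mean_y for y in labels]
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            norm = sum(v * v for v in col)
            # Correlation of this column with the partial residual.
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, resid))
            new = soft_threshold(rho, lam) / norm
            delta = new - beta[j]
            if delta:
                resid = [r - v * delta for v, r in zip(col, resid)]
                beta[j] = new
        # Intercept update: recenter the residuals.
        shift = sum(resid) / n
        resid = [r - shift for r in resid]
    keep = [(p, b) for p, b in zip(vocab, beta) if b > 0]
    return sorted(keep, key=lambda pb: -pb[1])


# Hypothetical toy corpus: +1 marks documents on the topic of interest.
docs = [
    "tax cuts help growth",
    "tax cuts hurt schools",
    "schools tax funding",
    "budget cuts growth",
]
labels = [1, 1, -1, -1]

summary = summarize(docs, labels, lam=1.5)
print([phrase for phrase, _ in summary])
```

On this toy corpus the words "tax" and "cuts" each also occur in an off-topic document, so with `lam=1.5` only the phrase "tax cuts" survives; raising the penalty to `lam=3.0` shrinks every coefficient to zero and yields an empty summary.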