Online Program

Friday, February 20
CS15 Social Media Applications Fri, Feb 20, 3:45 PM - 5:15 PM
Napoleon C

Garbage in Garbage Out: Acquisition and Quality Assessment of Social Media Data in Health Research (302906)

Sherry Emery, University of Illinois at Chicago 
Jidong Huang, University of Illinois at Chicago 
*Yoonsang Kim, University of Illinois at Chicago 

Keywords: social media, precision and recall, Bayesian model

Social media have become a novel data source in health research, and the number of studies that use social media data is growing. Health behavior and public opinion on health policy can be observed via social media on a large scale and within a short time frame. These benefits come with challenges. In particular, social media data are messy: a great deal of noise must be filtered out before analysis. If the noise is not filtered out, the quality of inferences will be poor, no matter how good the analytical techniques are. We propose iterative steps for building search keywords and rules that filter out noise. Using Twitter data and e-cigarette content as examples, we describe how to assess the quality of search filters, and we estimate the precision and recall of a search filter under three conditions: human coding is a perfect gold standard, human coding is an imperfect gold standard with false negative error, and the reference is a classifier that is not a gold standard. We use Bayesian models with a Gibbs sampler in situations where the information contained in the data is insufficient for a classical method. We discuss the consequences of validating search filters against an imperfect gold standard and how to address them.
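
As a rough illustration of the simplest of the three conditions (human coding treated as a perfect gold standard), the sketch below computes precision and recall of a search filter from hand-coded counts. The class name, function names, and all counts are hypothetical and are not taken from the study. The Beta posterior mean shown at the end is a simple conjugate stand-in under an assumed uniform prior; it is not the Gibbs-sampler model described in the abstract.

```python
from dataclasses import dataclass


@dataclass
class FilterCounts:
    """Hand-coded counts for a search filter (hypothetical structure)."""
    tp: int  # retrieved by the filter and coded relevant
    fp: int  # retrieved by the filter but coded irrelevant
    fn: int  # missed by the filter but coded relevant


def precision(c: FilterCounts) -> float:
    """Share of retrieved posts that are truly relevant."""
    return c.tp / (c.tp + c.fp)


def recall(c: FilterCounts) -> float:
    """Share of truly relevant posts that the filter retrieved."""
    return c.tp / (c.tp + c.fn)


def beta_posterior_mean(successes: int, failures: int,
                        a: float = 1.0, b: float = 1.0) -> float:
    """Posterior mean of a proportion under an assumed Beta(a, b) prior."""
    return (a + successes) / (a + b + successes + failures)


# Made-up counts, for illustration only.
counts = FilterCounts(tp=80, fp=20, fn=10)
p = precision(counts)
r = recall(counts)
p_bayes = beta_posterior_mean(counts.tp, counts.fp)
print(f"precision={p:.3f} recall={r:.3f} posterior mean precision={p_bayes:.3f}")
```

When human coding misses relevant posts (the second condition), the `fn` count is understated, so a naive recall computed this way would tend to be too optimistic.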