
Measuring Precision and Recall with Social Media Data

Sherry Emery, Institute for Health Research Policy 
*Yoonsang Kim, UIC Institute for Health Research and Policy 
Jidong Huang, Institute for Health Research Policy 

Keywords: Twitter, Gibbs sampler, random sampling

Social media have received considerable attention as a new data source for public health. We are particularly interested in using social media data to measure the share of voice on tobacco control policy, antismoking campaigns, tobacco marketing tactics, and related topics. However, social media data are messy and diffuse. Keywords are the lens through which we see what people are saying on social media platforms, but face validity is not a sufficient criterion for selecting the keywords used to collect social media data. We will describe how the development of the Health Media Collaboratory (HMC) smoking-related Twitter archive produced insights that highlight both the importance and the statistical challenges of measuring precision (positive predictive value) and recall (sensitivity). We propose a theoretical framework for calculating precision and recall with Twitter data. We further propose two methods for estimating them when there is no gold standard: 1) an empirical approach based on random sampling and 2) Bayesian estimation using Gibbs sampling. We will illustrate these methods with examples from the archives of the Twitter API and HMC’s smoking-related firehose.
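To make the random-sampling idea concrete, the sketch below shows one way such an estimate could be computed. It is an illustration only, not the authors' procedure: it assumes that a human coder has labeled a simple random sample of tweets matched by the keywords and a second simple random sample of tweets not matched by them, and that the sizes of both pools are known. The function name `estimate_precision_recall` and all of its parameters are hypothetical.

```python
from typing import Sequence, Tuple


def estimate_precision_recall(
    n_retrieved: int,
    n_unretrieved: int,
    retrieved_sample: Sequence[bool],
    unretrieved_sample: Sequence[bool],
) -> Tuple[float, float]:
    """Estimate precision and recall from two hand-labeled random samples.

    Assumptions (hypothetical, not taken from the abstract):
      * retrieved_sample holds relevance labels (True = relevant) for a simple
        random sample of the tweets matched by the keyword filter.
      * unretrieved_sample holds labels for a simple random sample of the
        tweets the keyword filter did not match.
      * n_retrieved and n_unretrieved are the sizes of those two pools.
    """
    if not retrieved_sample or not unretrieved_sample:
        raise ValueError("both samples must contain at least one labeled tweet")

    # Precision: share of keyword-matched tweets that are truly relevant.
    precision = sum(retrieved_sample) / len(retrieved_sample)

    # Share of unmatched tweets that are nevertheless relevant.
    miss_rate = sum(unretrieved_sample) / len(unretrieved_sample)

    # Scale the sample proportions up to the full pools.
    est_true_positives = precision * n_retrieved
    est_false_negatives = miss_rate * n_unretrieved

    # Recall: relevant tweets captured, out of all relevant tweets.
    total_relevant = est_true_positives + est_false_negatives
    recall = est_true_positives / total_relevant if total_relevant > 0 else float("nan")
    return precision, recall


# Made-up illustration: 1,000 matched and 9,000 unmatched tweets,
# with 10 labeled tweets drawn from each pool.
precision, recall = estimate_precision_recall(
    n_retrieved=1000,
    n_unretrieved=9000,
    retrieved_sample=[True] * 8 + [False] * 2,
    unretrieved_sample=[True] * 1 + [False] * 9,
)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

The numbers in the usage example are invented for illustration. Because the unmatched pool is usually far larger than the matched pool, even a small rate of relevant tweets among unmatched ones can pull recall down sharply, which is one reason the recall estimate tends to be the harder of the two to pin down.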