Online Program

Saturday, May 19
Data Science
Data Science in Health
Sat, May 19, 1:15 PM - 2:45 PM
Grand Ballroom G

The Story of Goldilocks and Three Twitter APIs (304554)

Sherry Emery, NORC at the University of Chicago 
*Yoonsang Kim, NORC at the University of Chicago 

Keywords: social media, data quality, disclosure standard

Public health research has increasingly used Twitter for behavioral and marketing surveillance and for topic discovery. However, few studies provide sufficient detail about how their Twitter data were collected. In addition to keyword selection, even the point of access to Twitter data can vary across studies. The three primary access points are the Streaming API, the Search API, and the Firehose, but little is known about the advantages and limitations of each. Such information is crucial to the interpretation, validity, and replicability of research findings. We examined whether collecting tweets with the same search filters and time period, but through different APIs, would retrieve comparable amounts and content of Twitter data. We collected tweets from January through June 2015 about anti-smoking, e-cigarettes, and tobacco using the Streaming API, the Search API, and PowerTrack (Firehose archive). These topics were chosen to capture variability in topic popularity and content. For tobacco and e-cigarette tweets, the Streaming API retrieved the largest number, followed by PowerTrack. For anti-smoking tweets, PowerTrack retrieved the largest number. The content of the retrieved tweets largely overlapped among the three APIs, but each API also retrieved unique tweets, which led to differences in trends and content. The Streaming API retrieved more tweets from e-cigarette commercial accounts. The Search API data did not represent spikes in conversation well. Researchers should understand how different data sources can influence both the amount and the content of the data they retrieve from social media, so that they can assess the implications for the interpretation of results. Further, researchers should routinely disclose their data source and the rationale for choosing it. Although we used Twitter as a use case, understanding and disclosing data sources and the quality of retrieved data are important steps in rigorous data analysis and are critically important for study replicability.
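
The overlap comparison described above could be sketched roughly as follows. This is only an illustration, not the authors' method: it assumes each source's tweets have already been reduced to a set of tweet IDs, and the function name `compare_sources`, the source labels, and the IDs are all hypothetical.

```python
def compare_sources(ids_by_source):
    """Return the tweet IDs shared by all sources and the IDs unique to each.

    ids_by_source maps a source label to an iterable of tweet IDs.
    Assumption: a tweet is identified by its ID alone.
    """
    id_sets = {name: set(ids) for name, ids in ids_by_source.items()}

    # IDs retrieved by every source.
    shared = set.intersection(*id_sets.values())

    # IDs retrieved by one source and by none of the others.
    unique = {}
    for name, ids in id_sets.items():
        others = set().union(*(s for n, s in id_sets.items() if n != name))
        unique[name] = ids - others
    return shared, unique


# Made-up tweet IDs for illustration only (not data from the study).
example = {
    "streaming": {1, 2, 3, 4},
    "search": {2, 3, 5},
    "powertrack": {2, 3, 4, 6},
}
shared, unique = compare_sources(example)
```

Comparing IDs only measures overlap in which tweets were retrieved. Differences in trends or account types would need the tweet timestamps and author information as well.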