Abstract:
|
Augmenting survey data with digital traces is a promising direction for combining the advantages of active and passive data collection. However, extracting interpretable measurements from digital traces for social science research is challenging. In this study, we review the measurement-related opportunities and challenges that arise when working with digital trace data in combination with survey records. As an empirical demonstration, we show how to obtain meaningful measurements of news media consumption from survey respondents' web browsing data using a natural language processing algorithm that estimates contextual word embeddings from text data. Our approach is particularly relevant when large amounts of text need to be summarized with only a few variables, without losing too much information about the text itself. While we focus on categorizing text into topics, our approach may likewise be extended to capture, for example, the polarity of texts. In addition, we show how our approach can be extended to multilingual settings.
|