Estimated sampling variance in Respondent Driven Sampling data: Mathematical derivations, simulated tests on empirical data, and evidence from other forms of chain-referral data collection
Ashton Michael Verdery, University of North Carolina at Chapel Hill
Keywords: respondent driven sampling, chain-referral sampling, hidden or hard-to-reach populations, variance, network sampling
Respondent driven sampling (RDS) has become a popular method for sampling hidden populations, using referrals from current respondents to recruit new participants. A fundamental concern with RDS, however, is its sampling efficiency, which directly relates to how accurate any given sample is likely to be. Recent research shows that the sampling variance of RDS can be dangerously high in networks that exhibit clustering and social homophily. High sampling variance in clustered networks would not be a problem if the RDS-derived estimate of variance were accurate, because it would alert the researcher to situations where RDS is inappropriate or where more data should be collected to improve the precision of the estimates. In this paper, we mathematically derive the extent of bias that network structure can create for RDS variance estimators. We then illustrate the extent of these biases using data from 100 university-based friendship networks from Facebook and show that the variance estimates of several key RDS results published in the literature are biased. Finally, we examine what other forms of chain-referral data collection can tell us about sampling variance, using data from a network survey of a hidden population of Mexican immigrants in the Research Triangle area of North Carolina.