Variance Estimation in Respondent-Driven Sampling: Implications for Research Design
*Douglas D. Heckathorn, Cornell University
Keywords: hidden populations, sampling hard-to-reach populations, research design
Respondent-Driven Sampling (RDS) has become the method of choice in studies of hidden and hard-to-reach populations, yet important questions regarding the method remain unresolved (Heckathorn 1997, 2002, 2007; Salganik and Heckathorn 2004). RDS is a form of network sampling in which initial subjects, drawn as a convenience sample, serve as “seeds” who recruit several peers; these respondents in turn recruit further peers. The sample expands in this manner, wave by wave, until the desired sample size has been reached. The popularity of RDS derives in part from a proof (Salganik and Heckathorn 2004) showing that when the assumptions of the method are satisfied, population estimates are asymptotically unbiased. This means that bias is only on the order of 1/[sample size], so bias is trivial in samples of significant size.
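The wave-by-wave expansion described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than part of the paper: the function name `simulate_rds`, the `coupons` limit on recruits per respondent, the toy ring network, and the simplification that no one is recruited twice.

```python
import random

def simulate_rds(neighbors, seeds, coupons=3, target_size=100, rng=None):
    """Hypothetical sketch of RDS recruitment on a known network.

    neighbors: dict mapping each person to a list of their peers.
    seeds: the initial convenience sample (wave 0).
    coupons: assumed maximum number of peers each respondent may recruit.
    """
    rng = rng or random.Random()
    sampled = list(seeds)                # respondents in recruitment order
    wave_of = {s: 0 for s in seeds}      # wave number of each respondent
    current_wave = list(seeds)
    while current_wave and len(sampled) < target_size:
        next_wave = []
        for recruiter in current_wave:
            # Peers of this recruiter who are not yet in the sample
            # (duplicates in the peer list are ignored).
            eligible = [p for p in dict.fromkeys(neighbors[recruiter])
                        if p not in wave_of]
            rng.shuffle(eligible)
            for peer in eligible[:coupons]:
                if len(sampled) >= target_size:
                    break
                wave_of[peer] = wave_of[recruiter] + 1
                sampled.append(peer)
                next_wave.append(peer)
        current_wave = next_wave
    return sampled, wave_of

# Toy network (an assumption): 50 people on a ring, each tied to the
# two nearest peers on either side.
n = 50
ring = {i: [(i + d) % n for d in (-2, -1, 1, 2)] for i in range(n)}
sample, wave_of = simulate_rds(ring, seeds=[0], coupons=2,
                               target_size=20, rng=random.Random(1))
print(len(sample), max(wave_of.values()))
```

Recording the wave of each respondent keeps the information needed for the design questions raised later, such as whether seeds or early waves should be discarded.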
A problem inherent in the method is that design effects become large when networks contain choke points, that is, when groups are Balkanized into subgroups with extreme homophily. Specifically, as homophily increases, design effects increase exponentially, so high-homophily systems can have design effects so large that use of the sampling method is impractical.
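The dependence of the design effect on homophily can be explored with a small simulation. This is a stylized sketch under strong assumptions, not the analysis presented in the paper: a single recruitment chain, two equal-sized groups, and a homophily parameter interpreted as the probability that a recruit is forced into the recruiter's own group. The function names are hypothetical, and the sketch makes no claim about the functional form of the relationship. The design effect is estimated as the variance of the sample proportion across simulated chains divided by the variance a simple random sample of the same size would give.

```python
import random
import statistics

def simulate_chain(n, homophily, rng):
    """Proportion of group-1 members in one simulated chain of length n.

    With probability `homophily` the recruit belongs to the recruiter's
    group; otherwise the recruit is drawn at random from the two
    equal-sized groups (an assumed, simplified recruitment model).
    """
    group = rng.randrange(2)          # start from the population mix
    total = 0
    for _ in range(n):
        total += group
        if rng.random() >= homophily:
            group = rng.randrange(2)  # unconstrained recruitment
    return total / n

def design_effect(n, homophily, reps=2000, seed=0):
    """Simulated variance of the estimate relative to simple random sampling."""
    rng = random.Random(seed)
    estimates = [simulate_chain(n, homophily, rng) for _ in range(reps)]
    srs_variance = 0.5 * 0.5 / n      # p(1 - p) / n with p = 0.5
    return statistics.variance(estimates) / srs_variance

for h in (0.0, 0.5, 0.9):
    print(h, round(design_effect(200, h), 2))
```

Under these assumptions the estimated design effect stays near 1 when recruitment is unconstrained and grows as homophily rises, which is the qualitative pattern the paragraph above describes.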
Controversies regarding RDS research designs have resulted from analysts focusing on systems with differing design effects. The issues include whether data from the seeds or early waves should be discarded, as when a “burn in” procedure is used in Markov analysis, and whether each recruitment chain (i.e., each set of subjects who share the same seed) should be treated as a distinct data set. This paper presents an analysis showing how the optimal research design varies as a function of the sample’s design effect. The paper also shows how high-design-effect samples can be converted into low-design-effect samples by partitioning the sample at the network choke points.