Conference Program

All Times ET

Thursday, June 9
Practice and Applications
Improving Algorithms for Big Data
Thu, Jun 9, 3:45 PM - 5:15 PM
Allegheny Grand Ballroom
 

Building the Foundation for More Flexible A/B Testing: Applications of Interim Monitoring to Large-Scale Data (310052)

Presentation

*Wenru Zhou, University of Colorado 
Miranda Kroehl, Charter Communications 
Alex Kaizer, University of Colorado 
Maxene Meier, Charter Communications 

Keywords: Interim monitoring, A/B testing, Error spending function, Stopping rules

Error spending functions and stopping rules have become powerful tools for interim analysis. Interim analysis is broadly desired not only in traditional clinical trials but also in A/B tests. Although many papers have summarized error spending approaches, a comprehensive review targeting large-scale data is needed to help industry practitioners find an optimal boundary easily. In this paper, we summarize sixteen existing boundaries: fifteen are built from five error spending functions, each allowing early termination for futility, for difference, or for both, and the sixteenth is the fixed sample size design. The simulation is based on a practical A/B testing problem comparing two independent proportions. Sample sizes range from approximately 500 per arm to 250,000 per arm, and designs with one, three, and nineteen interim analyses are included. Optimal boundaries are chosen using a loss function that applies different weights to the expected sample size under the null hypothesis, the expected sample size under the alternative hypothesis, and the maximum sample size. Results are presented for adequately powered, under-powered, and over-powered designs. For adequately powered studies, our simulation results suggest that designs with sequential monitoring that stop either for a detectable difference between variants or for futility can be used in most cases. We further posit that O’Brien-Fleming boundaries that stop for both may be most efficient, since this design had loss function values similarly low to those of the power boundaries with rho=2 and rho=3. For designs that are intentionally over-powered, the conclusion is similar to that for adequately powered designs. For under-powered designs, however, it is hard to identify a single efficient boundary across all weights, although we prefer stopping for futility or for both over stopping for difference alone.
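
As a rough illustration of two spending functions named in the abstract, the sketch below computes the cumulative type I error spent at a given information fraction for an O’Brien-Fleming-type function and for the power family with parameter rho. It assumes the standard Lan-DeMets form for the O’Brien-Fleming-type function and the form alpha * t^rho for the power family; the abstract does not state which parameterizations the authors used, and the function names, the two-sided alpha of 0.05, and the equally spaced looks are all hypothetical choices made for this example.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def obf_spending(t, alpha=0.05):
    """Cumulative two-sided alpha spent at information fraction t,
    using the Lan-DeMets O'Brien-Fleming-type form (assumed here)."""
    if not 0.0 < t <= 1.0:
        raise ValueError("information fraction must be in (0, 1]")
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z / t ** 0.5))


def power_spending(t, alpha=0.05, rho=2.0):
    """Cumulative alpha spent at information fraction t for the
    power family, alpha * t**rho (assumed here)."""
    if not 0.0 < t <= 1.0:
        raise ValueError("information fraction must be in (0, 1]")
    return alpha * t ** rho


def spent_per_look(spend, n_looks, **kwargs):
    """Alpha spent at each of n_looks equally spaced analyses
    (the interim looks plus the final analysis)."""
    fractions = [k / n_looks for k in range(1, n_looks + 1)]
    cumulative = [spend(t, **kwargs) for t in fractions]
    previous = [0.0] + cumulative[:-1]
    return [c - p for c, p in zip(cumulative, previous)]


if __name__ == "__main__":
    # Three interim analyses plus the final analysis gives four looks.
    for name, fn, kw in [
        ("O'Brien-Fleming type", obf_spending, {}),
        ("power, rho=2", power_spending, {"rho": 2.0}),
        ("power, rho=3", power_spending, {"rho": 3.0}),
    ]:
        increments = spent_per_look(fn, 4, **kw)
        print(name, [round(a, 5) for a in increments])
```

Both functions spend the full alpha by the final analysis and spend little of it at early looks. Turning the spent alpha into critical values at each look requires the joint distribution of the sequential test statistics, which this sketch does not attempt.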