Keywords: Experimentation, A/B testing, metrics
Large-scale experimentation, often referred to as A/B testing, is widely used by technology companies because it lets decision makers evaluate the change in Quality of Experience (QOE) caused by new engineering ideas. However, large-scale experimentation presents numerous challenges for the applied statistician, due not only to the sheer size of the data but also to complicated dependence structures, e.g., irregular within-unit temporal sampling and time-on-study effects. Two common approaches to handling these challenges are 1) careful metric development and 2) complicated modeling methods. Metric development requires consideration of numerators/denominators, scaling factors, and transformations, while formulating appropriate models requires model selection and validation. In an ideal world, we would invest in both metric and model development, but resource and time constraints often force us to prioritize one over the other. This naturally leads to the question: should we focus on the metrics or the model? We explore this question in the context of system reliability. Reliability measurements (crashes, hangs, page load times, etc.) provide an interesting test case because they substantially affect QOE but have characteristics that make them difficult to model. We use simulation studies to compare bias/variance trade-offs and Type I and Type II error rates. We end with recommendations for data scientists.
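To illustrate the kind of simulation study the abstract refers to, the sketch below estimates the Type I error rate of a two-sample test by Monte Carlo when the per-unit metric is heavy-tailed, as reliability measurements often are. This is a minimal illustration, not the paper's actual simulation design: the lognormal data-generating distribution, the Welch test with a normal approximation, and all function names are assumptions made for the example.

```python
import math
import random

def welch_p_value(x, y):
    """Two-sided p-value for a Welch two-sample t-test, using a normal
    approximation to the reference distribution (adequate for large samples)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    t = (mx - my) / math.sqrt(vx / nx + vy / ny)
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2))); two-sided tail probability.
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0))))

def type_i_error_rate(n_units=200, n_sims=2000, alpha=0.05, seed=0):
    """Fraction of null simulations (no true treatment effect) in which the
    test rejects; for a well-calibrated test this should be close to alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Both arms are drawn from the same heavy-tailed lognormal
        # "reliability metric" distribution (an assumed stand-in for, e.g.,
        # page load times), so every rejection is a false positive.
        control = [rng.lognormvariate(0.0, 2.0) for _ in range(n_units)]
        treatment = [rng.lognormvariate(0.0, 2.0) for _ in range(n_units)]
        if welch_p_value(control, treatment) < alpha:
            rejections += 1
    return rejections / n_sims

if __name__ == "__main__":
    print(type_i_error_rate())
```

The same scaffold extends to Type II error (power) by injecting a known treatment effect, and to comparing metric transformations (e.g., log or rank transforms) against more complicated models on identical simulated data.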