Abstract Details

Activity Number: 663 - Topics in Large-Scale Online Experimentation
Type: Topic Contributed
Date/Time: Thursday, August 2, 2018 : 10:30 AM to 12:20 PM
Sponsor: Section on Statistical Consulting
Abstract #330721
Title: A Decision-Theoretic Approach to A/B Testing
Author(s): David Goldberg* and James Johndrow
Companies: eBay and Stanford University
Keywords: A/B Testing; Decision Theory; p-value

We address the common situation of a firm that is continuously producing potential modifications to its web site. Each modification goes through an A/B test, but the metric of interest is noisy, making it unclear whether it is a good idea to simply ship whenever the metric has a higher value in treatment than control. What statistical analysis should be used? The standard paradigm is to assume a null hypothesis of no change in the metric and base the decision on a p-value: ship if p is below a cutoff, otherwise don't ship. This requires picking a specific p-value cutoff, such as p = 0.10, but the choice is typically ad hoc. We develop a principled method of selecting the cutoff using decision theory. Our results suggest that this is a promising approach that can yield a practical method of improving A/B testing decisions. Problems with the standard paradigm have been noted before, leading some to propose control of the false discovery rate (FDR). However, it is still necessary to make an arbitrary choice of the level at which to control FDR, and we know of no previous work addressing that question, either for conventional hypothesis testing or FDR.
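To make the idea concrete, here is a minimal Monte Carlo sketch of selecting a p-value cutoff by expected utility, in the spirit of the abstract. All quantities below (the Gaussian prior on the true lift, the noise scale, and the fixed cost of shipping) are illustrative assumptions, not values or modeling choices taken from the paper.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setup: each candidate change has a true lift delta drawn
# from a zero-mean Gaussian prior; the A/B test observes delta plus noise.
rng = np.random.default_rng(42)
n = 500_000            # number of simulated A/B tests
tau = 0.05             # assumed prior sd of the true lift
sigma = 1.0            # assumed sd of the measurement noise
cost = 0.002           # assumed fixed cost of shipping any change

delta = rng.normal(0.0, tau, n)   # true lift of each candidate change
x = rng.normal(delta, sigma)      # observed lift in the A/B test
p = norm.sf(x / sigma)            # one-sided p-value for H0: delta <= 0

# Utility of the rule "ship iff p < c": collect the true lift delta
# but pay the fixed shipping cost. Scan candidate cutoffs and pick the
# one with the highest average utility per test.
cutoffs = np.linspace(0.01, 0.50, 50)
utility = np.array([(delta[p < c] - cost).sum() / n for c in cutoffs])
best = cutoffs[np.argmax(utility)]
print(f"utility-maximizing p-value cutoff ~ {best:.2f}")
```

Under these particular assumptions the optimal cutoff is far from conventional choices like 0.05, which illustrates the abstract's point: the right cutoff depends on the prior over effect sizes and the costs of the decisions, not on a universal convention.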

Authors who are presenting talks have a * after their name.

Back to the full JSM 2018 program