Abstract:
|
Motivated by online advertising and personalized medicine, this paper proposes an algorithm for the batched bandit problem with high dimensional covariate, named as {\em Teamwork LASSO Bandit}. Specifically, we consider a multi-armed bandit problem in which arms are sampled in batches instead of one at a time. Meanwhile, high-dimensional covariate of each user is available to online decision makers. In this setup, the objective is to learn a teamwork strategy, together with a batch sampling scheme, that minimizes the accumulated total regret. To address a batch version of explore-exploit dilemma, we propose a novel framework to perform batch sampling by carefully alternating teamwork stage and greedy stage during the whole sequential decision making process. In theory, under a fundamental teamwork strategy, an asymptotic upper bound is provided for the expected cumulative total regret of the resulting algorithm.
|