The goal of mHealth is to encourage users' healthy behaviors through sequential interventions, such as sending tailored messages, with the occurrence of the desired behavior serving as the reward. Since the reward of a particular intervention varies with user characteristics and time, the system must learn the relationship between the reward and the user's contextual information while making choices and receiving rewards. Contextual multi-armed bandit (MAB) algorithms have shown promise for maximizing cumulative rewards in sequential decision tasks under uncertainty when contextual information is available. However, most existing contextual MAB algorithms rely on the strong assumption that the reward is a linear function of the context, which can be inappropriate in mHealth settings. We propose a new contextual MAB algorithm for a relaxed, semiparametric reward model that accommodates nonstationarity. We show that the high-probability upper bound on the regret incurred by the algorithm is of the same order as that of the Thompson sampling algorithm for linear reward models, without restricting the action-choice probabilities. We evaluate the proposed and existing algorithms through simulations and an application to real data.
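To make the baseline of comparison concrete, below is a minimal sketch of Thompson sampling for a purely linear reward model, the standard algorithm whose regret order the proposed method matches. This is not the paper's semiparametric algorithm; the dimensions, arm count, Gaussian context and noise distributions, and exploration scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 3, 4, 2000                 # context dim, number of arms, rounds
mu = rng.normal(size=d)              # unknown linear reward parameter
v = 1.0                              # posterior-sampling (exploration) scale

B = np.eye(d)                        # ridge precision matrix
y = np.zeros(d)                      # running sum of x_t * r_t
regret = 0.0

for t in range(T):
    X = rng.normal(size=(K, d))                  # per-arm contexts this round
    mu_hat = np.linalg.solve(B, y)               # posterior mean estimate
    mu_tilde = rng.multivariate_normal(          # sample from the posterior
        mu_hat, v**2 * np.linalg.inv(B))
    a = int(np.argmax(X @ mu_tilde))             # greedy w.r.t. the sample
    r = X[a] @ mu + rng.normal(scale=0.1)        # observe noisy linear reward
    B += np.outer(X[a], X[a])                    # update sufficient statistics
    y += X[a] * r
    regret += (X @ mu).max() - X[a] @ mu         # cumulative (pseudo-)regret
```

The semiparametric model the paper relaxes this to adds a time-varying baseline term common to all arms, so the linear update above would be biased there; this sketch only illustrates the linear-model benchmark named in the regret comparison.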