Abstract:
|
Optimizing a dynamic online advertising system can be difficult for a multi-armed bandit. We used a sliding window of data together with a time decay to limit overconfidence in a feature's estimated performance. We combined these techniques with a Thompson Sampling bandit to balance exploration and exploitation, minimize shocks during transition periods, and handle situations with no clear winner efficiently, resulting in a ~5% increase in overall payout.
|
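The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes binary rewards (for example click or no click), a Beta(1, 1) prior, a window measured in time steps, and an exponential half-life decay. The class name `DecayedWindowThompson` and the parameters `window` and `half_life` are hypothetical and chosen only for illustration.

```python
import random
from collections import deque


class DecayedWindowThompson:
    """Beta-Bernoulli Thompson Sampling over a sliding, time-decayed window.

    Illustrative sketch only; the window size, half-life, and prior are
    assumptions, not values taken from the abstract.
    """

    def __init__(self, n_arms, window=1000, half_life=200.0, seed=None):
        self.window = window        # observations older than this many steps are dropped
        self.half_life = half_life  # steps after which an observation counts half
        self.rng = random.Random(seed)
        self.t = 0                  # global step counter
        # Per-arm history of (step, reward) pairs, oldest first.
        self.history = [deque() for _ in range(n_arms)]

    def _posterior(self, arm):
        """Return (alpha, beta) built from decayed pseudo-counts plus a Beta(1, 1) prior."""
        alpha, beta = 1.0, 1.0
        for t_obs, reward in self.history[arm]:
            weight = 0.5 ** ((self.t - t_obs) / self.half_life)
            alpha += weight * reward
            beta += weight * (1 - reward)
        return alpha, beta

    def select(self):
        """Draw one sample per arm from its posterior and play the largest draw."""
        samples = [
            self.rng.betavariate(*self._posterior(arm))
            for arm in range(len(self.history))
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        """Record a 0/1 reward, then evict everything that fell out of the window."""
        self.t += 1
        self.history[arm].append((self.t, reward))
        for arm_history in self.history:
            while arm_history and self.t - arm_history[0][0] >= self.window:
                arm_history.popleft()
```

Because old observations lose weight and eventually leave the window, the posterior for an arm that has not been played recently widens back toward the prior. That keeps the bandit exploring after a shift in performance instead of remaining overconfident in a stale estimate.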