Thompson Sampling is a well-known and effective reinforcement learning algorithm for cases in which the probability of reward depends on a single categorical variable. Using a combination of unsupervised and supervised learning methods, we generalized the algorithm to the case in which the reward depends on multiple categorical and numerical variables, tuned it in simulation, and applied it to a fraud detection audit.
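For reference, the single-categorical-variable baseline can be sketched as standard Beta-Bernoulli Thompson Sampling. Each category is treated as an arm with a Beta posterior over its reward probability. The sketch below is illustrative only: it is not the generalized method described here, and the function name `thompson_sampling`, the uniform Beta(1, 1) prior, and the simulated `true_rates` are all assumptions.

```python
import random

def thompson_sampling(true_rates, n_rounds, seed=0):
    """Beta-Bernoulli Thompson Sampling where each arm is one category (hypothetical helper)."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k  # rewarded trials per category
    failures = [0] * k   # unrewarded trials per category
    pulls = [0] * k      # how often each category was chosen
    for _ in range(n_rounds):
        # Draw a plausible reward probability for each category from its
        # Beta(1 + successes, 1 + failures) posterior (assumed uniform prior).
        samples = [rng.betavariate(1 + successes[i], 1 + failures[i]) for i in range(k)]
        # Act greedily with respect to the sampled probabilities.
        arm = max(range(k), key=lambda i: samples[i])
        # Simulated environment: reward with the chosen category's true rate.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
    return pulls, successes, failures

pulls, s, f = thompson_sampling([0.05, 0.10, 0.30], n_rounds=5000)
print(pulls)  # most pulls should go to the category with the highest true rate
```

Sampling from the posterior, rather than always picking the current best estimate, keeps exploring categories whose reward probability is still uncertain.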
The method achieved a high cumulative gain: by checking the 50% of candidate cases selected by the algorithm, we detected 96% of fraud cases (a 96% true positive rate), accounting for 99% of the related monetary loss (the maximum possible reward).
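One plausible way to compute the two cumulative-gain figures is to rank cases by score, audit the top fraction, and measure the share of fraud cases and fraud loss captured. The helper `cumulative_gain` and the toy data below are hypothetical and are not taken from the study.

```python
def cumulative_gain(scores, is_fraud, loss, fraction):
    """Share of fraud cases and of fraud loss captured by auditing the top `fraction` of cases."""
    n = len(scores)
    k = int(round(fraction * n))  # number of cases audited
    # Rank cases from highest to lowest score and keep the top k.
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    total_fraud = sum(is_fraud)
    total_loss = sum(l for l, f in zip(loss, is_fraud) if f)
    tpr = sum(is_fraud[i] for i in top) / total_fraud           # true positive rate
    loss_share = sum(loss[i] for i in top if is_fraud[i]) / total_loss
    return tpr, loss_share

# Toy data (hypothetical): 8 cases, 4 of them fraudulent.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
is_fraud = [1, 1, 0, 1, 0, 0, 1, 0]
loss = [100, 50, 0, 30, 0, 0, 20, 0]
tpr, loss_share = cumulative_gain(scores, is_fraud, loss, 0.5)
print(tpr, loss_share)  # 0.75 0.9
```

Here auditing half the cases finds 3 of 4 fraud cases (75%) but 180 of 200 units of loss (90%), showing how the loss share can exceed the true positive rate when high-loss cases are ranked first.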