Keywords: Fraud detection, Call Detail Record analysis, Hidden Markov model, Viterbi algorithm, MCMC, Gibbs sampling.
Abuse of telephony services is a leading cause of financial losses in the telecommunications industry. One common type of abuse is the automated generation of calls with specialized devices, which standard contracts treat as an infraction. This type of fraud is often detectable by analyzing users' call logs, and telecommunication companies routinely analyze subscribers' call logs in the hope of catching abusers. Current detection techniques are mostly ad hoc, based on simple descriptive statistics or on classification models over aggregated data. Although they detect the most obvious fraud cases, they fail in more complex scenarios with lower outgoing traffic, which makes them easy to evade. Calling patterns that feature "bursts", that is, periods in which many outgoing calls are placed in rapid succession, are likely to correspond to fraudulent users. In this work, we propose a method for detecting potential fraud cases based on a principled analysis of the temporal patterns of outgoing calls. Our method uses a discrete-time Hidden Markov model with continuous emissions to model sequences of outgoing calls and to distinguish normal use from fraudulent use. We propose a fully Bayesian specification and an estimation method that combines MCMC sampling with the Viterbi algorithm. The method can estimate the probability that a given user is engaged in fraud. We analyze a real dataset consisting of all outgoing calls made during May 2017 by a sample of 5,000 subscribers of a Peruvian communications company. With the exception of three confirmed fraud cases, these data are completely unlabeled. Results show that our method is effective in identifying potential fraud cases. To our knowledge, this proposal is the first integrated methodology based on the analysis of temporal call patterns that produces estimates of the probability of fraudulent use in a fully unsupervised setting.
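To make the decoding step concrete, the following minimal sketch runs the Viterbi algorithm on a two-state Hidden Markov model (state 0 for normal use, state 1 for bursts). It is an illustration under stated assumptions, not the specification used in this work: the Gaussian emissions on log inter-call gaps, the function names, and all parameter values are hypothetical and fixed by hand, whereas the proposed method estimates the parameters by MCMC sampling.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Log-density of a normal distribution (assumed emission family)."""
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def viterbi(obs, log_init, log_trans, means, stds):
    """Most likely hidden-state path of a Gaussian-emission HMM, in log-space."""
    obs = np.asarray(obs, dtype=float)
    n_steps, n_states = len(obs), len(log_init)
    # log_emit[t, k] = log p(obs[t] | state k)
    log_emit = gaussian_logpdf(obs[:, None], means[None, :], stds[None, :])
    score = np.empty((n_steps, n_states))
    back = np.zeros((n_steps, n_states), dtype=int)
    score[0] = log_init + log_emit[0]
    for t in range(1, n_steps):
        # cand[i, j] = best score ending in state i at t-1, then moving to j
        cand = score[t - 1][:, None] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(n_states)] + log_emit[t]
    # Backtrack from the best final state
    path = np.empty(n_steps, dtype=int)
    path[-1] = np.argmax(score[-1])
    for t in range(n_steps - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Hypothetical parameters, chosen by hand for illustration only
log_init = np.log(np.array([0.9, 0.1]))
log_trans = np.log(np.array([[0.9, 0.1],
                             [0.1, 0.9]]))
means = np.array([7.0, 1.0])  # mean log-seconds between calls per state
stds = np.array([1.0, 1.0])

# Synthetic log inter-call gaps: long gaps, a run of short gaps, long gaps
log_gaps = np.array([7.2, 6.8, 1.1, 0.9, 1.3, 0.7, 7.5, 6.9])
states = viterbi(log_gaps, log_init, log_trans, means, stds)
print(states)
```

On this synthetic sequence the decoded path assigns the run of short gaps to the burst state and the long gaps to the normal state. Working in log-space avoids numerical underflow on long call sequences.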