Abstract:
|
Anomaly detection has been gaining attention in academia and industry, with many algorithms recently developed to identify diverse types of anomalies. Here we consider the detection of punctual anomalies: a single data point, or a cluster of data points, that behaves differently from the majority of the data. Many competent algorithms, such as RIDE (Repeated Impossible Discrimination Ensemble), LOF (Local Outlier Factor) and IF (Isolation Forest), involve explicit or implicit distance calculations, which can be computationally expensive. Both RIDE and IF are based on measures of point influence: how easily a point is isolated, or how strongly a point influences model fit. The implicit distance measures used by these methods add unnecessary complexity to the computations and randomness to the results. We propose a new method, RSOS (Random Sampling Outlier Score), which uses explicit pairwise distances to construct outlier scores and is made computationally efficient through subsampling. Our method outperforms RIDE, LOF and IF in many scenarios, and exhibits improved or competitive running time. We also investigate methods for variable selection and its impact on the anomaly detection procedures.
|