Conference Program


All Times ET

Thursday, June 9
Practice and Applications
Improving Algorithms for Big Data
Thu, Jun 9, 3:45 PM - 5:15 PM
Allegheny Grand Ballroom
 

Time Series Anomaly Detection in the Age of Big Data: Matching Data Generation Processes with Algorithms (310105)

Di Hu, MedStar Health Research Institute
*Gorkem Turgut Ozer, University of New Hampshire 
Courtney Paulson, Southern Utah University 

Keywords: Anomaly Detection, Big Data, Time Series, Outliers

Recent advancements in computing power and in the capacity to accumulate and store vast amounts of data have led to a proliferation of algorithmic innovations. Statistical models and machine learning algorithms that focus on detecting anomalies in time series data have also proliferated, ranging from conventional methods (e.g., the interquartile range) to more recent approaches (e.g., autoencoder neural networks). Yet, despite the significant attention focused on developing such algorithms, work investigating the fit between data and algorithms remains limited. In particular, detailed guidance on how algorithms perform under different data generation processes for time series is lacking. In this study, we examine the heterogeneity in performance of selected anomaly detection methods for time series, both parametric and nonparametric, on data that differ in how they were generated at the source. Namely, we test standard statistics (Interquartile Range and Generalized Extreme Studentized Deviate, or IQR and GESD), machine learning (Local Outlier Factor, or LOF), deep learning (Autoencoder Neural Network, or AENN), and Bayesian (Bayesian Abnormal Region Detector, or BARD) methods on data generated by two events that had considerable societal impact in 2020: the resurgence of the Black Lives Matter movement and the effects of the COVID-19 pandemic. Our findings show that the observations identified as anomalies differ considerably depending on the fit between an algorithm, its assumptions, and the data generation process to which it is applied. Specifically, the structure of the time series data, as defined by the volume, variety, and velocity of the data generation process of an event, leads to heterogeneous algorithmic performance in key dimensions.
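
For readers unfamiliar with the simplest method named in the abstract, the interquartile range (IQR) rule can be sketched as follows. This is an illustration only, not the authors' implementation: the function name `iqr_anomalies`, the multiplier `k=1.5` (the conventional Tukey fence), and the sample series are all assumptions introduced here.

```python
from statistics import quantiles


def iqr_anomalies(values, k=1.5):
    """Return indices of points outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; k=1.5 is the conventional
    Tukey fence, not a value taken from the study.
    """
    # Quartiles of the observed values (inclusive method treats the
    # data as the full population rather than a sample).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Made-up daily counts with one obvious spike at index 6.
series = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13, 11, 12]
print(iqr_anomalies(series))
```

Because the rule ignores temporal order and assumes a roughly stable spread, it may flag different points than methods such as LOF or an autoencoder on the same series, which is the kind of algorithm-to-data fit the abstract discusses.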