Abstract:
|
Anomaly detection has recently attracted attention in many application areas. In our IBM mail transit time measurement study, we developed entropy-based data cleaning rules to optimize the same-day delivery cut-off time for a panel of more than 10,000 national reporters maintained by IBM. Data visualization and correlation analysis were used to define the final reporting cut-off times for different distance buckets. The Pearson correlation coefficient was used to determine which variables influenced the cleaning rules. Graphs of selected percentiles were produced so that the data could be visualized in simplified form. Multivariate Gaussian distributions were also used as an unsupervised anomaly detection method in the data cleaning analysis. The recommendations based on these techniques have been adapted and implemented in the IBM national large-scale measurement study (more than 10 million reporter scans annually), where they have performed well.
|
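To make the multivariate Gaussian step concrete, the following is a minimal sketch of unsupervised anomaly detection with a fitted Gaussian density. It is not the study's implementation: the feature names (transit days, distance in km), the synthetic data, and the 1% flagging quantile are all assumptions chosen for illustration, and the sketch assumes at least two numeric features per record.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the mean vector and covariance matrix from the rows of X."""
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    return mu, sigma

def log_density(X, mu, sigma):
    """Log of the multivariate Gaussian density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    # Solve a linear system rather than inverting sigma explicitly.
    solved = np.linalg.solve(sigma, diff.T).T
    mahal_sq = np.einsum("ij,ij->i", diff, solved)  # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + mahal_sq)

def flag_anomalies(X, quantile=0.01):
    """Flag the records whose fitted density falls below the given quantile."""
    mu, sigma = fit_gaussian(X)
    scores = log_density(X, mu, sigma)
    threshold = np.quantile(scores, quantile)
    return scores < threshold

# Hypothetical records: column 0 = transit days, column 1 = distance in km.
rng = np.random.default_rng(0)
typical = rng.multivariate_normal(
    mean=[2.0, 300.0],
    cov=[[0.25, 20.0], [20.0, 10000.0]],
    size=500,
)
suspect = np.array([[9.0, 50.0]])  # nine days for a 50 km trip
X = np.vstack([typical, suspect])

flags = flag_anomalies(X, quantile=0.01)
print("flagged records:", int(flags.sum()))
print("suspect record flagged:", bool(flags[-1]))
```

Records flagged this way are only candidates for removal; in a cleaning workflow the quantile would need to be tuned against records already known to be valid or invalid.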