Scalable KDE-based top-n local outlier detection over large-scale data streams

作者:

Highlights:

摘要

The detection of local outliers over high-volume data streams is critical for diverse real-time applications in the real world, where the distributions in different subsets of the data tend to be skewed. However, existing methods are not scalable to large-scale high-volume data streams owing to the high complexity of the re-detection of data updates. In this work, we propose a top-n local outlier detection method based on Kernel Density Estimation (KDE) over large-scale high-volume data streams. First, we define a KDE-based Outlier Factor (KOF) to measure the local outlierness score for the data points. Then, we propose the upper bounds of the KOF and an upper-bound-based pruning strategy to quickly eliminate the majority of the inlier points by leveraging the upper bounds without computing the expensive KOF scores. Moreover, we design an Upper-bound pruning-based top-n KOF detection method (UKOF) over data streams to efficiently address the data updates in a sliding window environment. Furthermore, we propose a Lazy update method of UKOF (LUKOF) for bulk updates in high-speed large-scale data streams to further minimize the computation cost. Our comprehensive experimental study demonstrates that the proposed method outperforms the state-of-the-art methods by up to 3,600 times in speed, while achieving the best performance in detecting local outliers over data streams.

论文关键词:Local outlier detection,Data streams,Kernel Density Estimation,Upper-bound based pruning

论文评审过程:Received 9 December 2019, Revised 6 May 2020, Accepted 22 June 2020, Available online 30 June 2020, Version of Record 2 July 2020.

论文官网地址:https://doi.org/10.1016/j.knosys.2020.106186