A relative patterns discovery for enhancing outlier detection in categorical data
Authors:
Highlights:
• Frequent-itemset-mining (FIM) has been used to detect outliers in categorical data.
• Previous FIM-based studies may suffer from the problem of distortion.
• This paper aims to design a mechanism free of distortion and inefficiency.
• We introduce “relative patterns discovery”, a new perspective on association analysis.
• Our method enables users to discover decisive factors towards the right decision.
Abstract
Outlier (also known as anomaly) detection is widely applied in areas such as disease diagnosis, credit evaluation, and cybercrime investigation. Recently, several studies based on frequent itemset mining (FIM) have been proposed to detect outliers in categorical data. For efficiency, these FIM-based studies prune (ignore) the majority of the data by imposing a threshold, restricting the length of the pattern, or both, and then evaluate observations using only the limited information that remains. Although highly efficient, such pruning suffers from distortion: accuracy can fall to a low level of discernment, and in certain cases the judgment is even reversed. In this paper, we introduce the concept of relative patterns discovery, a new perspective on association analysis. To explore the relative patterns efficiently, we devise a hash-index-based intersecting approach (called the HA). Based on the knowledge of relative patterns, we propose an unsupervised approach (called the UA) to evaluate which observations are anomalous. Rather than relying on limited information, our method differentiates the features of observations without the problem of distortion. An empirical investigation on eight real-world datasets from the UCI Machine Learning Repository demonstrates that our method generally outperforms the previous studies in both accuracy and efficiency. We also demonstrate that our method is highly efficient in terms of execution complexity, especially on high-dimensional data. Furthermore, our method can represent a natural panorama of the data, which is appropriate in controlled experiments for discovering more decisive factors in outlier detection.
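To make the pruning described above concrete, the following is a minimal sketch of a generic FIM-style outlier score. It is not the paper's HA or UA, whose details are not given in this abstract. The function names `frequent_itemsets` and `support_scores`, the parameters `min_support` and `max_len`, and the toy data are all assumptions chosen for illustration. Each observation is scored by the summed support of the frequent itemsets it contains, so a lower score suggests a more anomalous observation.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(rows, min_support, max_len):
    """Return {itemset: support} for itemsets that survive both pruning rules.

    rows        -- list of dicts mapping attribute name to categorical value
    min_support -- threshold; itemsets below it are pruned (assumed parameter)
    max_len     -- maximum itemset length considered (assumed parameter)
    """
    n = len(rows)
    counts = Counter()
    for row in rows:
        items = sorted(row.items())  # (attribute, value) pairs
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def support_scores(rows, min_support=0.5, max_len=2):
    """Score each row by the average support of the frequent itemsets it contains."""
    freq = frequent_itemsets(rows, min_support, max_len)
    if not freq:
        return [0.0] * len(rows)
    scores = []
    for row in rows:
        items = set(row.items())
        covered = sum(sup for itemset, sup in freq.items() if itemset <= items)
        scores.append(covered / len(freq))
    return scores


# Hypothetical toy data: three similar observations and one that differs.
rows = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "blue", "shape": "square"},
]
scores = support_scores(rows, min_support=0.5, max_len=2)
most_anomalous = min(range(len(rows)), key=scores.__getitem__)
```

In this sketch, every itemset below `min_support` or longer than `max_len` is discarded before scoring, so any evidence carried by rare or long patterns is lost. That loss of information is the kind of pruning the abstract associates with distortion.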
Keywords: Association analysis, Categorical data, Frequent itemsets mining, Outlier detection, Unsupervised method
Review history: Received 23 September 2013, Revised 27 June 2014, Accepted 20 August 2014, Available online 28 August 2014.
Paper URL: https://doi.org/10.1016/j.dss.2014.08.006