A probabilistic measure of similarity for binary data in pattern recognition

Authors:

Abstract

This paper proposes a new index of association, or proximity, between binary vectors that represent features or characteristics of objects. The choice of such an index is the crucial first step in pattern recognition and exploratory data analysis. Existing proximity indices are reviewed and categorized according to three interpretations of the significance of matches. The proposed index, called the permutation index, is the probability of achieving no more than the observed number of matches under a random labelling, or permutation, hypothesis. The permutation index is quantitative in that its value has a concrete, probabilistic meaning under a reasonable hypothesis of randomness, so thresholds for decision making can be established from theory. A randomized index having a uniform distribution is also defined. Several computational aspects of the permutation index are examined, and two applications are described: one with questionnaire data and the other with decision tree design.
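
The abstract's definition can be illustrated with a minimal sketch, which is not the paper's own code. It assumes that a "match" is a position where both vectors are 1, and that the random permutation hypothesis keeps the number of 1s in each vector fixed, so the match count follows a hypergeometric distribution (the keyword list points the same way). With the counts of 1s fixed, counting 0-0 agreements as well gives the same probability, because the total number of agreements equals n - a - b + 2s and so increases with s. The function names `hypergeom_pmf`, `permutation_index`, and `randomized_index` are hypothetical, and the randomization shown is the standard device for a discrete statistic, assumed here rather than taken from the paper.

```python
from math import comb
import random


def hypergeom_pmf(k, n, a, b):
    """P(exactly k joint 1s) for n positions, a ones in x, b ones in y,
    when the labels of one vector are randomly permuted (assumed model)."""
    if k < max(0, a + b - n) or k > min(a, b):
        return 0.0
    return comb(a, k) * comb(n - a, b - k) / comb(n, b)


def _counts(x, y):
    """Return (n, a, b, s): length, ones in x, ones in y, joint 1-1 matches."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    s = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    return len(x), sum(x), sum(y), s


def permutation_index(x, y):
    """Probability of no more than the observed number of matches."""
    n, a, b, s = _counts(x, y)
    return sum(hypergeom_pmf(k, n, a, b) for k in range(s + 1))


def randomized_index(x, y, u=None):
    """P(S < s) + u * P(S = s), with u uniform on [0, 1) unless supplied.
    Assumed construction; uniform under the permutation hypothesis."""
    n, a, b, s = _counts(x, y)
    if u is None:
        u = random.random()
    below = sum(hypergeom_pmf(k, n, a, b) for k in range(s))
    return below + u * hypergeom_pmf(s, n, a, b)


if __name__ == "__main__":
    x = [1, 1, 0, 0]
    y = [1, 0, 1, 0]
    print(permutation_index(x, y))
    print(randomized_index(x, y, u=0.5))
```

Under these assumptions, a value near 1 means the observed number of matches is at least as large as almost every random labelling would produce, which is what allows a decision threshold to be read as a probability.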

Keywords: Similarity index, Permutation statistic, Hypergeometric distribution, Feature selection, Pattern recognition

Article history: Received 27 January 1988, Accepted 2 August 1988, Available online 19 May 2003.

Official paper URL: https://doi.org/10.1016/0031-3203(89)90049-6