The coefficient of intrinsic dependence (feature selection using el CID)
Abstract
Measuring the strength of dependence between two sets of random variables lies at the heart of many statistical problems, in particular feature selection for pattern recognition. We believe that there are some basic desirable criteria for a measure of dependence that are not satisfied by many commonly employed measures, such as the correlation coefficient. Briefly stated, a measure of dependence should: (1) be model-free and invariant under monotone transformations of the marginals; (2) fully differentiate different levels of dependence; (3) be applicable to both continuous and categorical distributions; (4) not require the dependence of X on Y to be the same as the dependence of Y on X; (5) be readily estimated from data; and (6) be straightforwardly extended to multivariate distributions. The new measure of dependence introduced in this paper, called the coefficient of intrinsic dependence (CID), satisfies these criteria. The main motivating idea is that Y is strongly (respectively, weakly) dependent on X if and only if the conditional distribution of Y given X is significantly (respectively, mildly) different from the marginal distribution of Y. We measure the difference by a normalized integrated squared-difference distance, so that the full range of dependence is adequately reflected in the interval [0, 1]. The paper treats estimation of the CID, provides simulations and comparisons, and applies the CID to gene prediction and cancer classification based on gene-expression measurements from microarrays.
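The motivating idea can be illustrated with a small sketch. The abstract does not give the formula or the estimator, so everything below is an assumption made for illustration: the function name `cid_sketch`, the choice of equal-frequency bins on the ranks of X (parameter `n_bins`), and the normalization by the integral of F(y)(1 - F(y)), which is the variance of the indicator of {Y <= y}. It compares the empirical conditional distribution function of Y within each X-bin to the marginal one, integrates the squared difference over the observed Y values, and divides by the normalizer so the result falls in [0, 1]. It is not the paper's estimator.

```python
from bisect import bisect_right


def cid_sketch(x, y, n_bins=10):
    """Illustrative dependence of Y on X in [0, 1] (not the paper's estimator).

    Assumptions: X is discretized into equal-frequency bins by rank, and the
    squared gap between conditional and marginal empirical CDFs of Y is
    normalized by the integrated variance F(y) * (1 - F(y)).
    """
    n = len(y)
    if n == 0 or len(x) != n:
        raise ValueError("x and y must be non-empty and of equal length")

    # Group y values by the rank-based bin of the matching x value.
    # Using ranks makes the result invariant to monotone transforms of x.
    order = sorted(range(n), key=lambda i: x[i])
    bins = [[] for _ in range(n_bins)]
    for rank, i in enumerate(order):
        bins[rank * n_bins // n].append(y[i])
    for b in bins:
        b.sort()

    grid = sorted(y)  # evaluate the CDFs at the observed y values
    numerator = 0.0
    denominator = 0.0
    for t in grid:
        f = bisect_right(grid, t) / n  # marginal empirical CDF at t
        denominator += f * (1.0 - f)
        for b in bins:
            if b:
                fb = bisect_right(b, t) / len(b)  # conditional CDF at t
                numerator += (len(b) / n) * (fb - f) ** 2

    # A constant y has no variability to explain; report zero dependence.
    return numerator / denominator if denominator > 0 else 0.0


if __name__ == "__main__":
    xs = [i - 100 for i in range(201)]
    ys = [v * v for v in xs]  # Y is a function of X, but not the reverse
    print("Y on X:", cid_sketch(xs, ys))
    print("X on Y:", cid_sketch(ys, xs))
```

Because the weighted average of the per-bin conditional CDFs equals the marginal CDF, the numerator is a between-bin variance and cannot exceed the denominator, which is what keeps the ratio in [0, 1]. The demo uses Y = X², where Y is determined by X but X is only determined up to sign by Y, so the two directions give different values, in the spirit of criterion (4).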
Keywords: Classification, Correlation, Dependence, Feature-selection, Microarray, Prediction
Article history: Received 20 May 2004, Revised 13 September 2004, Accepted 13 September 2004, Available online 29 December 2004.
Paper URL: https://doi.org/10.1016/j.patcog.2004.09.002