Termset weighting by adapting term weighting schemes to utilize cardinality statistics for binary text categorization

Authors: Dima Badawi, Hakan Altınçay

Abstract

This study proposes a novel termset weighting scheme based on cardinality statistics. Specifically, a termset is evaluated by the number of its member terms that appear in a document. Existing term weighting schemes are adapted on the basis of a recently verified hypothesis that the occurrence of only a subset of a termset's member terms may also convey useful information about class membership. The weight of a given termset is computed as the product of two factors. The first is a function of the frequencies of the member terms that occur in the given document; the second takes into account the numbers of positive and negative training documents in which the same number of member terms appear. Because a termset receives a non-zero weight even when only a subset of its member terms appears, the discriminative ability of different member term subsets is taken into account.
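The two-factor product described above can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything concrete here is an assumption made for illustration: the first factor is taken to be a sum of log-scaled member term frequencies, and the second is a relevance-frequency-style ratio, log2(2 + p_k / max(1, n_k)), where p_k and n_k are the numbers of positive and negative training documents in which exactly k member terms appear. The function names (`cardinality`, `cardinality_counts`, `termset_weight`) are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def cardinality(termset, doc_tokens):
    """Number of member terms of `termset` that appear in the document."""
    present = set(doc_tokens)
    return sum(1 for term in termset if term in present)


def cardinality_counts(termset, docs, labels):
    """Count positive and negative training documents per cardinality value.

    Returns two Counters mapping k -> number of documents in which exactly
    k member terms appear (label 1 = positive, anything else = negative).
    """
    pos, neg = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        k = cardinality(termset, tokens)
        if label == 1:
            pos[k] += 1
        else:
            neg[k] += 1
    return pos, neg


def termset_weight(termset, doc_tokens, pos, neg):
    """Weight = (frequency factor) * (cardinality-based collection factor).

    Both factors are illustrative assumptions, not the paper's formulas.
    """
    tf = Counter(doc_tokens)
    k = sum(1 for term in termset if tf[term] > 0)
    if k == 0:
        return 0.0  # no member term appears, so the termset gets zero weight
    # Assumed first factor: log-scaled frequencies of the member terms present.
    freq_factor = sum(math.log(1 + tf[term]) for term in termset)
    # Assumed second factor: RF-style ratio over documents with the same k.
    collection_factor = math.log2(2 + pos[k] / max(1, neg[k]))
    return freq_factor * collection_factor
```

Under these assumptions, a document containing only one of two member terms still receives a non-zero weight, and that weight depends on how well "exactly one member present" separates the positive and negative training documents.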

Keywords: Termsets, Termset cardinality, Termset weighting, Termset selection, Document representation, Text categorization

Paper URL: https://doi.org/10.1007/s10489-017-0911-6