Improving NCD accuracy by combining document segmentation and document distortion
Authors: Ana Granados, Rafael Martínez, David Camacho, Francisco de Borja Rodríguez
Abstract
Compression distances have been applied to a broad range of domains because of their parameter-free nature, wide applicability and leading efficacy. However, they have one characteristic that can be a drawback in particular circumstances: when they are used to compare two objects of very different sizes, they do not consider the objects similar even if one contains the other as a substring. This work addresses that issue for the case in which compression distances are used to calculate similarities between documents. The proposed approach combines document segmentation and document distortion. Document segmentation is used to tackle the above-mentioned drawback, while document distortion is used to help compression distances obtain more reliable similarities. The results show that combining both techniques gives better results than applying neither or applying either one alone. These results are consistent across datasets of diverse nature.
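The size-mismatch drawback and the two remedies named in the abstract can be illustrated with a minimal sketch. It uses the usual definition of the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. Everything else here is an assumption made for illustration and is not taken from the paper: the choice of zlib as compressor, fixed-size byte segments, taking the minimum distance over segments, and replacing caller-supplied words with asterisks as the distortion. The function names are hypothetical.

```python
import re
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def segmented_ncd(small: bytes, large: bytes, seg_len: int) -> float:
    """Hypothetical segmentation: cut the larger document into fixed-size
    pieces and keep the smallest distance to any single piece."""
    segments = [large[i:i + seg_len] for i in range(0, len(large), seg_len)]
    return min(ncd(small, seg) for seg in segments)


def distort(text: str, words: set[str]) -> str:
    """Hypothetical distortion: replace each listed word with asterisks of
    the same length, so the text layout is preserved."""
    def mask(match: re.Match) -> str:
        word = match.group(0)
        return "*" * len(word) if word.lower() in words else word
    return re.sub(r"\w+", mask, text)
```

Under these assumptions, a short document that appears verbatim inside a much longer one would typically receive a large plain `ncd` value, because the longer document dominates the numerator, whereas `segmented_ncd` compares it against pieces of comparable size. How the segment length and the word list are chosen is left open in this sketch.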
Keywords: Algorithmic information theory, Data compression, Information filtering, Word removal, Document representation
Paper URL: https://doi.org/10.1007/s10115-013-0664-4