Document analysis and visualization with zero-inflated Poisson

Authors: Dora Alvarez, Hugo Hidalgo

Abstract

Data visualization aims to produce a graphic representation of high-dimensional information. The data are projected onto a lower-dimensional space, and structure is sought in the projections. Among the available projection-based methods, the Generative Topographic Mapping (GTM) has become an important probabilistic framework for modeling data. Applying it to document data requires replacing the original Gaussian model with one that handles binary or multinomial variables. Several modifications of GTM have been proposed for this kind of data, but the resulting latent projections are scattered across the visualization plane. This paper proposes a document visualization method based on a generative probabilistic model consisting of a mixture of Zero-inflated Poisson distributions. The method is evaluated in two ways: cluster formation in the latent projections is measured with an index based on Fisher’s classifier, and topology preservation is measured with Sammon’s stress. The method is compared with GTM implementations using Gaussian, multinomial and Poisson distributions, and with a Latent Dirichlet model; the proposed method performs better. A graphic presentation of the projections is also provided, showing the advantage of the developed method in terms of visualization and class separation. A detailed analysis of some documents projected onto the latent representation showed that most of the documents lying far from their corresponding cluster could be identified as outliers.
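
As a rough illustration of two quantities named in the abstract, the sketch below computes the probability mass of a single Zero-inflated Poisson distribution and the standard form of Sammon's stress between a set of high-dimensional points and their low-dimensional projections. It is not the paper's implementation: the function names `zip_pmf` and `sammon_stress` and the parameter names `lam` and `pi` are assumptions made here for illustration, and the paper's full model (a mixture of such distributions within a latent-variable framework) is not reproduced.

```python
import math


def zip_pmf(k, lam, pi):
    """Probability of count k under a zero-inflated Poisson distribution.

    lam: Poisson rate (must be > 0); pi: probability of a structural zero (0..1).
    Both parameter names are illustrative assumptions, not the paper's notation.
    """
    if k < 0:
        return 0.0
    # Poisson term computed in log space to avoid overflow in k!
    poisson = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    if k == 0:
        # A zero can be structural (pi) or come from the Poisson part
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def sammon_stress(high, low):
    """Sammon's stress between original points and their projections.

    high, low: equal-length sequences of coordinate tuples.
    Pairs with zero original distance are skipped to avoid dividing by zero.
    """
    total = 0.0  # sum of original pairwise distances
    error = 0.0  # sum of weighted squared distance differences
    for i in range(len(high)):
        for j in range(i + 1, len(high)):
            d_star = math.dist(high[i], high[j])
            if d_star == 0.0:
                continue
            d = math.dist(low[i], low[j])
            total += d_star
            error += (d_star - d) ** 2 / d_star
    if total == 0.0:
        raise ValueError("all original pairwise distances are zero")
    return error / total
```

With `pi = 0` the first function reduces to the ordinary Poisson probability, which is one way to see why zero inflation suits sparse term counts: the extra mass at zero accounts for words that never occur in a document. A stress of zero means all pairwise distances are preserved exactly by the projection.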

Keywords: Document visualization, Zero-inflated Poisson, Generative model

Official paper URL: https://doi.org/10.1007/s10618-009-0127-4