Predictive Modelling of Heterogeneous Sequence Collections by Topographic Ordering of Histories
Author: Ata Kabán
Abstract
We propose a model-based approach to the twofold problem of prediction and exploratory analysis of heterogeneous symbolic sequence collections. Our model is based on seeking low entropy local representations joined together with a smooth nonlinear mixing process. Low entropy components are desirable, as they tend to be both more interpretable and more predictable. The nonlinear mixing in turn acts as a regulariser, and in addition, it creates a topographic ordering of the sequence histories, which is useful for exploratory purposes. The combination of these two modelling elements is performed through the generative probabilistic formalism, which ensures a flexible and technically sound predictive modelling framework. Unlike previous generative topographic modelling approaches for discrete data, the estimation algorithm associated with our model is designed to scale to large data sets by exploiting data sparseness. In addition, local convergence is guaranteed without the need for tuning optimisation parameters or making approximations to the non-Gaussian likelihood. These characteristics make it the first generative topographic model for discrete symbolic data with large scale real-world applicability. We analyse and discuss the relationship of our approach with a number of models and methods. We empirically demonstrate robustness against varying sample sizes, leading to significant improvements in terms of predictive performance over the state of the art. Finally we detail an application to the prediction and exploratory analysis of a large real-world web navigation sequence collection.
Keywords: Probabilistic modelling, Generative topographic mapping, Generalisation across multiple sequences, Data prediction, Data explanation, Visualisation
Paper URL: https://doi.org/10.1007/s10994-007-5008-8