A novel scaled Dirichlet-based statistical framework for count data modeling: Unsupervised learning and exponential approximation

Authors:

Highlights:

• We propose a novel model called the Multinomial Scaled Dirichlet (MSD) for modeling count data.

• We derive a new family of distributions, which we call EMSD, that approximates the MSD distribution in order to handle high-dimensional and sparse data.

• We develop a minimum message length (MML) criterion for determining the number of components in an EMSD mixture model.

• We evaluate the performance of both approaches through a set of extensive empirical experiments on challenging real-world applications.

• Results show that both MSD and EMSD successfully capture the burstiness phenomenon, and that EMSD is many times faster than MSD.


Keywords: Count data, Burstiness, DAEM, Multinomial, Scaled Dirichlet, Finite mixture models, Exponential family approximation, Model selection, Text collection, Image databases

Review history: Received 3 May 2018, Revised 25 April 2019, Accepted 30 May 2019, Available online 1 June 2019, Version of Record 5 June 2019.

Paper URL: https://doi.org/10.1016/j.patcog.2019.05.038