A novel scaled Dirichlet-based statistical framework for count data modeling: Unsupervised learning and exponential approximation
Authors:
Highlights:
• We propose a novel model called the Multinomial Scaled Dirichlet (MSD) for modeling count data.
• We derive a new family of distributions, which we call EMSD, that approximates MSD distributions in order to handle high-dimensional and sparse data.
• We develop a minimum message length (MML) criterion for determining the number of components in an EMSD mixture model.
• We evaluate the performance of both approaches through a set of extensive empirical experiments on challenging real-world applications.
• Results show that both MSD and EMSD successfully capture the burstiness phenomenon, and that EMSD is many times faster than MSD.
Keywords: Count data, Burstiness, DAEM, Multinomial, Scaled Dirichlet, Finite mixture models, Exponential family approximation, Model selection, Text collection, Image databases
Article history: Received 3 May 2018, Revised 25 April 2019, Accepted 30 May 2019, Available online 1 June 2019, Version of Record 5 June 2019.
Official paper URL: https://doi.org/10.1016/j.patcog.2019.05.038