Integrating learned and explicit document features for reputation monitoring in social media
Authors: Fernando Giner, Enrique Amigó, Felisa Verdejo
Abstract
Currently, monitoring reputation in social media is probably one of the most lucrative applications of information retrieval methods. However, this task poses new challenges due to the dynamicity of contents and the need for early detection of topics that affect the reputations of companies. Addressing this problem with learning mechanisms that are based on training data sets is challenging, given that unseen features play a crucial role. However, learning processes are necessary to capture domain features and dependency phenomena. In this work, based on observational information theory, we define a document representation framework that enables the combination of explicit text features and supervised and unsupervised signals into a single representation model. Our theoretical analysis demonstrates that the observation information quantity (OIQ) generalizes the most popular representation methods, in addition to capturing quantitative values, which is required for integrating signals from learning processes. In other words, the OIQ allows us to give the same treatment to features that are currently managed separately. Empirically, our experiments on the reputation-monitoring scenario demonstrated that adding features progressively from supervised (in particular, Bayesian inference over annotated data) and unsupervised learning methods (in particular, proximity to clusters) increases the similarity estimation performance. This result is verified under various similarity criteria (pointwise mutual information, Jaccard and Lin’s distances and the information contrast model). According to our formal analysis, the OIQ is the first representation model that captures the informativeness (specificity) of quantitative features in the document representation.
Keywords: Document representation, Information theory, Machine learning, Microblogging
Paper URL: https://doi.org/10.1007/s10115-019-01383-w