Dynamic sampling of text streams and its application in text analysis

作者:Gang Tian, Jiajia Huang, Min Peng, Jiahui Zhu, Yanchun Zhang

摘要

A large number of texts are rapidly generated as streaming data in social media. Since it is difficult to process such text streams with limited memory in real time, researchers are resorting to text stream compression and sampling to obtain a small portion of valuable information from the streams. In this study, we investigate the crucial question of how to use less memory space to store more valuable texts to maintain the global information of the stream. First, we propose a text stream sampling framework based on compressed sensing theory, which can sample a text stream with a lightweight framework to reduce the space consumption while still retaining the most valuable texts. We then develop a query word-based retrieval task as well as a topic detection and evolution analysis task on the sample stream to evaluate the performance of the framework in retaining valuable information. The framework is evaluated from several aspects using two representative datasets of social media, including compression ratio, runtime, information reserved rate, and efficiency of the text analysis tasks. Experimental results demonstrate that the proposed framework outperforms baseline methods and is able to complete the text analysis tasks with promising results.

论文关键词:Text stream, Compressed sensing, Sampling, Text analysis

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10115-017-1039-z