Extracting multiple news attributes based on visual features
作者:Wei Liu, Hualiang Yan, Jianguo Xiao
摘要
The problem of automatically extracting multiple news attributes from news pages is studied in this paper. Most previous work on web news article extraction focuses only on content. To meet a growing demand for web data integration applications, more useful news attributes, such as title, publication date, author, etc., need to be extracted from news pages and stored in a structured way for further processing. An automatic unified approach to extract such attributes based on their visual features, including independent and dependent visual features, is proposed. Unlike conventional methods, such as extracting attributes separately or generating template-dependent wrappers, the basic idea of this approach is twofold. First, candidates for each news attribute are extracted from the page based on their independent visual features. Second, the true value of each attribute is identified from the candidates based on dependent visual features such as the layout relationships among the attributes. Extensive experiments with a large number of news pages show that the proposed approach is highly effective and efficient.
论文关键词:Web data extraction, News article, Visual feature, Markov Logic Networks
论文评审过程:
论文官网地址:https://doi.org/10.1007/s10844-011-0163-6