Using linguistic features to automatically extract web page title

Authors:

Highlights:

• Successful title extraction must analyze both the DOM nodes and the title tag.

• Natural language processing improves the quality of the title.

• Visual and formatting features are less relevant for the task.

• A simpler classifier like k-NN performs as well as an advanced classifier like SVM.

• The proposed method significantly outperforms all existing ones by a clear margin.
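
The first highlight, that both the DOM nodes and the title tag need to be analyzed, can be illustrated with a minimal sketch. This is not the paper's method: the choice of heading tags as candidate nodes, the token-overlap score, and the names `_TextCollector`, `tokens`, and `guess_title` are all assumptions made for illustration, and no linguistic features or trained classifier are involved.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the <title> text and the text of each candidate DOM node."""

    # Assumption: only heading tags are treated as candidate nodes.
    CANDIDATE_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.title_text = ""
        self.candidates = []
        self._stack = []  # currently open <title> or candidate tags

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.CANDIDATE_TAGS:
            self._stack.append(tag)
            if tag != "title":
                self.candidates.append("")

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack:
            return
        if self._stack[-1] == "title":
            self.title_text += data
        else:
            self.candidates[-1] += data


def tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def guess_title(html):
    """Return the candidate node whose words best match the title tag.

    Falls back to the raw title tag text when no candidate overlaps.
    """
    parser = _TextCollector()
    parser.feed(html)
    title_tokens = tokens(parser.title_text)
    best, best_score = parser.title_text.strip(), 0.0
    for cand in parser.candidates:
        cand = " ".join(cand.split())  # normalize whitespace
        cand_tokens = tokens(cand)
        if not cand_tokens or not title_tokens:
            continue
        # Hypothetical score: share of candidate words found in the title tag.
        score = len(cand_tokens & title_tokens) / len(cand_tokens)
        if score > best_score:
            best, best_score = cand, score
    return best


page = (
    "<html><head><title>Cheap Flights to Rome | ExampleAir</title></head>"
    "<body><h2>Menu</h2><h1>Cheap Flights to Rome</h1></body></html>"
)
print(guess_title(page))
```

In this toy page the title tag carries a site-name suffix, while the matching heading holds only the title itself, which is why looking at one source alone can be insufficient.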

Keywords: Web content mining, Information extraction, Title extraction, Natural language processing, Machine learning

Review history: Received 5 October 2016, Revised 27 February 2017, Accepted 28 February 2017, Available online 2 March 2017, Version of Record 11 March 2017.

Official paper URL: https://doi.org/10.1016/j.eswa.2017.02.045