Sampling strategies for information extraction over the deep web

Authors:

Highlights:

• First large-scale and fine-grained evaluation of query-based sampling techniques.

• Learned keyword queries perform substantially better than queries derived from tuples.

• Focusing on—and processing exhaustively—effective queries leads to high efficiency.

• Focusing on—and processing in rounds—less-effective queries favors quality.

• Filtering underperforming queries favors sampling efficiency but hurts quality.

Keywords: Information extraction, Sampling, Deep web, Text mining, Scalability

Review history: Received 25 December 2015, Revised 21 November 2016, Accepted 23 November 2016, Available online 6 December 2016, Version of Record 6 December 2016.

Paper URL: https://doi.org/10.1016/j.ipm.2016.11.006