A clustering approach to extract data from HTML tables

Authors:

Highlights:

• User-friendly HTML tables are a popular means to publish data.

• It is difficult for software agents to leverage them automatically.

• We present a new method to extract their data automatically.

• Its approach is totally unsupervised and builds on genetic clustering.

• It is as effective as the best supervised proposal, but far more efficient.

Keywords: HTML tables, Data extraction, Clustering, Genetic algorithms

Review history: Received 30 October 2020, Revised 10 June 2021, Accepted 27 June 2021, Available online 13 August 2021, Version of Record 13 August 2021.

Official paper URL: https://doi.org/10.1016/j.ipm.2021.102683