Content-aware web robot detection
Authors: Athanasios Lagopoulos, Grigorios Tsoumakas
Abstract
Web crawlers account for more than a third of all web traffic and threaten the security, privacy, and veracity of web applications and their users. Businesses in finance, ticketing, and publishing, as well as websites with rich and unique content, are the most affected. To address this problem, we present a novel web robot detection approach that exploits the content of a website, based on the assumption that human web users are interested in specific topics, while web robots crawl the web randomly. Our approach extends the typical user session representation of log-based features with a novel set of features that capture the semantics of the content of the requested resources. In addition, we contribute a new real-world dataset, which we make publicly available, to help alleviate the scarcity of open data in this field. Empirical results on this dataset validate our assumption and show that our approach outperforms state-of-the-art methods for web robot detection.
Keywords: Web robot, Crawler, Semantics, Supervised learning, Latent Dirichlet allocation
Paper URL: https://doi.org/10.1007/s10489-020-01754-9
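
The abstract's central idea, extending log-based session features with features that describe the topics of the requested pages, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the two semantic features (entropy of the session's mean topic distribution and mean similarity between consecutive requests), and the toy topic vectors are hypothetical and are not the feature set defined in the paper. The sketch also assumes that each page already has a topic distribution, for example from a topic model such as latent Dirichlet allocation.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two topic vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def semantic_features(
    session_urls: List[str], page_topics: Dict[str, List[float]]
) -> Dict[str, float]:
    """Hypothetical semantic features for one session.

    page_topics maps each URL to a topic distribution that is assumed
    to have been computed beforehand by a topic model.
    """
    vectors = [page_topics[u] for u in session_urls if u in page_topics]
    if not vectors:
        return {"topic_entropy": 0.0, "mean_consecutive_similarity": 0.0}
    k = len(vectors[0])
    # Average topic distribution over all pages requested in the session.
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(k)]
    # Similarity between each request and the next one.
    sims = [cosine(a, b) for a, b in zip(vectors, vectors[1:])]
    return {
        # Low entropy: the session concentrates on few topics.
        "topic_entropy": entropy(mean),
        # A single-page session is treated as trivially coherent (1.0).
        "mean_consecutive_similarity": sum(sims) / len(sims) if sims else 1.0,
    }


# Toy topic distributions over three topics (illustrative values only).
page_topics = {
    "/a": [0.9, 0.1, 0.0],
    "/b": [0.8, 0.2, 0.0],
    "/c": [0.0, 0.1, 0.9],
    "/d": [0.1, 0.8, 0.1],
}

focused = semantic_features(["/a", "/b", "/a"], page_topics)
scattered = semantic_features(["/a", "/c", "/d"], page_topics)

# Extend a (hypothetical) log-based feature dict with the semantic features.
log_features = {"total_requests": 3.0, "image_request_ratio": 0.0}
session_vector = {**log_features, **focused}
```

Under the abstract's assumption, a session that stays on related pages (`focused`) should show lower topic entropy and higher consecutive similarity than one that jumps across unrelated pages (`scattered`). The combined `session_vector` is the kind of input a supervised classifier could then be trained on.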