Postal address extraction from the web: a comprehensive survey

Authors: Mohammed Kayed, Sara Dakrory, A. A. Ali

Abstract

The Web is a source of information for Location-Based Service (LBS) applications. These applications lack postal addresses for many user Points of Interest (POIs), such as schools, hospitals, and restaurants, because these locations are annotated manually, either from the yellow pages or by the location owners (users/companies). Our study shows that Google Maps, a common LBS application, contains only about 32.5% of the public schools officially registered in the documents provided by the Directorate of Education in Egypt. The remaining missing school addresses, however, could be harvested from the Web (e.g., from social media). To the best of our knowledge, no prior survey has compared the existing Web postal address extraction approaches. Additionally, all proposed approaches for address extraction are local: they work in specific countries/locations with particular languages and cannot be used, or even adapted, to work in other countries/locations with other languages. Furthermore, Web postal address extraction has not been addressed in many countries, such as Arab countries (e.g., Egypt). This paper discusses the problem of address extraction, and it highlights and compares the techniques recently used to extract addresses from Web pages. In addition, it investigates the discrepancy of knowledge among existing systems. Moreover, it provides a comprehensive review of the geographical gazetteers used in Web postal address extraction approaches and compares their data quality dimensions.

Keywords: Postal Address Extraction, Web Information Extraction, Gazetteers, Machine Learning

Official paper URL: https://doi.org/10.1007/s10462-021-09983-1