GAEMTBD: Genetic algorithm based entity matching techniques for bibliographic databases

Authors: Sumit Mishra, Sriparna Saha, Samrat Mondal

Abstract

Entity matching maps the records in a database to their corresponding entities. It is a well-known problem in the fields of databases and artificial intelligence. In digital libraries such as DBLP, ArnetMiner, Google Scholar, Scopus, Web of Science, AllMusic, and IMDB, some attributes may evolve over time, i.e., their values change at different points in time. For example, in bibliographic databases such as DBLP and ArnetMiner, which maintain publication details of various authors, an author's affiliation and email address may change. Similarly, a taxpayer's address can change over time, and people sometimes change their surnames after marriage. When a database contains records of this kind and the number of records grows large, identifying which records belong to which entity becomes challenging because no proper key is available. In this paper, the automatic partitioning of records is posed as an optimization problem, and a genetic algorithm based technique is proposed to solve the entity matching problem. The proposed approach automatically determines the number of partitions in a bibliographic dataset. A comparative analysis against two existing systems, DBLP and ArnetMiner, over sixteen bibliographic datasets demonstrates the efficacy of the proposed approach.

Keywords: Entity matching, Genetic algorithm, Cluster validity index, Distance measure, Record similarity, Bibliographic database

Paper URL: https://doi.org/10.1007/s10489-016-0874-z