A supervised machine learning approach to author disambiguation in the Web of Science
Authors:
Highlights:
• Machine learning is used for author name disambiguation in the Web of Science.
• Supervised learning relies on the ResearcherID author identifier and uses random forest and logistic regression.
• Name frequency-based, bibliographic, thematic, and address-based features are used and evaluated.
• Records with missing first names are included so that the model stays robust to changes in the quality of new data.
• Pairwise paper predictions are clustered into author profiles using the Infomap graph community detection method.
• The average K-metric, a cluster measure, reaches values >0.78, which suggests reasonable performance of our approach.
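The last two highlights (clustering pairwise predictions into author profiles, then scoring the clusters) can be illustrated with a minimal sketch. It makes several assumptions not stated on this page: connected components stand in for Infomap as a much simpler grouping step, the K-metric is taken in its commonly used form as the geometric mean of average cluster purity (ACP) and average author purity (AAP), and the paper IDs, author labels, and function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def cluster_pairs(papers, same_author_pairs):
    """Group papers into profiles via connected components.

    Stand-in for Infomap: any two papers linked by a chain of positive
    pairwise predictions end up in the same profile.
    """
    parent = {p: p for p in papers}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in same_author_pairs:
        parent[find(a)] = find(b)

    return {p: find(p) for p in papers}


def k_metric(predicted, truth):
    """Assumed definition: K = sqrt(ACP * AAP)."""
    papers = list(truth)
    n = len(papers)
    # overlap[(cluster, author)] = number of papers shared by both
    overlap = Counter((predicted[p], truth[p]) for p in papers)
    cluster_sizes = Counter(predicted[p] for p in papers)
    author_sizes = Counter(truth[p] for p in papers)
    acp = sum(c * c / cluster_sizes[cl] for (cl, _), c in overlap.items()) / n
    aap = sum(c * c / author_sizes[au] for (_, au), c in overlap.items()) / n
    return sqrt(acp * aap)


# Hypothetical data: four papers, three by author A and one by author B.
papers = ["p1", "p2", "p3", "p4"]
truth = {"p1": "A", "p2": "A", "p3": "A", "p4": "B"}

# Positive pairwise predictions, e.g. from a classifier; p3-p4 is wrong.
predicted = cluster_pairs(papers, [("p1", "p2"), ("p3", "p4")])
score = k_metric(predicted, truth)
```

In this toy case the profile of author A is split and one cluster mixes two authors, so both purity terms fall below 1 and the score is penalized from both sides. A perfect clustering yields a K-metric of 1.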
Keywords: Author name disambiguation, Machine learning, Pairwise classification, Random forest, Community detection, Web of Science
Review history: Received 3 October 2020, Revised 4 April 2021, Accepted 13 April 2021, Available online 11 May 2021, Version of Record 11 May 2021.
Official URL: https://doi.org/10.1016/j.joi.2021.101166