Biomedical-domain pre-trained language model for extractive summarization
Abstract
In recent years, the performance of deep neural networks on extractive summarization has improved significantly over traditional methods. However, in biomedical extractive summarization, existing methods cannot make good use of domain-aware external knowledge, and existing deep neural network models ignore the structural features of documents. In this paper, we propose a novel model called BioBERTSum to better capture token-level and sentence-level contextual representations. It uses a domain-aware bidirectional language model pre-trained on large-scale biomedical corpora as the encoder, and further fine-tunes this language model for the extractive summarization task on single biomedical documents. In particular, we adopt a sentence position embedding mechanism, which enables the model to learn the position of each sentence and thus capture the structural features of the document. To the best of our knowledge, this is the first work to use a pre-trained language model with a fine-tuning strategy for extractive summarization in the biomedical domain. Experiments on the PubMed dataset show that our proposed model outperforms the recent state-of-the-art (SOTA) model in terms of ROUGE-1/2/L.
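The abstract describes the general recipe: a biomedical pre-trained encoder produces a vector for each sentence, a learned sentence position embedding injects where the sentence sits in the document, and a classifier scores each sentence for inclusion in the summary. The sketch below illustrates this style of architecture only; it is not the authors' code, and the BioBERT checkpoint name, the [CLS]-per-sentence input layout, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's implementation) of an extractive
# summarizer built on a biomedical BERT encoder with sentence position embeddings.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class BioBertExtSum(nn.Module):
    def __init__(self, encoder_name="dmis-lab/biobert-base-cased-v1.1",
                 max_sentences=64):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Learned sentence position embedding: gives the classifier access to
        # where a sentence occurs in the document (a structural feature).
        self.sent_pos_emb = nn.Embedding(max_sentences, hidden)
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask, cls_positions):
        # cls_positions: (batch, n_sents) indices of the [CLS] token placed
        # before each sentence in the flattened document.
        token_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                   # (B, T, H)
        batch_idx = torch.arange(input_ids.size(0)).unsqueeze(-1)
        sent_vecs = token_states[batch_idx, cls_positions]    # (B, S, H)
        pos_ids = torch.arange(cls_positions.size(1), device=input_ids.device)
        sent_vecs = sent_vecs + self.sent_pos_emb(pos_ids)    # add position info
        # One sigmoid score per sentence: probability it belongs in the summary.
        return torch.sigmoid(self.classifier(sent_vecs)).squeeze(-1)


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
    sents = ["Aspirin inhibits platelet aggregation.",
             "We review its use in cardiovascular prevention."]
    # Prepend a [CLS] to every sentence so each sentence gets its own vector.
    text = " ".join(f"{tokenizer.cls_token} {s} {tokenizer.sep_token}" for s in sents)
    enc = tokenizer(text, return_tensors="pt", add_special_tokens=False)
    cls_positions = (enc["input_ids"][0] == tokenizer.cls_token_id).nonzero().T
    model = BioBertExtSum()
    scores = model(enc["input_ids"], enc["attention_mask"], cls_positions)
    print(scores)  # per-sentence extraction scores
```

In practice the scores would be trained with a binary cross-entropy loss against oracle sentence labels, and the top-scoring sentences would be concatenated to form the extractive summary.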
Keywords: Extractive biomedical summarization, Document representation, Pre-trained language model, Fine-tuning
Article history: Received 30 November 2019, Revised 25 March 2020, Accepted 23 April 2020, Available online 30 April 2020, Version of Record 4 May 2020.
DOI: https://doi.org/10.1016/j.knosys.2020.105964