Temporal expression extraction with extensive feature type selection and a posteriori label adjustment
作者:
Highlights:
•
摘要
The automatic extraction of temporal information from written texts is pivotal for many Natural Language Processing applications such as question answering, text summarisation and information retrieval. It allows to filter information and infer temporal flows of events.This paper presents ManTIME, a general domain temporal expression identification and normalisation system, and systematically explores the impact of different features and training corpora on the performance. The identification phase combines the use of conditional random fields along with a post-processing pipeline, whereas the normalisation phase is carried out using NorMA, an open-source rule-based temporal normaliser.We investigate the performance variation with respect to different feature types. Specifically, we show that the use of WordNet-based features in the identification task negatively affects the overall performance, and that there is no statistically significant difference in the results based on gazetteers, shallow parsing and propositional noun phrases labels on top of the morpho-lexical features. We also show that the use of silver data (alone or in addition to the human-annotated ones) does not improve the performance.We evaluate six combinations of training data and post-processing pipeline with respect to the TempEval-3 benchmark test set. The best run achieved 0.95 (precision), 0.85 (recall) and 0.90 (Fβ=1) in the identification phase. Normalisation accuracies are 0.86 (for type attribute) and 0.77 (for value attribute).The proposed approach ranked 3rd in the TempEval-3 challenge (task A) as the best performing machine learning-based system among 21 participants.
论文关键词:Text mining,Mining methods and algorithms,Data mining
论文评审过程:Received 26 August 2013, Revised 13 July 2015, Accepted 5 September 2015, Available online 25 September 2015, Version of Record 10 November 2015.
论文官网地址:https://doi.org/10.1016/j.datak.2015.09.002