Experiments in discourse analysis impact on information classification and retrieval algorithms
作者:
Highlights:
•
摘要
Researchers in indexing and retrieval systems have been advocating the inclusion of more contextual information to improve results. The proliferation of full-text databases and advances in computer storage capacity have made it possible to carry out text analysis by means of linguistic and extra-linguistic knowledge. Since the mid 80s, research has tended to pay more attention to context, giving discourse analysis a more central role. The research presented in this paper aims to check whether discourse variables have an impact on modern information retrieval and classification algorithms. In order to evaluate this hypothesis, a functional framework for information analysis in an automated environment has been proposed, where the n-grams (filtering) and the k-means and Chen’s classification algorithms have been tested against sub-collections of documents based on the following discourse variables: “Genre”, “Register”, “Domain terminology”, and “Document structure”. The results obtained with the algorithms for the different sub-collections were compared to the MeSH information structure. These demonstrate that n-grams does not appear to have a clear dependence on discourse variables, though the k-means classification algorithm does, but only on domain terminology and document structure, and finally Chen’s algorithm has a clear dependence on all of the discourse variables. This information could be used to design better classification algorithms, where discourse variables should be taken into account. Other minor conclusions drawn from these results are also presented.
论文关键词:Discourse-model,Context-analysis,Computational-linguistics,Text-analysis-methods,Filtering,n-grams,k-means,Co-wording
论文评审过程:Received 1 February 2002, Accepted 13 August 2002, Available online 19 December 2002.
论文官网地址:https://doi.org/10.1016/S0306-4573(02)00081-X