A fast method of determining weighted compound keywords from text databases

Abstract

In document management systems, freely coined compound words can serve as keyword candidates. Keywords can be constructed according to two kinds of criteria: individual words and sequences of words. Which criterion is used depends on the keyword extraction system. Because extraction involves many operations that append, separate, or compare keyword candidates, it is important to have an efficient method that extracts keywords together with information about the relationships among them. This paper presents a technique for storing compound keywords together with information about both their short component (SC) keywords and their long component (LC) keywords by extending the Aho and Corasick (AC) string pattern matching machine for a finite number of keywords. Theoretical analysis verifies that the total cost of the extended AC machine is O(n+k), compared with O(3n) for the original AC machine, where n is the sum of the lengths of the keywords and k is the number of keywords. Simulation results for 38 Japanese text files show that the extended AC machine is about three to six times faster than the original AC machine in SC and LC keyword processing.
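As a rough illustration of the baseline the abstract refers to, the sketch below builds a standard AC machine (goto, failure, and output tables) and then runs each keyword through it to record which other keywords it contains. It is not the extended machine proposed in the paper. The function names `build_ac`, `find_all`, and `component_relations` are hypothetical, and reading "short component" and "long component" as substring containment between keywords is an assumption made only for this example.

```python
from collections import deque


def build_ac(keywords):
    """Build goto, failure, and output tables of a standard AC machine."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link of each state
    out = [set()]    # keywords recognised on entering each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit keywords ending here
    return goto, fail, out


def find_all(text, machine):
    """Return (start_index, keyword) for every keyword occurrence in text."""
    goto, fail, out = machine
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits


def component_relations(keywords):
    """Assumed reading: a keyword's short components are the other keywords
    it contains; its long components are the keywords that contain it."""
    machine = build_ac(keywords)
    shorter = {w: set() for w in keywords}
    longer = {w: set() for w in keywords}
    for word in keywords:
        for _, part in find_all(word, machine):
            if part != word:
                shorter[word].add(part)
                longer[part].add(word)
    return shorter, longer
```

For a keyword set such as `["information", "retrieval", "information retrieval"]`, calling `component_relations` would list "information" and "retrieval" as contained in "information retrieval". This baseline scans every keyword again after the machine has been built, which is the kind of repeated work the abstract says the extended machine is designed to reduce.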

Review history: Received 1 February 1997, Accepted 1 January 1998, Available online 21 October 1998.

Official URL: https://doi.org/10.1016/S0306-4573(98)00012-0