Tos: A text organizing system

Abstract

This paper reports research undertaken to conceptualize, design and implement a system for the automatic indexing, classification and repositing of text items. A text item may be any aggregate of information in the English language on a computer-readable medium, in a standard format.

The ultimate goal of the research reported here is to devise fully automatic processes that read text items and then index, classify and reposit them for subsequent search and retrieval. Only portions of the path to this goal have been made fully automatic. These portions consist of automatic processes for:

(1) Analyzing the text items and assigning candidate index terms to the items;
(2) Generating and assigning candidate index phrases to the items;
(3) Discriminating and rejecting candidate index terms determined to be ineffective in forming a classification automatically; and
(4) Generating a classification system and repositing the text items in accordance with this system.

To complete the process, some degree of interactive user involvement is incorporated in the system, particularly for discriminating the index terms that do not contribute to a satisfactory classification. Based on various automatically derived reports, the user can guide the system to search systematically for terms that do not help, and may even hamper, the subsequent classification and information retrieval, until the performance of the system is judged to be adequate.
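Steps (1) and (2) can be illustrated with a minimal Python sketch. This is not the system's actual method, which the abstract does not describe and which was written in FORTRAN IV. The stop list, the tokenization rule, and the choice of adjacent word pairs as phrases are all assumptions made for illustration, and every function name is hypothetical.

```python
import re

# Hypothetical stop list; the abstract does not say which words are excluded.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def tokenize(text):
    """Split text into lowercase alphabetic words (an assumed tokenization)."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_terms(text):
    """Step (1), sketched: every non-stop word is a candidate index term."""
    return [w for w in tokenize(text) if w not in STOP_WORDS]


def candidate_phrases(text):
    """Step (2), sketched: adjacent pairs of non-stop words are candidate phrases."""
    words = tokenize(text)
    return [
        f"{a} {b}"
        for a, b in zip(words, words[1:])
        if a not in STOP_WORDS and b not in STOP_WORDS
    ]


def index_items(items):
    """Map each item id to its set of candidate terms and phrases."""
    return {
        item_id: set(candidate_terms(text)) | set(candidate_phrases(text))
        for item_id, text in items.items()
    }
```

For example, `index_items({"d1": "The indexing of text items"})` assigns the terms `indexing`, `text` and `items` and the phrase `text items` to item `d1`. Steps (3) and (4) would then operate on such an index.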
The specific achievements of the reported research are stated below:

(1) System interactiveness;
(2) Automatic index phrase recognition;
(3) Summary report, informing the user of the impact of user-elected decisions to delete terms on a mass basis, and giving the percentage reduction in index term vocabulary size or in the average number of index terms per item resulting from such mass term deletions;
(4) Affinity dictionary, giving the user the ability to locate synonymous or near-synonymous index terms;
(5) Use of classification processes in discriminating unsuitable index terms;
(6) An integrated automatic indexing and classification system; and
(7) Successful automatic indexing and classification of a textual database.

The system has been adequately documented (including a user guide) and tested for its reliability and dependability.

The research was conducted in the Moore School of Electrical Engineering, University of Pennsylvania, and utilized the UNIVAC Spectra 70/46 computer, operating with the Univac VMOS and DMS. The system has been implemented in the Univac version of FORTRAN IV.
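The two figures named in the summary report of achievement (3) can be computed as in the following Python sketch. It assumes the index is held as a mapping from item id to a set of index terms. That representation and the function names are hypothetical and not taken from the original system.

```python
def percent_reduction(before, after):
    """Percentage by which a quantity shrank; 0.0 if it was already zero."""
    return 0.0 if before == 0 else 100.0 * (before - after) / before


def deletion_summary(index, terms_to_delete):
    """Report the effect of deleting a batch of terms from every item.

    index: dict mapping item id -> set of index terms (assumed layout).
    terms_to_delete: set of terms the user has elected to delete.
    """
    reduced = {item: terms - terms_to_delete for item, terms in index.items()}

    # Vocabulary size is the number of distinct terms across all items.
    vocab_before = len(set().union(*index.values()))
    vocab_after = len(set().union(*reduced.values()))

    # Average number of index terms per item, before and after deletion.
    n_items = len(index)
    avg_before = sum(len(t) for t in index.values()) / n_items if n_items else 0.0
    avg_after = sum(len(t) for t in reduced.values()) / n_items if n_items else 0.0

    return {
        "vocabulary_reduction_pct": percent_reduction(vocab_before, vocab_after),
        "avg_terms_per_item_reduction_pct": percent_reduction(avg_before, avg_after),
    }
```

With two items indexed as `{"text", "system", "index"}` and `{"system", "retrieval"}`, deleting `system` shrinks the vocabulary from 4 terms to 3 (a 25% reduction) and the average number of terms per item from 2.5 to 1.5 (a 40% reduction).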