The use of trigram analysis for spelling error detection

Authors:

Abstract

This paper describes work performed under the SPElling Error Detection COrrection Project (SPEEDCOP), supported by the National Science Foundation (NSF) at Chemical Abstracts Service (CAS), to devise effective automatic methods of detecting and correcting misspellings in scholarly and scientific text. The investigation was applied to 50,000 word/misspelling pairs collected from six datasets: Chemical Industry Notes (CIN), Biological Abstracts (BA), Chemical Abstracts (CA), American Chemical Society primary journal keyboarding (ACS), Information Science Abstracts (ISA), and Distributed On-Line Editing (DOLE), a CAS internal dataset especially suited to spelling error studies. The purpose of this study was to determine the utility of trigram analysis in the automatic detection and/or correction of misspellings. Computer programs were developed to collect data on trigram distribution in each dataset and to explore the potential of trigram analysis for detecting spelling errors, verifying correctly spelled words, locating the error site within a misspelling, and distinguishing between the basic kinds of spelling errors. The results of the trigram analysis were largely independent of the dataset to which it was applied, but trigram compositions varied with the dataset. The trigram analysis technique developed determined the error site within a misspelling accurately, but did not distinguish effectively between different error types or between valid words and misspellings. However, methods for increasing its accuracy are suggested.
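
To make the idea of trigram-based error-site location concrete, the following is a minimal illustrative sketch and not the SPEEDCOP programs themselves, whose details are not given in the abstract. It assumes a small reference word list, pads each word with a space on both sides, and treats any trigram absent from the reference list as suspicious. The function names (`trigrams`, `build_trigram_counts`, `suspect_position`) and the toy vocabulary are hypothetical.

```python
from collections import Counter


def trigrams(word):
    """Return the overlapping letter trigrams of a word, padded with one space per side."""
    padded = f" {word.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_trigram_counts(words):
    """Count how often each trigram occurs across a reference word list."""
    counts = Counter()
    for w in words:
        counts.update(trigrams(w))
    return counts


def suspect_position(word, counts):
    """Estimate the error site as the middle of the unseen trigrams.

    With one space of padding, trigram i is centred on letter i of the word,
    so the returned value is an index into the word. Returns None when every
    trigram of the word occurs in the reference counts.
    """
    unseen = [i for i, g in enumerate(trigrams(word)) if counts[g] == 0]
    if not unseen:
        return None
    return unseen[len(unseen) // 2]


# Toy reference list (an assumption for illustration, not data from the study).
reference_words = ["chemical", "abstract", "service", "detection", "correction", "spelling"]
counts = build_trigram_counts(reference_words)

print(suspect_position("chemjcal", counts))  # 4, the index of the substituted letter "j"
print(suspect_position("chemical", counts))  # None, every trigram is known
```

In this sketch a valid word that is missing from the reference list would also produce unseen trigrams, which is one simple way to see why trigram evidence alone can locate an error site yet still struggle to separate valid words from misspellings.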

Keywords:

Review history: Received 18 June 1981, Available online 18 July 2002.

Official paper URL: https://doi.org/10.1016/0306-4573(81)90044-3