Image analysis – Editing, error checking, or correction – Correcting alphanumeric recognition errors
Reexamination Certificate
1998-02-05
2001-03-20
Tran, Phuoc (Department: 2721)
Image analysis
Editing, error checking, or correction
Correcting alphanumeric recognition errors
C382S187000, C382S189000, C382S309000, C704S251000
active
06205261
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention is directed to a method and system for correcting misrecognized words in electronic documents that have been produced by an optical character recognition system that scans text appearing on a physical medium, and in particular, to a method and system that relies on a plurality of confusion sets to select a replacement word for each misrecognized word in the document.
Devices that are used in conjunction with optical character recognition (“OCR”) techniques have been in use for some time. Examples of such devices are optical scanners and facsimile machines. Common to both types of devices is that each scans a physical document bearing printed or handwritten characters in order to produce an electronic image of the original document. The output image is then supplied to a computer or other processing device, which executes an OCR algorithm on the scanned image. The purpose of the OCR algorithm is to produce an electronic document comprising a collection of recognized words that are capable of being edited. The electronic document may be formatted for any one of a plurality of well-known applications. For example, if the recognized words are to be displayed on a computer monitor, they may be displayed as a Microsoft WORD® document, a WORDPERFECT® document, or any other text-based document. Regardless of how the recognized words of the electronic document are formatted, the recognized words are intended to correspond exactly, in spelling and in arrangement, to the words printed on the original document.
Such exact correspondence, however, does not always occur; as a result, the electronic document may include misrecognized words that never appeared in the original document. For purposes of this discussion, the term “word” covers any set of characters, whether or not the set of characters corresponds to an actual word of a language. Moreover, the term “word” covers sets of characters that include not only letters of the alphabet, but also numbers, punctuation marks, and such typographic symbols as “$”, “&”, “#”, etc. Thus, a misrecognized word may comprise a set of characters that does not form an actual word, or a misrecognized word may comprise an actual word that does not have the same spelling as that of the corresponding word in the scanned document. For example, the word “got” may be misrecognized as the non-existent word “qot”, or the word “eat” may be misrecognized as “cat.” Such misrecognized words, whether they comprise a real word or a mere aggregation of characters, may be quite close in spelling to the words of the original document that they were intended to match. Such misrecognition errors are largely due to the physical similarities between certain characters. For example, as in the “got” example above, such errors may occur when the letter “g” is confused with the physically similar letter “q”. Another common error that OCR algorithms make is confusing the letter “d” with the two-letter combination “ol.” The physical resemblance of certain characters is not the only cause of recognition errors, however. For example, the scanning device may include a faulty optical system or a defective charge-coupled device (CCD); the original document may be printed in a hard-to-scan font; or the original document may include scribbles and marks that obscure the actual text.
Certain techniques have been implemented to detect and correct such misrecognition errors. For example, if the electronic document containing the recognized words is formatted in a word processing application, a user viewing the document may use the spell checking function provided by the word processing application to correct any words that have been misspelled. Some of these word processing applications also provide a grammar checker, which would identify words that, although spelled correctly, do not belong in the particular sentences in which they appear.
A drawback to these techniques is that the user must apply them manually, because spell checkers and grammar checkers operate by displaying to the user a list of possible words that may include the correct word. By manipulating an appropriate sequence of keys or other data input means, the user must select from this list what he believes to be the correct word and issue the appropriate commands for replacing the misrecognized word with the selected word. Such a correction technique is time-consuming and, moreover, is prone to human error because, in carrying out such operations, the user may inadvertently select an inappropriate word to replace the misrecognized word. What is therefore needed is a correction technique that automatically replaces each misrecognized word with the word most likely matching the corresponding word in the original document. Such a correction technique would not require user intervention.
SUMMARY OF THE INVENTION
In order to overcome the above-mentioned disadvantages found in previous techniques for correcting misrecognized words, the present invention is directed to a method and apparatus that automatically substitutes each misrecognized word with a dynamically generated replacement word that has been determined to be the most likely correct word for replacing the misrecognized word. The recognized words may be based on words appearing on a physical medium (e.g., a sheet of paper) that has been optically scanned. The present invention determines whether each recognized word is correct by executing a spell checking algorithm, a grammar checking algorithm, a natural language algorithm, or any combination thereof. For each incorrect recognized (i.e., misrecognized) word, the present invention generates at least one reference word; the misrecognized word is replaced by one of the reference words. In order to determine which reference word is to replace the misrecognized word, the present invention provides a plurality of confusion sets. Each confusion set includes constituent elements, otherwise referred to as character members, which correspond not only to individual characters, but also to multi-character combinations. The purpose of these confusion sets is to group together those character members having a relatively high probability of being confused with each other by the OCR application. The confusion sets are generated such that characters or character combinations from different confusion sets have a relatively low probability of being confused with each other. The determination of which characters should be grouped together is based on the recognition probabilities arranged in a confusion matrix.
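The grouping of character members into confusion sets can be illustrated with a short sketch. The text above states only that the grouping is based on the recognition probabilities of a confusion matrix; it does not say how the sets are derived. The sketch below therefore assumes one simple possibility: any two character members whose confusion probability meets a threshold are placed in the same set, and sets are merged transitively. The function name `build_confusion_sets`, the threshold, and the probabilities are hypothetical and chosen only for illustration.

```python
def build_confusion_sets(confusion, threshold):
    """Group character members into confusion sets.

    `confusion` maps (printed, recognized) pairs to a probability.
    ASSUMPTION: members are merged into one set whenever their confusion
    probability is at least `threshold`; the source does not specify this rule.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path compression.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), p in confusion.items():
        root_a, root_b = find(a), find(b)  # registers both members
        if a != b and p >= threshold:
            parent[root_a] = root_b        # merge the two sets

    groups = {}
    for member in list(parent):
        groups.setdefault(find(member), set()).add(member)
    # Sort for a deterministic result.
    return sorted(groups.values(), key=lambda g: sorted(g))


# Hypothetical probabilities, for illustration only.
example_matrix = {
    ("g", "q"): 0.20,
    ("q", "g"): 0.15,
    ("d", "ol"): 0.10,
    ("g", "d"): 0.001,
    ("x", "x"): 0.99,
}
example_sets = build_confusion_sets(example_matrix, threshold=0.05)
```

With these made-up values, “g” and “q” fall into one set and “d” and “ol” into another, while the low-probability pair “g”/“d” stays in different sets, which matches the stated goal that members of different sets are rarely confused with each other.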
Based on the provided confusion sets, the present invention compares each character sequence of the misrecognized word with a corresponding character sequence of each reference word to determine which corresponding character sequences do not include the same character members. If a character sequence of a reference word includes a character member that is different from the character member in the corresponding character sequence of the misrecognized word, then that reference word will be eliminated from further consideration if the differing character members in the misrecognized word and reference word are not from the same confusion set. The remaining non-eliminated reference words are referred to collectively as a set of candidate reference words; the present invention reduces the set of candidate reference words to a single reference word in accordance with a set of predetermined criteria and then replaces the misrecognized word with the reference word remaining in the set of candidate reference words.
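The elimination step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the confusion sets are hypothetical, and the way multi-character members such as “ol” are aligned against single characters (a memoized recursive search over all possible alignments) is an assumption, since the text does not specify an alignment procedure. The predetermined criteria for reducing the candidate set to a single word are likewise not specified here and are not modeled.

```python
from functools import lru_cache

# Hypothetical confusion sets, for illustration only.
CONFUSION_SETS = [
    {"g", "q"},
    {"d", "ol"},
    {"c", "e", "o"},
]
MEMBERS = sorted({m for s in CONFUSION_SETS for m in s})


def same_set(a, b):
    """True if both character members belong to one confusion set."""
    return any(a in s and b in s for s in CONFUSION_SETS)


def is_candidate(misrecognized, reference):
    """Keep a reference word only if every differing character member
    comes from the same confusion set as its counterpart."""

    @lru_cache(maxsize=None)
    def match(i, j):
        if i == len(misrecognized) and j == len(reference):
            return True
        if i == len(misrecognized) or j == len(reference):
            return False
        # Identical characters never eliminate a reference word.
        if misrecognized[i] == reference[j] and match(i + 1, j + 1):
            return True
        # Differing members must share a confusion set (ASSUMED alignment).
        for a in MEMBERS:
            if not misrecognized.startswith(a, i):
                continue
            for b in MEMBERS:
                if a != b and reference.startswith(b, j) and same_set(a, b):
                    if match(i + len(a), j + len(b)):
                        return True
        return False

    return match(0, 0)


def candidate_reference_words(misrecognized, reference_words):
    """Return the non-eliminated reference words, in input order."""
    return [w for w in reference_words if is_candidate(misrecognized, w)]
```

For example, under these hypothetical sets, `candidate_reference_words("qot", ["got", "pot", "cot", "goat"])` keeps only “got”, because “q” and “g” share a confusion set while “q” shares no set with “p” or “c”, and “goat” cannot be aligned with “qot” at all.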
AT&T Corp.
Kenyon & Kenyon
Mariam Daniel G.
Tran Phuoc