Character matching process for text converted from images

Image analysis – Pattern recognition – Context analysis or word recognition

Reexamination Certificate

Details

C382S310000, C382S311000

active

06668085

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to the field of electronic text processing of text derived from optical character recognition (OCR) or intelligent character recognition (ICR) devices. More specifically, this invention relates to a method and apparatus for generating reasonably possible electronic text from a text containing possible errors derived from images by optical character recognition or intelligent character recognition, and to selecting the correct text from the set of possible texts.
2. Discussion of Related Art
Document processing systems employing optical character recognition (OCR) and intelligent character recognition (ICR) devices for scanning and storing the contents of documents are well known in the art. In a typical document processing system of this nature, documents are fed into a transport scanning device which serially scans each document, stores the data and passes the document to other devices for further processing. The scanned image of each document is converted into a bit-map, i.e., digitized image data, of the entire document. The bit-mapped image data is then transmitted to a character recognition engine where the image data is analyzed in an attempt to convert desired portions of the image data into discrete electronic text characters through character recognition. If the data is successfully recognized as one or more alphanumeric characters, it is transformed into discrete alphanumeric characters for storage and future processing. For example, data thus converted into the alphanumeric characters can be stored in a conventional computer database for future access and/or electronic processing without the need to further physically handle the original documents.
Document processors employing OCR and ICR devices have been utilized to facilitate processing of pre-formatted business forms with some degree of success. For example, such processors are currently used to read information printed on checks. Automated scanning and processing of checks is advantageous because the information on checks is contained within one or more discrete fields and all of the data to be scanned is of the same type, i.e., all numerals.
However, while the use of such document processors has long offered the potential for significantly reducing costly manual information processing, in practice, OCR and ICR document processors have only enjoyed limited application because they are prone to yield inaccurate results. Restated, the full benefits of wholly automated information processing have heretofore been significantly limited by the ability of OCR and ICR based document processors to accurately recognize the data contained on the above-mentioned forms.
In particular, the OCR and ICR art has continued to struggle with the problem of automated recognition of handwritten data and data of mixed alphanumeric characters.
Accurate recognition of handwriting has proven to be a particularly elusive goal due to the unconstrained nature of handwriting and the large variety of handwriting styles. Thus, character recognition errors continue to severely limit the utility of document processors employing optical character recognition devices where the information to be processed has been handwritten on documents. The main errors that occur in processing are substitution errors, which occur when a given character being analyzed is incorrectly identified as one or more other characters. Substitution errors include (1) incorrect identification of a single character as a different character; (2) incorrect identification of a single character as multiple characters; and (3) incorrect identification of multiple characters as a single character. Because the recognition device always yields some data when a substitution error occurs, substitution errors can be difficult to detect.
Methods that attempt to correct such errors are known, but these methods are extremely limited and require excessive amounts of human intervention. First, errors are typically checked for on a one-to-one character replacement basis only. Substitutions such as one-to-many characters, many-to-one characters and one-to-none characters are not checked, which severely limits the ability of the method to arrive at the correct text.
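The substitution classes named above can be illustrated with a minimal sketch. It is not the method of any cited patent; the `CONFUSIONS` table and the `candidates` function are hypothetical names introduced only for illustration, and the table entries are assumed examples of common misreadings. Each key is a fragment as read by the recognizer, and each value lists fragments it may actually stand for.

```python
from functools import lru_cache

# Hypothetical confusion table (an assumption for illustration only).
# Key: fragment as read by the recognizer. Value: what it may stand for.
CONFUSIONS = {
    "0": ("O",),       # one-to-one
    "1": ("l", "I"),   # one-to-one
    "m": ("rn",),      # one-to-many: one read character replaced by two
    "cl": ("d",),      # many-to-one: two read characters replaced by one
    "'": ("",),        # one-to-none: a spurious mark is dropped
}

@lru_cache(maxsize=None)
def candidates(text):
    """Return every text reachable by scanning left to right and, at each
    position, either keeping the character as read or replacing a fragment
    that starts there. Replacement output is not substituted again."""
    if not text:
        return frozenset({""})
    results = set()
    # Option 1: keep the first character as read.
    for tail in candidates(text[1:]):
        results.add(text[0] + tail)
    # Option 2: replace a read fragment that starts at this position.
    for read, alternatives in CONFUSIONS.items():
        if text.startswith(read):
            for tail in candidates(text[len(read):]):
                for alt in alternatives:
                    results.add(alt + tail)
    return frozenset(results)

print(sorted(candidates("cl0")))
```

Because replacements may be shorter or longer than the fragment they replace, the resulting candidate texts vary in length, which a purely one-to-one check cannot produce.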
Further, the correction methods typically involve querying a user for correction of the error, often presenting the user with an image of the error along with a set of possible corrections derived from a standard dictionary database. See, for example, U.S. Pat. No. 6,005,973, describing a method in which the process gathers the most likely character sequences associated with the error, and presents the results of the method to the user for selection of the correct character sequence.
Sometimes the dictionary database is able to correct the error to the correct text based upon a high level of confidence that it could be the only correction possible. This is only the case, however, when the converted text contains only alphabetical text that would be found in the dictionary. Where the text contains mixed alphabetical and numerical text, for example as might be found with part numbers, product codes, etc., the query to the dictionary always fails, and this prior art methodology is thus inadequate to deal with such text without frequent human intervention. Presenting each error to a human operator for correction, however, makes the process extremely expensive and time consuming.
U.S. Pat. No. 5,850,480 describes methods of correcting optical character recognition errors occurring during recognition of character sequences contained within one or more predetermined types of character fields. The methods may be practiced with a document processing system having (1) an optical character recognition device for scanning documents and outputting bit-map image data; (2) a recognition engine for converting the bit-map image data into possibly correct alphanumeric characters with associated confidence values; and (3) at least one lexicon of character sequences consisting of a list of at least a portion of all of the possible character sequence values for each of the fields being processed. OCR errors are corrected by performing a contextual comparison analysis between the alphanumeric characters outputted from the recognition engine and the lexicon of character sequences. However, this method is designed to work only with specific types of text entered into specific fields of a form (for example, address fields), examines letters and numbers separately rather than as mixed alphanumeric text, and requires assignment of confidence levels to order the possible texts for selection by a user.
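The general idea of comparing recognizer output against a field lexicon can be sketched as follows. This is only an assumed, simplified illustration and does not reproduce the method of U.S. Pat. No. 5,850,480; the function name `match_against_lexicon`, the sample lexicon, and the confidence values are all hypothetical.

```python
def match_against_lexicon(scored_candidates, lexicon):
    """Keep only candidates that appear in the lexicon and order them
    from highest to lowest confidence."""
    matches = [(text, conf) for text, conf in scored_candidates if text in lexicon]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)

# Hypothetical lexicon of valid values for one form field.
lexicon = {"AB10", "AB1O", "XZ77"}

# Hypothetical recognizer output: (candidate text, confidence value).
scored = [("ABI0", 0.61), ("AB10", 0.58), ("AB1O", 0.40)]

ranked = match_against_lexicon(scored, lexicon)
print(ranked)
```

In this sketch the highest-confidence reading, "ABI0", is discarded because it is not a valid field value, while two lexicon entries remain and would still have to be presented to a user for selection.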
Thus, there exists a need in the art for OCR error correction methods and apparatus capable of enhancing the accuracy of optical character recognition of machine-print and hand-print, particularly print of mixed alphanumeric characters, requiring a reduced level of human intervention for correction.
SUMMARY OF THE INVENTION
It is therefore an object of the present invention to provide an improved method, and apparatus for conducting the method, for generating versions of reasonably possible text given the text version with errors from ICR/OCR devices, particularly of text that may be of mixed alphanumeric type. It is a still further object of the present invention to conduct the method so as to reduce the amount of human intervention required in correcting converted text with errors to correct text.
It is another object of the present invention to provide a method of deriving a set of possible correct texts from converted text with errors, and apparatus for conducting the method, in which the character substitutions examined by the method are not limited to one-to-one character substitutions but also include, for example, one-to-many, many-to-many, many-to-one and one-to-none character substitutions so that the set of possible correct texts includes a larger number of possible texts of varying lengths, and thus is more
