Image analysis – Pattern recognition – Context analysis or word recognition
Reexamination Certificate
1998-03-12
2001-07-31
Boudreau, Leo (Department: 2621)
C382S177000, C382S231000, C382S305000, C382S306000, C382S309000, C382S310000, C704S251000, C707S793000
active
06269188
ABSTRACT:
TECHNICAL FIELD
This invention pertains to the field of data storage and filing systems, more specifically, to those systems employing optical character recognition.
BACKGROUND ART
The field of document imaging is growing rapidly as modern society becomes more and more digital. Documents are stored in digital format on databases, providing instantaneous access, minimal physical storage space, and secure storage. Society now faces the question of how best to transfer its paper documents into the digital medium.
The most popular method of digitizing paper documents uses a system comprising a scanner and a computer. The paper documents are fed into the scanner, which creates a bitmap image of each paper document. This bitmap image is then stored in the computer. The computer can take a variety of forms, including a single personal computer (PC) or a network of computers using a central storage device. The bitmapped images must be retrievable after they are stored. One system for filing and retrieving documents provides a user interface which allows a user to type in a search term to retrieve documents containing the search term. Preferably, the system allows the user to type in any word that the user remembers is contained within the desired document to retrieve the desired document. However, to retrieve documents on this basis, the document must be character recognized. That is, the computer must recognize characters within the bitmapped image created by the scanner.
Another common use of digitization is to convert long paper documents so that the computer can search their text. In this usage, a user types in the key word the user is looking for within the document, and the system must match the search term with words found within the document. For these systems, the document must be character recognized as well.
The most common method of recognizing characters is by using an optical character recognition (OCR) technique. An optical character recognition technique extracts character information from the bitmapped image. There are many different types of optical character recognition techniques. Each has its own strengths and weaknesses. For example, OCR1 may recognize handwriting particularly accurately. OCR2 may recognize the Courier font well. If OCR1 is used to recognize a document in Courier font, it may still recognize the majority of the characters in the document. However, it may recognize many of the characters inaccurately. A user may not know of an OCR's strengths and weaknesses. A user may not know whether or not the types of documents the user typically generates are of the kind that are accurately recognized by the OCR present on the user's system. Current systems do not inform the user of the quality of the recognition of the OCR technique. The user finds out how accurate the recognition was only by using the document for the purpose for which it was stored into the computer system, at which time it may be too late to correct.
An inaccurately recognized document can lead to several problems. First of all, in a system in which documents are stored and retrieved based on their contents, an inaccurately recognized document may become impossible to retrieve. For example, if a user believes the word “imaging” is in a specific document, the user will type in “imaging” as the search term. However, if the word “imaging” is recognized incorrectly, such that it was recognized as “emerging,” the user's search will not retrieve the desired document. The user may not remember any other words in the document, and thus the document is unretrievable. In a system where documents are digitized to allow text searching of the document, the same problem occurs. Misrecognized words are not found by the use of the correct search terms.
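The retrieval failure described above can be illustrated with a minimal sketch. It assumes a simple word-to-document index built from recognized text; the function names `build_index` and `search`, and the two sample documents, are hypothetical and are not taken from the patent.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            word = token.strip('.,;:!?"\'')
            if word:
                index[word].add(doc_id)
    return index

def search(index, term):
    """Return the sorted ids of documents indexed under the search term."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical recognized text: in document "b" the OCR read "imaging" as "emerging".
recognized = {
    "a": "The field of document imaging is growing rapidly.",
    "b": "The field of document emerging is growing rapidly.",
}
index = build_index(recognized)

hits_for_correct_term = search(index, "imaging")    # document "b" is missing
hits_for_misread_term = search(index, "emerging")   # "b" is only reachable this way
```

Document "b" can only be found through the misrecognized word, which the user has no reason to type.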
Thus, there is a need to allow the user to determine whether a recognized word is of acceptable quality. By allowing the user to determine whether a word is of acceptable quality, the user can ensure that the document is retrieved by the use of that word as a search term. Also, a user can ensure that words within the document are accurately recognized for internal document searching purposes. Additionally, in a system with multiple optical character recognition techniques, there is a need to be able to compare the accuracy of the different versions of the document to create a version that is the most accurate.
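One way to picture the comparison of several recognized versions is sketched below. It assumes that each OCR technique reports a per-word accuracy value and that the two outputs are already aligned word for word; both assumptions, and the name `merge_versions`, are illustrative and are not stated in the text above.

```python
def merge_versions(version_a, version_b):
    """Pick, position by position, the word with the higher accuracy value.

    Each version is a list of (word, accuracy) pairs. This sketch assumes the
    two lists are aligned, which real OCR outputs generally are not.
    """
    if len(version_a) != len(version_b):
        raise ValueError("this sketch only handles word-aligned outputs")
    return [a if a[1] >= b[1] else b for a, b in zip(version_a, version_b)]

# Hypothetical outputs of two OCR techniques for the same two-word image.
ocr1_words = [("document", 0.95), ("emerging", 0.40)]
ocr2_words = [("docurnent", 0.60), ("imaging", 0.90)]

merged = merge_versions(ocr1_words, ocr2_words)
merged_text = " ".join(word for word, _ in merged)
```

Ties go to the first version here; a real system would need an explicit tie-breaking policy and an alignment step.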
DISCLOSURE OF THE INVENTION
The present invention is a computer-implemented method for calculating word grouping accuracy values (260). The present invention receives (200) data, performs (204) an optical character recognition technique upon the received data, and creates (208) word groupings. The system then calculates (212) word grouping accuracy values (260) for the created word groupings.
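The four steps (receive, recognize, group, calculate) can be sketched as a small pipeline. This is an illustration under stated assumptions: `fake_ocr` is a stand-in for a real OCR technique, the per-character accuracy field is assumed to lie in [0, 1], words are split at whitespace, and a plain mean is used as a placeholder for the accuracy calculation. None of these names or choices come from the patent.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RecognizedChar:
    char: str
    accuracy: float  # assumed per-character confidence in [0, 1]

def fake_ocr(data):
    """Stand-in for step 204: the 'data' is already (char, accuracy) pairs."""
    return [RecognizedChar(c, a) for c, a in data]

def create_word_groupings(chars):
    """Step 208 (assumed rule): split the character stream at whitespace."""
    words, current = [], []
    for rc in chars:
        if rc.char.isspace():
            if current:
                words.append(current)
                current = []
        else:
            current.append(rc)
    if current:
        words.append(current)
    return words

def word_grouping_accuracy(word):
    """Step 212 placeholder: plain mean of the character accuracy values."""
    return mean(rc.accuracy for rc in word)

def process(data):
    """Steps 200 to 212: receive data, recognize, group, and score each word."""
    groupings = create_word_groupings(fake_ocr(data))
    return [("".join(rc.char for rc in w), word_grouping_accuracy(w)) for w in groupings]

# Hypothetical input: two words separated by a space.
data = [("c", 1.0), ("a", 0.5), ("t", 0.75), (" ", 1.0),
        ("s", 0.5), ("a", 0.5), ("t", 0.5)]
result = process(data)
```

Keeping each step in its own function makes it easy to swap the placeholder mean for another calculation without touching the grouping logic.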
Word grouping accuracy values (260) are calculated (212) by using character accuracy values (250) determined by the OCR technique. The present invention preferably uses these character accuracy values (250) to create a word grouping accuracy value (260). Various methods are employed to calculate the word accuracy (260), including binarizing the character accuracy values (250), modified averaging of the character accuracy values (250), and employing fuzzy visual displays of word grouping accuracy values (260). The calculated word grouping accuracy values (260) are adjusted based upon known OCR strengths and weaknesses, and based upon comparisons to stored word lists and the application of language rules. Word grouping accuracy values (260) are normalized and displayed or compared to a threshold. The words whose accuracy values (260) exceed the threshold may then be used to index the documents or provide search terms for searching within the document. If no word groupings exceed the threshold then the user is offered different options, including to clean the image by performing another OCR or scanning the document again, or to reset the threshold to a lower value.
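The calculation and thresholding described above might look roughly like the following sketch. The text names binarizing and modified averaging but this excerpt does not define them, so both formulas below are guesses made for illustration, as are the word-list bonus, the cutoff values, and every function name.

```python
from statistics import mean

def binarized_accuracy(char_values, cutoff=0.8):
    """Assumed scheme: score each character 1 (>= cutoff) or 0, then average."""
    return mean(1.0 if v >= cutoff else 0.0 for v in char_values)

def modified_average(char_values):
    """Assumed scheme: plain mean scaled by the weakest character value."""
    return mean(char_values) * min(char_values)

def adjust_for_word_list(word, accuracy, word_list, bonus=0.1):
    """Assumed adjustment: raise the value of words found in a stored list, capped at 1."""
    return min(1.0, accuracy + bonus) if word.lower() in word_list else accuracy

def select_index_terms(word_values, threshold):
    """Keep the words whose accuracy value exceeds the threshold."""
    return [word for word, acc in word_values if acc > threshold]

# Hypothetical character accuracy values for two recognized words.
char_values = {
    "scanner": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "docurnent": [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 1.0],
}
stored_words = {"scanner", "document"}

word_values = [
    (w, adjust_for_word_list(w, binarized_accuracy(v), stored_words))
    for w, v in char_values.items()
]
terms = select_index_terms(word_values, threshold=0.8)

if not terms:
    # Options named in the text: run another OCR, rescan, or lower the threshold.
    print("No word exceeded the threshold; re-run OCR, rescan, or lower the threshold.")
```

With these sample values only "scanner" is kept as an index term, because the misread "docurnent" scores 7/9 under the assumed binarized scheme and gets no word-list bonus.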
Boudreau Leo
Canon Kabushiki Kaisha
Fenwick & West LLP
Mariam Daniel G.