Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2001-07-30
2004-02-03
Corrielus, Jean M. (Department: 2172)
C707S793000, C707S793000
active
06687697
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to digital images, and more particularly to searching for objects, text or handwriting within a static digital image document, real-time stroke data, or the like.
BACKGROUND OF THE INVENTION
Technology today provides many ways for people to trade electronic documents, such as by disk, e-mail, network file transfers, and the like. In addition, many systems are available that allow a hard copy of a document to be digitized and made electronically available, such as through the use of optical scanners or fax machines. One problem with the digitized version of the document is that the electronic file is typically an image file rather than a textual file, and hence is much more difficult to edit by computer. Just as important, the image files cannot be electronically searched for instances of a particular text string or other string. Rather, the user is generally left to manually view the image file representation of the document, looking for the desired term. This method is labor-intensive and subject to human error.
Consumer software applications may include an optical character recognition (OCR) component to convert the image file to a textual file. Using OCR applications allows a user to search for particular instances of a query string; however, the confidence of actually finding every instance of that query string may be low. The recognition process occasionally mis-recognizes letters or combinations of letters with similar shapes, causing errors to appear in the resulting text. Typical error rates on a high-quality image can vary widely depending on the complexity of the layout, scan resolution, and the like. On average, for common types of documents, error rates for OCR are often in the range of 1% to 10% of the total characters on the page. These errors greatly diminish the user's confidence in locating every instance of a query string within a document that started out as an image file. A solution to this problem has eluded those skilled in the art.
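To illustrate why per-character error rates in this range undermine exact search, the following minimal sketch estimates the chance that a query string survives recognition unchanged. It assumes errors on different characters are independent, which is a simplification made only for this example and is not stated in the source; the function name is hypothetical.

```python
def prob_string_intact(char_error_rate: float, length: int) -> float:
    """Probability that all `length` characters are recognized correctly.

    Assumes independent per-character errors (an illustrative assumption).
    """
    return (1.0 - char_error_rate) ** length


# Compare the low and high ends of the 1% to 10% range for an 8-character string.
for rate in (0.01, 0.10):
    p = prob_string_intact(rate, 8)
    print(f"error rate {rate:.0%}: P(8-character string intact) = {p:.3f}")
```

Under that assumption, an 8-character string is recognized without error about 92% of the time at a 1% error rate, but only about 43% of the time at a 10% error rate, so an exact-match search would miss many true occurrences.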
SUMMARY OF THE INVENTION
Briefly stated, the present invention provides a system and method for improved string matching within a document created under noisy channel conditions. The invention provides a method for identifying, within a document created by a noisy conversion process (e.g., OCR), potential matches to a user-defined query and the likelihood that the potential matches satisfy the query. Satisfaction can be determined by identifying whether any difference between the potential match and the query is likely the result of an error in generating the document. That identification may be made with reference to a pre-constructed table containing data indicating the probability that a particular error occurred during the noisy document conversion. Additionally, the invention provides optional steps to further assess the likelihood of the match. Such optional steps may include the use of OCR confidence data, word heuristics, language models, and the like.
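The role of the pre-constructed table can be sketched as follows. This is an illustration rather than the patented implementation: the table entries, the probabilities `P_CORRECT` and `P_UNLISTED`, the default threshold, and the function names are all hypothetical, and the sketch handles only same-length strings with single-character substitutions.

```python
# Hypothetical confusion table: (intended, recognized) -> probability that the
# noisy process produced `recognized` where the source had `intended`.
# All numbers are invented for illustration.
CONFUSION = {
    ("l", "1"): 0.05,
    ("O", "0"): 0.04,
    ("e", "c"): 0.02,
}
P_CORRECT = 0.95   # assumed probability that a character is recognized correctly
P_UNLISTED = 1e-4  # assumed floor for substitutions missing from the table


def match_likelihood(query: str, candidate: str) -> float:
    """Likelihood that `candidate` is a noisy rendering of `query`.

    Simplification: insertions, deletions, and multi-character confusions
    such as "rn" -> "m" are out of scope, so unequal lengths score 0.
    """
    if len(query) != len(candidate):
        return 0.0
    likelihood = 1.0
    for intended, recognized in zip(query, candidate):
        if intended == recognized:
            likelihood *= P_CORRECT
        else:
            likelihood *= CONFUSION.get((intended, recognized), P_UNLISTED)
    return likelihood


def is_match(query: str, candidate: str, threshold: float = 1e-3) -> bool:
    """Accept the candidate when its likelihood meets the threshold."""
    return match_likelihood(query, candidate) >= threshold
```

With these made-up numbers, `is_match("hello", "he11o")` accepts the candidate because `l` to `1` is a listed confusion, while `is_match("hello", "hexxo")` rejects it because `l` to `x` is not.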
In one aspect, the invention provides a system for identifying string candidates and analyzing the probability that each string candidate matches a user-defined query string. In one implementation, a document text file is created to represent a document image file through a noisy conversion process, such as OCR. A find engine searches for matches to a query string to within a defined tolerance. Any match that differs from the query string by no more than the defined tolerance is identified as a candidate. The find engine then analyzes the difference between each candidate and the query string to determine whether the difference was likely caused by an error in the noisy process. In that determination, reference is made to a confusion table that associates common errors in the noisy process with probabilities that those errors occurred. Candidates meeting a probability threshold are identified as matches. Optionally, this probability threshold may be adjusted by the user to dynamically narrow or widen the scope of possible matches returned by the find engine. The invention further provides for analysis options including word heuristics, language models, and OCR confidences.
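The candidate-identification step, finding strings that differ from the query by no more than a defined tolerance, can be sketched with ordinary edit distance. This is an assumption-laden illustration: the source does not specify the distance measure or the unit of comparison, so the sketch uses plain dynamic-programming Levenshtein distance over whitespace-separated words, and the names `edit_distance` and `find_candidates` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from `a`
                curr[j - 1] + 1,     # insert a character into `a`
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


def find_candidates(text: str, query: str, tolerance: int = 1):
    """Return (word_index, word) pairs within `tolerance` edits of `query`."""
    return [
        (index, word)
        for index, word in enumerate(text.split())
        if edit_distance(word, query) <= tolerance
    ]
```

Raising `tolerance` widens the candidate set: with a tolerance of 1 the query `model` picks up `mode1`, and with a tolerance of 2 it also picks up `rnodel`. Each candidate would then be scored against the confusion table before being reported as a match.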
In another aspect, the invention may be implemented as a computer-readable medium having computer-executable instructions for performing steps including receiving a query string request to locate every instance of the query string in a document image file, converting the document image file into a document text file, parsing the document text file to identify data strings that may be the query string, and analyzing the data strings to identify a probability that each of the data strings is the query string.
REFERENCES:
Myers, G., “A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming,” Lecture Notes in Computer Science, Issue 1448, 1998, pp. 1-13.
Collins-Thompson Kevyn
Schweizer Charles B.
Corrielus Jean M.
Microsoft Corporation