Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-06-09
2002-11-12
Coby, Frantz (Department: 2171)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000, C707S793000, C707S793000
Reexamination Certificate
active
06480838
ABSTRACT:
FIELD OF THE INVENTION
This invention relates to methods, apparatus and computer products for computer database searching and, more particularly, methods, apparatus and computer products for searching documents created using optical character recognition techniques.
BACKGROUND OF THE INVENTION
Much of the information upon which business and government rely is, and has been, stored on paper. With the advent of readily accessible wide area networks, high-speed optical scanners, and cheap mass storage, there has been an attempt in recent years to make paper information machine-accessible.
Machine-accessible information has many advantages over paper. Electronic data storage is far less expensive than filing cabinets in storage rooms, especially once rent is considered. Retrieval times are measured in seconds or tenths of seconds rather than minutes, hours, or even days, particularly for information in large archives. Information replication is trivial, and multiple people can access a single document simultaneously. Unfortunately, the task of converting the mass of existing paper information into machine-accessible form is daunting.
One method scans each document using an optical scanner and automatically processes each document as it is scanned. An optical scanner creates an electronic image of a document. Optical character recognition (OCR) software processes the electronic image and creates an electronic text file representing the document. “Indexing” software reads each text file and creates an index for all of the documents. A search program can then use the index to locate documents that contain a specified word, or combination of words. The process of indexing and searching documents is referred to as full-text indexing and retrieval.
Full-text indexing and retrieval has two powerful assets: it is fully automatic (and thus relatively inexpensive), and is based directly upon the actual contents of the document scanned. High-end retrieval systems may include context sensitivity, which permits the location of documents that contain related words, in situations where a user specifies the subject of a document but not its exact phrasing. World Wide Web search engines use full-text retrieval engines to search millions of electronic documents.
Search engines sometimes fail to locate documents that have been created using scanners and OCR software. This is due to the existence of numerous errors in large databases made up of scanned documents. A large database may include more than a million documents and ten million pages. To search for a document, a user must specify a combination of words, perhaps three or more, that either make a document unique, or at least restrict the list of search results to a manageable size. If a potential target document includes errors in the keywords used for the search, the search engine will not locate the document. OCR programs often produce several errors per page. An example of such an error would be a letter, e.g., an upper case “I”, misrepresented as a similar letter, e.g. a lower case “l” (el).
One solution to the problem is a “fuzzy search.” Fuzzy searching is based on the concept that words containing errors are structurally similar to the true version of the word. For example, “internet” and “intemet” are structurally similar. The first word can be changed into the second by deleting one letter and substituting an “m” for the other. Fuzzy search routines count the changes necessary to change one word into another. If few enough changes are required, a match is reported. This is computationally expensive because, during a search, every unique word in the database is individually compared to the key word to determine whether there is a match. Because OCR errors frequently produce “unique words,” the database containing the full-text index of a large archive can have more than a million unique words to compare to each key word. Even on a fast server, such a search takes time.
In addition to the amount of time it takes, fuzzy searching can result in a large volume of “hits.” In a large database, many searches return thousands of matches. “Internet” is similar to “intemet,” but so is “intem,” “undernet”, and even “international”. A search for “boat” might match “coat,” even though an OCR program is very unlikely to confuse a “b” for a “c.”
It is desirable to have a mechanism that allows a search engine to accurately locate electronic documents that have been created using OCR software. Preferably, such a mechanism will recognize errors that are typically produced by OCR software and account for errors having the highest probability of occurrence. Additionally, a preferable mechanism will minimize the amount of processing that occurs when a search is requested by a user, in order to reduce the time of each search.
SUMMARY OF THE INVENTION
In accordance with this invention, a method and computer product for processing a search request in order to compensate for characters and character strings improperly interpreted during optical character recognition (OCR) scanning is provided. After an alphanumeric search request is received, the mechanism of the invention determines variant words associated with the received alphanumeric search request according to a predefined table of possible OCR substitutions, the OCR substitutions' probability of occurrence, and a predefined threshold of probability of occurrences. A database with OCR scanned documents is then searched for the variant words.
In accordance with other aspects of the invention, variant words are determined by determining word segments that represent OCR interpretations of portions of the search request. A cumulative probability for each word segment is determined and, if the cumulative probability for a word segment is below a predetermined threshold, the word segment is rejected as a variant word.
In accordance with further aspects of the invention, a tree data structure is created, having branch nodes and substitution nodes. Each branch node represents a possible delineation of a character during OCR processing. Each substitution node represents a possible OCR substitution for the character corresponding to the parent branch node. The substitution nodes along a path from the root to a leaf node form a variant word. The cumulative probability for a substitution node is determined by multiplying the probability of occurrence for the node by the cumulative probability of occurrence for the node's grandparent substitution node.
As will be readily appreciated from the foregoing summary, the invention provides a new and improved method, apparatus and computer product for word searching of electronic documents produced using optical character recognition. The invention reduces the number of documents that are missed during a search due to OCR errors when the documents are originally translated into electronic form. The invention also reduces the amount of time required to perform a search by minimizing the amount of processing that is performed after the search request is received. Finally, because the variant words constructed in this manner are rarely legitimate words in the natural language of the database, the number of false “hits” is greatly reduced.
REFERENCES:
patent: 3688266 (1972-08-01), Watanabe et al.
patent: 3969698 (1976-07-01), Bollinger et al.
patent: 4654875 (1987-03-01), Srihari et al.
patent: 4977602 (1990-12-01), Beato
patent: 4985863 (1991-01-01), Fujisawa et al.
patent: 5418864 (1995-05-01), Murdock et al.
patent: 5461488 (1995-10-01), Witek
patent: 5465309 (1995-11-01), Johnson
patent: 5600835 (1997-02-01), Garland et al.
patent: 5606690 (1997-02-01), Hunter et al.
patent: 5625719 (1997-04-01), Fast et al.
patent: 5625721 (1997-04-01), Lopresti et al.
patent: 5802515 (1998-09-01), Adar et al.
patent: 5875263 (1999-02-01), Froessl
Christensen O'Connor Johnson & Kindness PLLC
Coby Frantz
LandOfFree
System and method for searching electronic documents created... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with System and method for searching electronic documents created..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for searching electronic documents created... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2977750