Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1999-05-12
2003-12-09
Homere, Jean R. (Department: 2177)
Data processing: database and file management or data structures
Database design
Data structure types
C382S159000, C382S229000, C382S321000
Reexamination Certificate
active
06662180
ABSTRACT:
TECHNICAL FIELD
The present invention relates generally to the searching for query keywords in large databases of automatically recognized text and, more specifically, to a method for searching using approximate matching techniques which compensate for errors in text generated by an optical character recognition (OCR) system or a speech recognition (SR) system. The invention indexes the databases in a way that facilitates and speeds up the retrieval process.
BACKGROUND OF THE INVENTION
Large text retrieval systems are often built by extracting text information from documents using OCR or from spoken words, using a speech recognition system and inserting the extracted text into a database. OCR and SR devices are prone to errors and hence the database may contain erroneous words. This makes it difficult to retrieve documents that contain query words given by a user.
Today, it is common practice to store a large number of paper documents into a database. These documents need to be retrieved later by searching for some words that appear in the document. One way to achieve this is by extracting words from digitized documents using OCR technology. These words are then stored into the database and are used for searching and retrieval.
In addition, speech recognition systems are becoming more widely used. With a commercially available speech recognition system, a user may dictate a new document for the database or convert an existing document from text to machine readable form by simply reading the document into a microphone connected to a personal computer. While these systems are generally accurate with sufficient training, it is well known that some errors do occur. Typically, the speech recognition system allows these errors to be corrected manually by the user. A user may not, however, detect all of the errors and, so, the resulting database may include words which are not in the original text document.
Because neither OCR technology nor speech recognition technology is 100% error free, the scanned database may contain words that were misspelled in the conversion process. The conventional method of using direct searching techniques may not locate all of the appropriate documents in the database that contain a given query word because the corresponding word in the database is misspelled in some of the documents.
Another problem with the conventional method of using direct searching techniques is that the size of the database may be very large, and hence the search process may be very slow.
The conventional solution is insufficient to find words in the database that have been recognized incorrectly and are misspelled. Also, the conventional solution is too slow in searching large database. There is a need for a more efficient search algorithm.
SUMMARY OF THE INVENTION
To meet this and other needs, and in view of its purposes, the present invention provides a simple and effective method for searching for a query word in a hierarchical data structure. The data structure has branch nodes and leaf nodes, each branch node represents a respective portion of one or more words and each leaf node represents a word. The data structure is searched for each query word by selecting the first letter of the query word and also selecting a root node in the hierarchical data structure as the current node. All possible child nodes of the current node are identified. Respective estimated probability values for matching respective components of the query word with the components associated with the nodes in the path taken through the hierarchical data structure is calculated for each identified child node. The identified child nodes are then added to a list of candidate nodes The candidate node with the highest probability value is selected as the current node and is then deleted from the list of candidate nodes. If a leaf node has been reached, then a determination is made whether to store the word into a list of best matches. Processing repeats itself for each portion of the query word. When all portions of the query word have been matched, the matched words with their respective probability values are stored into the list of best matches.
It is to be understood that both the foregoing general description and the following detailed description are exemplary, but are not restrictive, of the invention.
REFERENCES:
patent: 3969698 (1976-07-01), Bollinger et al.
patent: 5075896 (1991-12-01), Wilcox et al.
patent: 5257323 (1993-10-01), Melen et al.
patent: 5377281 (1994-12-01), Ballard et al.
patent: 5418864 (1995-05-01), Murdock et al.
patent: 5455872 (1995-10-01), Bradley
patent: 5500920 (1996-03-01), Kupiec
patent: 5524240 (1996-06-01), Barbará et al.
patent: 5528701 (1996-06-01), Aref
patent: 5553284 (1996-09-01), Barbará et al.
patent: 5649023 (1997-07-01), Barbará et al.
patent: 5768423 (1998-06-01), Aref et al.
patent: 5850480 (1998-12-01), Scanlon
patent: 5889897 (1999-03-01), Medina
patent: 5950158 (1999-09-01), Wang
patent: 5963902 (1999-10-01), Wang
patent: 6047093 (2000-04-01), Lopresti et al.
patent: 6304674 (2001-10-01), Cass et al.
patent: 6351574 (2002-02-01), Yair et al.
Katz, S.M. “Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer”, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-35, No. 3, Mar. 1987.*
Seymore, K. and R. Rosenfeld “Scalable Backoff Language Models”, Proceedings of the 4thInternational Conference on Spoken Language Processing (ICSLP 96), pp. 232-235, Oct. 3-6, 1996.*
Nivre, J. “Sparse Data and Smoothing in Statistical Part-of-Speech Tagging”, Journal of Quantitative Linguistics, vol. 7, No. 1, pp. 1-17, 2000.*
Garica-Varea, I, Casacuberta, F. and Ney, H. “An Iterative, DP-Based Search Algorithm for Statistical Machine Translation”, Proceedings of the 5thInternational Conference on Spoken Language Proceesing (ICSLP '98), pp. 1135-1138, v.4, Dec. 1998.*
Seymore, K. and Rosenfeld, R. “Scalable Backoff Language Models”, Proceedings of the 4thInternational Conference on Spoken Language Processing (ICSLP '96), pp. 232-235, Oct. 3-6, 1996.*
Seymore, K. and Rosenfeld, R. “Scalable Trigram Backoff Language Models”, Technical Report CMU-CS-96-139, School of Computer Science, Carnegie Mellon University, May 1996.*
Charniak, E. et al., “Equations for Part-of-Speech Tagging”, Proceedings of the 11th National Conference on Artificial Intelligence, pp. 784-789, 1993.*
Aho et al. “Data Structures and Algorithms”, Reading:Addison-Wesley, 1982, pp. 163-169, QA76.9.D35A38 1982.*
Aref et al., “The Handwritten Trie: Indexing Electronic Ink”, Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, May 22-25, 1995, pp. 151-162.*
Kamel et al. “Retrieving Electronic Ink by Content”, Proceedings of the 1996 IEEE International Workshop on Multimedia Database Management Systems, Aug. 14-16, 1996, pp. 54-61.*
Knuth, Donald E., The Art of Computer Programming: vol. 3/Sorting and Searching, Reading:Addison-Wesley, 1997, pp. 492-512, QA76.6.K64 1997.*
H. Shang et al.; “Tries for Approximate String Matching” IEEE Transactions on Knowledge and Data Engineering, Aug. 1996, IEEE, USA, vol. 8 No. 4, pp. 540-547.
Andrew P. Berman; “A New Data Structure for Fast Approximate Matching”; Technical Report Mar. 2, 1994; pp. 1-10; Department of Computer Science and Engineering, University of Washington; http://citeseer.nj.nec.com/berman94ne.
EPO Search Report, Jan. 25, 2002.
Aref Walid G.
Kanai Junichi
Homere Jean R.
Matsushita Electric - Industrial Co., Ltd.
RatnerPrestia
Wassum Luke S
LandOfFree
Method for searching in large databases of automatically... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method for searching in large databases of automatically..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method for searching in large databases of automatically... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3110049