Image analysis – Pattern recognition – On-line recognition of handwritten characters
Reexamination Certificate
1998-12-23
2002-05-28
Boudreau, Leo (Department: 2621)
C382S181000, C382S185000, C382S186000, C382S229000, C382S305000, C707S793000, C704S002000, C704S003000, C704S005000, C358S403000
active
06396951
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to obtaining query data for information retrieval.
BACKGROUND AND SUMMARY OF THE INVENTION
Most multilingual speakers can read some languages more easily than they can generate correct utterances and written expressions in those languages. Existing information retrieval systems, however, require that the user formulate a query in the language of the documents (the target language, or L2) and, normally, physically type in the query. Thus, as well as requiring a query formulation step, such systems do not allow users to indicate their search interests in their native language (L1).
Ballesteros, L., and Croft, W. B., “Dictionary Methods for Cross-Lingual Information Retrieval”, in Proceedings of the 7th International DEXA Conference on Database and Expert Systems, 1996, pp. 791-801, disclose techniques in which a user can query in one language but perform retrieval across languages. Base queries drawn from a list of text retrieval topics were translated using bilingual, machine-readable dictionaries (MRDs). Pre-translation and post-translation feedback techniques were used to improve retrieval effectiveness of the dictionary translations.
EP-A-725,353 discloses a document retrieval and display system which retrieves source documents in different languages from servers linked by a communication network, translates the retrieved source documents as necessary, stores the translated documents, and displays the source documents and translated documents at a client device connected to the communication network.
U.S. Pat. No. 5,748,805 discloses a technique that provides translations for selected words in a source document. An undecoded document image is segmented into image units, and significant image units such as words are identified based on image characteristics or hand markings. For example, a user could mark difficult or unknown words in a document. The significant image units are then decoded by optical character recognition (OCR) techniques, and the decoded words can then be used to access translations in a data base. A copy of the document is then printed with translations in the margins opposite the significant words.
The invention addresses a problem that arises with information retrieval where a user has a document in one language (L1) and wishes to access pertinent documents or other information written in a second language (L2) and accessible through a query-based system. Specifically, the invention addresses the problem of generating a query that includes expressions in the second language L2 without translating or retyping the document in the first language L1, referred to herein as the document-based query problem. The document-based query problem arises, for example, where the user cannot translate the document from L1 to L2, where the user is unable to type or prefers not to type, where the user does not have access to a machine with a keyboard on which to type, or where the user does not know how to generate a query that includes expressions in L2.
The invention alleviates the document-based query problem by providing a new technique that scans the document and uses the resulting text image data. The new technique performs automatic recognition to obtain text code data with a series of element codes defining expressions in the first language. The new technique performs automatic translation on a version of the text code data to obtain translation data indicating counterpart expressions in the second language. The new technique uses the counterpart expressions in the second language to automatically obtain query data defining a query for use in information retrieval.
The new technique can be implemented with a document that is manually marked to indicate a segment of the text, and text image data defining the indicated segment can be extracted from image data defining the document.
Automatic recognition can be implemented with optical character recognition (OCR), and automatic language identification can be performed to identify the probable predominant language so that language-specific OCR can be performed. The OCR results can also be presented to the user, who can interactively modify them to obtain the text code data.
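By way of illustration only, the language identification step can be sketched as follows. This is a minimal sketch, assuming a short-word scoring approach; the word lists, the function name identify_language, and the two-letter language codes are hypothetical and are not taken from the description. It operates on already-decoded text, so it presumes that some preliminary recognition result is available.

```python
import re

# Hypothetical lists of frequent short words per language (illustration only).
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is"},
    "fr": {"le", "la", "les", "et", "de", "des", "est"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
}

def identify_language(text):
    """Return the code of the probable predominant language, or None."""
    # Keep runs of letters only, lowercased.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Score each language by how many tokens appear in its short-word list.
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

The returned code could then be used to select a language-specific OCR configuration; how that selection is made is left open here.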
Automatic translation can be implemented with a translation dictionary. The text code data can be tokenized to obtain token data; the token data can be disambiguated to obtain disambiguated data with parts of speech for words; the disambiguated data can be lemmatized to obtain lemmatized data indicating, for each of a set of words, either the word or a lemma for the word; and the lemmatized data can be translated. Translation can be done by looking up the words and lemmas in a bilingual translation dictionary.
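The sequence of tokenizing, disambiguating, lemmatizing, and dictionary lookup described above can be sketched as follows. This is a minimal sketch under stated assumptions: the tag table, lemma table, and bilingual dictionary are toy data invented for illustration (French as L1, English as L2), and all function names are hypothetical. A practical implementation would use a context-sensitive part-of-speech tagger and a full machine-readable dictionary.

```python
import re

# Toy resources, invented for illustration only.
TAGS = {"les": "DET", "chats": "NOUN", "noirs": "ADJ", "mangent": "VERB"}
LEMMAS = {
    ("chats", "NOUN"): "chat",
    ("noirs", "ADJ"): "noir",
    ("mangent", "VERB"): "manger",
}
BILINGUAL = {
    ("chat", "NOUN"): ["cat"],
    ("noir", "ADJ"): ["black", "dark"],
    ("manger", "VERB"): ["eat"],
}

def tokenize(text):
    """Split text code data into lowercase word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())

def disambiguate(tokens):
    """Attach a part of speech to each token (toy lookup, no context)."""
    return [(token, TAGS.get(token, "UNK")) for token in tokens]

def lemmatize(tagged):
    """Replace each word by its lemma when one is known, else keep the word."""
    return [(LEMMAS.get((word, tag), word), tag) for word, tag in tagged]

def translate(lemmatized):
    """Look up each (lemma, part of speech) pair in the bilingual dictionary."""
    translations = {}
    for lemma, tag in lemmatized:
        if (lemma, tag) in BILINGUAL:
            translations[lemma] = BILINGUAL[(lemma, tag)]
    return translations

def counterpart_expressions(text):
    """Run the four stages in the order given in the description."""
    return translate(lemmatize(disambiguate(tokenize(text))))
```

Words with no dictionary entry, such as the determiner in this toy data, are simply dropped from the result; whether to drop or pass through untranslated words is a design choice that the sketch does not settle.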
The query data can define the query in a format suitable for an information retrieval engine. The query data can then be provided to the information retrieval engine.
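As an illustration of formatting query data for a retrieval engine, the following minimal sketch assumes a hypothetical engine that accepts Boolean queries with AND, OR, parentheses, and quoted phrases. The query syntax and the function name build_query are assumptions, not a format specified in the description. Alternative translations of one source word are joined with OR, and the resulting groups are joined with AND.

```python
def build_query(translations):
    """Build a Boolean query string from {source word: [L2 alternatives]}."""
    groups = []
    for alternatives in translations.values():
        if not alternatives:
            continue  # nothing to search for this source word
        # Quote multi-word expressions so they are treated as phrases.
        terms = ['"%s"' % alt if " " in alt else alt for alt in alternatives]
        group = " OR ".join(terms)
        # Parenthesize only when there is more than one alternative.
        groups.append("(%s)" % group if len(terms) > 1 else group)
    return " AND ".join(groups)
```

For example, build_query({"chat": ["cat"], "noir": ["black", "dark"]}) yields the string cat AND (black OR dark). Joining groups with AND favors precision; an engine that ranks results might instead be given all terms joined with OR.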
The new technique can also be implemented in a system that includes a scanning device and a processor connected for receiving image data from the scanning device. After receiving an image of a segment of text in the first language from a scanned document, the processor performs automatic recognition to obtain text code data, performs automatic translation on a version of the text code data to obtain translation data indicating expressions in the second language, and uses the expressions to automatically obtain query data defining a query for use in information retrieval.
An advantage of the invention is that it eliminates the need to know how an information interest (or query) should be formulated in the target language, as well as the need to compose and type in the query. In certain embodiments of the invention, the user need only designate a portion of an existing document, e.g. a hardcopy document, that is of interest.
The following description, the drawings, and the claims further set forth these and other aspects, objects, features, and advantages of the invention.
REFERENCES:
patent: 5272764 (1993-12-01), Bloomberg et al.
patent: 5301109 (1994-04-01), Landauer et al.
patent: 5325091 (1994-06-01), Kaplan et al.
patent: 5523946 (1996-06-01), Kaplan et al.
patent: 5692073 (1997-11-01), Cass
patent: 5694559 (1997-12-01), Hobson et al.
patent: 5748805 (1998-05-01), Withgott et al.
patent: 5812818 (1998-09-01), Adler et al.
patent: 5890103 (1999-03-01), Carus
patent: 5978754 (1999-11-01), Kumano
patent: 6006221 (1999-12-01), Liddy et al.
patent: 6067510 (2000-05-01), Kimura et al.
patent: 0 544 434 (1993-06-01), None
patent: 0 583 083 (1994-02-01), None
patent: 0 590 858 (1994-04-01), None
patent: 0 762 298 (1995-09-01), None
patent: 0 725 353 (1996-08-01), None
patent: 0 741 487 (1996-11-01), None
patent: 050466059 (1993-02-01), None
patent: 07044564 (1995-02-01), None
patent: 07160715 (1995-06-01), None
patent: 08305728 (1996-11-01), None
patent: 09101991 (1997-04-01), None
patent: WO 97/18516 (1997-05-01), None
Ballesteros, Lisa et al. “Dictionary Methods for Cross-Lingual Information Retrieval,” in Proceedings of the 7th International DEXA Conference on Database and Expert Systems, pp. 791-801.
Beesley, Kenneth R. “Language Identifier: A Computer Program for Automatic Natural-Language Identification of On-Line Text,” In the Proceedings of the 29th Annual Conference of the American Translators Association, 1988.
Grefenstette, Gregory “Comparing Two Language Identification Schemes,” In the Proceedings of 3rd International Conference on Statistical Analysis of Textual Data (JADT 1995), Rome, Italy; Dec., 1995, vol. II, pp. 263-268.
De Marcken, Carl G. “Parsing the Lob Corpus,” In the Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics; Jun. 6-9, 1990; Pittsburgh, PA: pp. 243-251.
McEnery, Tony et al. Corpus Linguistics, Tony McEnery and Andrew Wilson, Ed., Edinburgh University Press, Jul. 1996, pp. 117-145 and 189-192.
Porter, M.F. “An Algorithm for Suffix Stripping,” Program, vol. 14, No. 3, Jul. 1980, pp. 130-137.
Salton, Ger
Boudreau Leo
Mariam Daniel G.
Oliff & Berridge, PLC
Xerox Corporation