Image analysis – Pattern recognition – Context analysis or word recognition
Reexamination Certificate
1998-12-29
2001-07-31
Johns, Andrew W. (Department: 2721)
Image analysis
Pattern recognition
Context analysis or word recognition
Reexamination Certificate
active
06269189
ABSTRACT:
FIELD OF THE INVENTION
The invention relates to techniques that find selected character strings in text.
BACKGROUND AND SUMMARY OF THE INVENTION
U.S. Pat. No. 5,748,805 discloses a technique that provides translations for selected words in a source document. An undecoded document image is segmented into image units, and significant image units such as words are identified based on image characteristics or hand markings. For example, a user could mark difficult or unknown words in a document. The significant image units are then decoded by optical character recognition (OCR) techniques, and the decoded words can then be used to access translations in a data base. A copy of the document is then printed with translations in the margins opposite the significant words.
U.S. Pat. No. 5,748,805 also discloses a technique applicable in a reading machine for the blind. The user can designate key words on a key word list. The user designated key words can then be found in the document, such as by OCR techniques, and regions around the key word can be identified as significant. Significant words can be decoded using OCR techniques and supplemental data can be retrieved and provided in a Braille version or speech synthesized version to the user.
Karttunen, L., “Directed Replacement”, in Proceedings of the 34
th
Annual Meeting of the Association for Computational Linguistics
, Santa Cruz, Calif., 1996, pp. 108-116, discloses a family of directed replace operators, written “UPPER @→>LOWER”, that can be implemented in finite state transducers (FSTs). A transducer is disclosed that transduces an input string from left to right, making only the longest possible replacement at each point. This can be thought of as a procedure that copies the input until it finds an instance of UPPER; then the procedure selects the longest matching substring, which is rewritten as LOWER, and proceeds from the end of that substring without considering any other alternatives. The result is that the input string is parsed into a sequence of substrings that are either copied to the output unchanged or replaced by some other strings. Applications include tokenizing transducers that insert an end of token mark at the end of each multiword token in a list (which may include overlapping expressions), filtering transducers that divide an input stream into regions that are passed on and regions that are ignored, and marking transducers that mark instances of a regular language without otherwise changing text.
Chanod, J.-P., and Tapanainen, P., “A Lexical Interface for Finite-State Syntax”, Technical Report MLTT-025, Rank Xerox Research Centre, Grenoble Laboratory, 1996, pp. 1-22, describe a lexical interface as implemented and used for development of a French constraint grammar. The lexical interface analyzes sentences. Tokenization uses (1) a multiword expression lexicon that encodes basic multiword expressions (MWEs), numerical or time expressions, and general regular expressions and (2) a tokenizing automation that recognizes simple tokens without considering whether they belong to the language. Morphological analysis is performed using several ordered lexical transducers, referred to as a lexicon stack, with the first transducer to recognize a token determining its lexical interpretation. The top lexicon in the stack is the multiword expression lexicon described above; second are one or more real lexicons; third is a guesser that accepts anything and always gives an analysis; and, lastly, the default lexicon matches any string to a noun.
During tokenization, a sentence string is read from the left, and the longest substring matched either in the tokenizing automation or the multiword expression lexicon is identified as a token; the process iterates with the rest of the sentence. Tokens matched in the automation are then looked up in the lexicon stack to obtain morpho-syntactic information, ignoring the top lexicon; tokens matched in the top lexicon do not require further lookup, as the lexicon is a transducer that provides the relevant morpho-syntactic information. The lexicons produce a finite-state network for each token, and the analyses are concatenated to form a finite-state network with all possible readings for a sentence.
These techniques are further described in Chanod, J.-P., and Tapanainen, P., “A Non-deterministic Tokeniser for Finite-State Parsing”, in
ECAI
'96, 12
th
European Conference on Artificial Intelligence, Workshop on Extended Finite-State Models of Language
, 1996, pp. 10 and 11 and Chanod, J.-P., and Tapanainen, P., “A Robust Finite-State Parser for French”,
ESSLLI
'96
Workshop
on
Robust Parsing
, Aug. 12-16, 1996, Prague, Czech Republic.
The invention addresses problems that arise in automatically finding selected words, multi-word expressions, or other strings of characters in text. Automatic string finding is useful in a variety of applications that use information about the presence, locations, and frequencies of selected strings in text. For example, an application could insert a translation, a hypertext link, or another annotation in association with each selected string that occurs in a text.
If an automatic string finding system were to use elaborate linguistic tools such as general lexicons, part-of-speech disambiguators, and syntactic analyzers, the system would be memory- and computation-intensive and, therefore, slow and expensive. In addition, unless it were made even more complex, the system would be specific to one language, and possibly to a single domain of terminology. These problems are collectively referred to herein as the “complexity problems”.
The invention is based on the discovery of new techniques for finding character strings in text. The new techniques alleviate the complexity problems, providing simpler automatic string finding. The techniques automatically search a text to find character strings in the text that match any of a list of selected strings. In doing so, the techniques perform a series of iterations, each with a starting point in the text. Each iteration determines whether its starting point is followed by a character string that matches any of the selected strings and that ends at a probable string ending. Then, the iteration finds the next iteration's starting point at a probable string beginning.
Each iteration thus performs an operation that determines whether a starting point in the text is followed by a matching character string ending at a probable string ending. Each iteration also performs an operation that finds a probable string beginning as a starting point for the next iteration. Both of these operations can be performed very rapidly with tools that are already available or can be easily constructed. Therefore, the new techniques can be readily implemented to obtain a simple, fast automatic string finder.
To find probable string beginnings and endings, for example, the new techniques can perform tokenizing, such as with a finite state tokenizer, to find probable token breaks such as word separators. Where each of the selected strings begins and ends at a token break, as would be true for words and multiword expressions, tokenization will find probable string beginnings and endings.
To determine whether a starting point is followed by one of the selected character strings, the new techniques can further use a finite state data structure, such as a finite state transducer (FST) or a finite state automation (FSA), as a lexicon of the selected strings. The finite state data structure can include acceptance data that are accessed after matching an acceptable character string. The acceptance data thus indicate that the string is acceptable. An FST could have a level that accepts only selected character strings, and an FSA can similarly accept only selected character strings.
The techniques can access the finite state data structure and the tokenizer with the string of characters that follows a starting point, seeking to match each character in turn. Based on any failure to match a character and on any acceptan
Azarian Seyed
Johns Andrew W.
Xerox Corporation
LandOfFree
Finding selected character strings in text and providing... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Finding selected character strings in text and providing..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Finding selected character strings in text and providing... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2557493