Data processing: speech signal processing – linguistics – language – Linguistics
Reexamination Certificate
1998-06-25
2002-06-04
Edouard, Patrick N. (Department: 2644)
Data processing: speech signal processing, linguistics, language
Linguistics
C707S793000
active
06401060
ABSTRACT:
TECHNICAL FIELD
The present invention relates to word processing systems, and more particularly relates to detecting typographical errors and generating replacement strings in documents that contain Japanese text.
BACKGROUND OF THE INVENTION
Typographical (spelling) checkers, style checkers, and grammar checkers are common in modern word processing programs. The Japanese language presents interesting problems in this area because of several characteristics of the written language. First, the Japanese language employs several different alphabets, which may be used in combination. Second, Japanese text is typically written without any spaces between words. Third, the Japanese language has a highly productive morphology, which means Japanese words can undergo significant spelling changes to indicate case, tense, politeness, aspect, mood, voice, and so on.
The most commonly used Japanese alphabets (or writing systems) are Kanji, Hiragana, and Katakana. The Kanji alphabet includes pictographs or ideographic characters that were adopted from the Chinese alphabet. Hiragana and Katakana are phonetic alphabets that do not include any characters common to each other or to Kanji. Hiragana is used to spell words of Japanese origin. Katakana is used to spell words of foreign (primarily western) origin. Kanji pictographs are analogous to shorthand variants of Hiragana words in that any Kanji word can be written in Hiragana, though the converse is not true. A single Japanese word can include characters from more than one alphabet.
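As a rough illustration of how software can tell the three writing systems apart, the following sketch classifies characters by Unicode block. The function names are hypothetical and are not taken from the patent; the block ranges are the standard Hiragana, Katakana, and CJK Unified Ideographs blocks, and the sketch ignores extension blocks and half-width forms.

```python
def script_of(ch: str) -> str:
    """Return the writing system of one character, judged by Unicode block."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs (Kanji)
        return "kanji"
    return "other"


def scripts_in(word: str) -> set[str]:
    """Return the set of writing systems that appear in a word."""
    return {script_of(ch) for ch in word}
```

For example, `scripts_in("食べる")` reports both Kanji and Hiragana, reflecting the point that a single word can mix alphabets.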
One of the functions performed by typographical checkers is to detect malformed phrases, or words, and suggest replacement text strings. The types of malformed words detected by typographical checkers include (using the example of “hello”): 1) transposed characters (e.g., “helol”); 2) missing characters (e.g., “hllo”); 3) duplicate characters (e.g., “heello”); 4) extra characters (e.g., “hepllo”); and 5) a wrong character (e.g., “hwllo”). One approach to performing typographical checking for the Japanese language is to use a dictionary look-up. This approach looks up every word or stem in the document and compares it against a Japanese dictionary to determine if it is valid. However, over-flagging of some words and under-flagging of typographical errors can occur due to the large number of characters in the Japanese language and the non-delimited nature of Japanese text.
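The five error types listed above can be made concrete with a small sketch that compares an intended word with what was typed. The function `classify_typo` and its return labels are hypothetical, chosen only for illustration; the sketch handles single-edit errors only and reports anything else as "unknown".

```python
def classify_typo(intended: str, typed: str) -> str:
    """Name the single-edit error that turns `intended` into `typed`."""
    if intended == typed:
        return "none"
    n, m = len(intended), len(typed)
    if n == m:
        diffs = [i for i in range(n) if intended[i] != typed[i]]
        if len(diffs) == 1:
            return "wrong character"
        # Two adjacent positions whose characters are swapped.
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and intended[diffs[0]] == typed[diffs[1]]
                and intended[diffs[1]] == typed[diffs[0]]):
            return "transposed characters"
        return "unknown"
    if m == n - 1:
        # Deleting one character from the intended word yields the typo.
        for i in range(n):
            if intended[:i] + intended[i + 1:] == typed:
                return "missing character"
        return "unknown"
    if m == n + 1:
        # Deleting one character from the typo yields the intended word.
        for i in range(m):
            if typed[:i] + typed[i + 1:] == intended:
                repeats_left = i > 0 and typed[i] == typed[i - 1]
                repeats_right = i + 1 < m and typed[i] == typed[i + 1]
                if repeats_left or repeats_right:
                    return "duplicate character"
                return "extra character"
        return "unknown"
    return "unknown"
```

Applied to the examples in the text, `classify_typo("hello", "helol")` gives "transposed characters" and `classify_typo("hello", "heello")` gives "duplicate character".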
Another approach to typographical checking uses a heuristic pattern-match. In this approach, rules are used to identify frequent typographical mistakes. In this approach, though, there is often under-flagging of typographical errors because these errors cannot be easily classified into groups when written in the Japanese language.
Yet another approach to typographical checking uses a statistical likelihood of occurrence. This approach uses a large trained corpus of text to compute a probability of whether any given string of characters is well-formed. This approach suffers from requiring a significant investment in training corpora which often contain typographical errors themselves. In addition, because there are an infinite number of sentences in the Japanese language, it is very difficult to robustly model well-formed strings using this approach.
Therefore, there is a need in the art for an improved method for identifying typographical errors in Japanese text and generating replacement strings for malformed text. An acceptable Japanese language solution should be small enough (in terms of memory requirements) and fast enough to perform satisfactorily in a desktop computer environment.
SUMMARY OF THE INVENTION
The present invention satisfies the above-described needs by providing an improved method for detecting typographical errors and generating replacement strings in documents containing Japanese text. The present invention employs a bottom-up approach utilizing a dictionary, heuristics and probability analysis to determine whether a typographical error exists and then utilizes heuristics, finite-state morphology and a dictionary to generate a replacement string.
Generally described, the present invention parses a Japanese sentence using morpho-lexical analysis. The result of the morpho-lexical analysis is a list of valid phrases that are contained in the Japanese sentence and a cost associated with each phrase. The phrase corresponds to the standard phonological unit, called bunsetsu, taught in Japanese schools. The present invention operationally defines a phrase as one or more dictionary words (in their stem or non-conjugated form) prefixed and/or postfixed with zero or more morphemes. Since the phrase is constructed from morphemes and dictionary words (lexical entries), the analysis is described as morpho-lexical in nature. The cost associated with each phrase is derived from the probability that each word and morpheme making it up, and the combination thereof, constitute the intended analysis of the corresponding set of characters in the input sentence. The present invention receives the valid phrases and their associated costs from the morpho-lexical analysis. The valid phrases are then combined in such a way as to create all possible non-overlapping sets of phrases in efforts to find one such set that represents the entire string of characters in the input sentence. For simplicity, these sets of non-overlapping phrases are referred to as phrase lists. When the phrases are combined, their respective costs are also combined, resulting in a summed associated cost for the phrase list. If any phrase list spans the input sentence, i.e., the phrase list exactly duplicates the input sentence, no typographical error exists and processing ceases.
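The search for a spanning phrase list can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Phrase` type and `best_spanning_list` function are hypothetical, each phrase is reduced to a character span with a cost (lower meaning more likely), and dynamic programming over character positions is substituted for explicitly enumerating every non-overlapping set.

```python
from typing import NamedTuple, Optional


class Phrase(NamedTuple):
    start: int    # index of the first character covered
    end: int      # index one past the last character covered
    cost: float   # lower cost = more plausible analysis (assumed convention)


def best_spanning_list(
    length: int, phrases: list[Phrase]
) -> Optional[tuple[float, list[Phrase]]]:
    """Return (summed cost, phrase list) for the cheapest set of
    non-overlapping phrases covering characters [0, length) exactly,
    or None if no phrase list spans the sentence."""
    # best[i] holds the cheapest way to cover characters [0, i).
    best: dict[int, tuple[float, list[Phrase]]] = {0: (0.0, [])}
    for pos in range(length):
        if pos not in best:
            continue  # no phrase list reaches this position
        cost_so_far, chosen = best[pos]
        for p in phrases:
            if p.start == pos and p.end > p.start:
                total = cost_so_far + p.cost
                if p.end not in best or total < best[p.end][0]:
                    best[p.end] = (total, chosen + [p])
    return best.get(length)
```

A non-`None` result corresponds to the case where a phrase list spans the input sentence and no typographical error is reported; `None` corresponds to the case where holes must be located.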
If no spanning phrase list exists, then the phrase list containing the lowest combined associated cost, i.e., the phrase list having the combined associated cost signifying that it is most representative of the input sentence, is selected. Using the selected phrase list, any “holes” are determined. A hole is a character, or set of characters, that is found in the input sentence but not in the selected phrase list. In other words, the hole is a character or set of characters where the selected phrase list does not span the input sentence, corresponding to a gap in the analysis. The hole is where a typographical error, if any, exists within the input sentence. For one aspect of the present invention, the hole is checked to determine if any part of it can be analyzed, using the morpho-lexical process, as a valid phrase when an extended dictionary is enabled. In addition, rules are applied to determine if any part of the hole can be analyzed as a proper noun. The hole may be “relaxed” by adding contiguous characters next to the hole from the input sentence and rechecking the “relaxed” hole in the same way as above, i.e., by enabling an extended dictionary and performing a secondary morpho-lexical analysis and by applying a set of proper noun rules.
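Hole detection and relaxation can be sketched in a few lines. The functions `find_holes` and `relax` are hypothetical names; spans are assumed to be half-open character ranges, and widening the hole by one character on each side is an assumption made for illustration, since the text does not say how many contiguous characters are added.

```python
def find_holes(length: int, spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return the character ranges of the sentence [0, length) that are
    not covered by any span in the selected phrase list."""
    holes = []
    pos = 0  # first character not yet known to be covered
    for start, end in sorted(spans):
        if start > pos:
            holes.append((pos, start))
        pos = max(pos, end)
    if pos < length:
        holes.append((pos, length))
    return holes


def relax(hole: tuple[int, int], length: int, by: int = 1) -> tuple[int, int]:
    """Widen a hole by `by` neighbouring characters on each side
    (assumed policy), clamped to the sentence boundaries."""
    start, end = hole
    return (max(0, start - by), min(length, end + by))
```

For an eight-character sentence with phrases covering `(0, 3)` and `(5, 7)`, `find_holes` returns `[(3, 5), (7, 8)]`, and relaxing the first hole yields `(2, 6)`.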
A replacement string is then generated for the hole. The replacement string is generated using heuristics (rules) intended to counteract the process by which the error was created. The rules match patterns associated with certain types of errors and make appropriate changes to correct those errors, associating a cost with each correction. The replacement candidates thus generated then undergo morpho-lexical analysis and are ranked according to the combination of their associated costs. All candidates which score better than a certain threshold value, i.e., have a lower cost, are presented to the user as potential replacements.
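The generate-then-rank flow described above might look like the following sketch. Everything here is an illustrative assumption: the two rules, their costs, and the names `undo_transposition`, `undo_duplicate`, and `rank_replacements` are hypothetical, and a plain dictionary mapping valid phrases to costs stands in for the morpho-lexical analysis.

```python
from typing import Iterator


def undo_transposition(text: str) -> Iterator[tuple[str, float]]:
    """Rule: swap each pair of differing adjacent characters (assumed cost 1.0)."""
    for i in range(len(text) - 1):
        if text[i] != text[i + 1]:
            yield text[:i] + text[i + 1] + text[i] + text[i + 2:], 1.0


def undo_duplicate(text: str) -> Iterator[tuple[str, float]]:
    """Rule: drop one character of each doubled pair (assumed cost 0.5)."""
    for i in range(len(text) - 1):
        if text[i] == text[i + 1]:
            yield text[:i] + text[i + 1:], 0.5


RULES = [undo_transposition, undo_duplicate]


def rank_replacements(
    hole_text: str, lexicon: dict[str, float], threshold: float
) -> list[tuple[str, float]]:
    """Return (candidate, combined cost) pairs under the threshold,
    cheapest first. `lexicon` is a stand-in for morpho-lexical analysis:
    it maps each valid phrase to its analysis cost."""
    scored: dict[str, float] = {}
    for rule in RULES:
        for candidate, rule_cost in rule(hole_text):
            if candidate not in lexicon:
                continue  # candidate does not analyze as a valid phrase
            total = rule_cost + lexicon[candidate]
            if total < threshold and total < scored.get(candidate, float("inf")):
                scored[candidate] = total
    return sorted(scored.items(), key=lambda item: (item[1], item[0]))
```

With `lexicon = {"hello": 0.25}`, the hole text "heello" yields the single candidate "hello" at cost 0.75 under a threshold of 1.0, whereas "helol" yields nothing at that threshold because its transposition fix costs 1.25.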
The advantage of the bottom-up approach applied by the present invention for identifying typographical errors is that it greatly reduces the number of searching tasks, and consequently processing, required to find a typographical error. This reduced processing thereby increases performance and efficiency of typographical error che
Critchlow Richard Lee
Halstead Patrick H.
Edouard Patrick N.
Kelly Joseph R.
Microsoft Corporation
Westman Champlin & Kelly P.A.
Profile ID: LFUS-PAI-O-2964048