Consistency checker for documents containing Japanese text
Reexamination Certificate
1998-06-24
2001-01-16
Alam, Hosain T. (Department: 2771)
Data processing: database and file management or data structures
Database design
Data structure types
C704S010000, C434S156000
active
06175834
ABSTRACT:
TECHNICAL FIELD
The present invention relates generally to word processing systems and more particularly to the identification of inconsistently spelled words within a document that contains Japanese text.
BACKGROUND OF THE INVENTION
Computer users are accustomed to using “checking” program modules (e.g., spell checkers, grammar checkers, and consistency checkers) for alerting users to words within a document that are questionable based on some predefined set of rules. For example, if a word is found in a document, but is not found in a spell checker's dictionary, then the word can be marked to indicate that it is questionable. Similarly, if a correctly spelled word is found in the spell checker's dictionary, but its spelling is inconsistent with other variants of the word in the same document (e.g., color and colour), then the lesser-used variant (or all of the variants) might be marked as questionable.
Japanese language consistency checkers are typically more complex than English language consistency checkers because Japanese consistency checkers must accommodate multiple acceptable spelling variants of a particular word. Typically, a document of Japanese text employs more than one writing system, with each system having a unique character set. The most commonly used Japanese writing systems are Kanji, Hiragana, and Katakana. Kanji is a writing system composed of pictographic characters, mostly derived from Chinese writing systems. Hiragana is a writing system that is phonetic in nature and shares no common characters with Kanji. Katakana is another phonetic writing system that is primarily used for writing words borrowed from Western languages, and also shares no common characters with Kanji. Kanji pictographs are analogous to shorthand variants of Hiragana words in that any Kanji word can be written in Hiragana, though the converse is not true. A single Japanese word can include characters from more than one writing system. For example, a correctly spelled word may be written using two Kanji characters, one Kanji character followed by two Hiragana characters, or by four Hiragana characters. In short, the challenge presented to consistency checking programs by documents containing Japanese text is that a variety of words can be acceptable variants of one another. Therefore, a Japanese word consistency checker must be complex in order to accommodate all acceptable variants.
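As a rough illustration of how a single word can mix writing systems, the sketch below labels each character by its Unicode block. The function names are hypothetical and not part of the patent, and only the basic Hiragana, Katakana, and CJK Unified Ideographs blocks are covered, so rarer characters fall into "Other".

```python
def script_of(ch):
    """Classify one character by Unicode block (basic blocks only)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "Hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "Katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "Kanji"
    return "Other"


def scripts_in(word):
    """Return the script label of every character in the word, in order."""
    return [script_of(ch) for ch in word]


# Three spellings of the same word ("child"), each mixing scripts differently.
for spelling in ("子供", "子ども", "こども"):
    print(spelling, scripts_in(spelling))
```

The three spellings are all acceptable, which is why a consistency checker cannot simply compare character strings.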
A problem with currently available Japanese consistency checkers is that they do not provide a sufficient means for generating all of the common Japanese spelling variants. Because a document employing more than one Japanese writing system may include many acceptable word variants, the user may desire to be prompted when a word has been spelled inconsistently with other occurrences of the same word, that is, when one variant differs from the others used in the same document. Currently available Japanese consistency checkers rely on manual variant generation and therefore risk overlooking common spelling variants.
Accordingly, there is a need for a Japanese language consistency checker that is capable of providing a method for identifying and generating substantially all acceptable spelling variants of a particular Japanese word. The Japanese language consistency checker should also be capable of identifying spelling variants that are used inconsistently with other spelling variants in the same document. The consistency checker should also be capable of maintaining statistics of spelling variant uses within a particular document, thereby enabling the consistency checker to identify lesser-used variants.
SUMMARY OF THE INVENTION
The present invention satisfies the above-described needs by providing an improved method for generating common Japanese spelling variants and for checking for inconsistent spellings among words in a document containing Japanese text. The present invention provides a method for breaking a word down into reading units, which are similar to syllables, and associating the reading units with reading pairs, which identify acceptable variants of the reading unit. By accessing a Reading Pair Database (RPD), the reading units of a particular word can be represented by Reading Pair Identification Numbers (RIDs). By representing the words within a document as RID arrays, the words can be mapped onto a Condensed Lexicon Database (CLD) in order to verify the RID array and generate a Sense Identification Number (SID). The SID provides a means by which spelling variants can be normalized. Normalization is accomplished by assigning all words that are spelling variants of one another the same SID. Inconsistent words are those words that belong to the same SID set (i.e., have the same SID), but have different spellings from other words in the SID set. These inconsistently spelled words are assigned Spelling Variant Identification Numbers (SVIDs) that are unique within the SID set.
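The normalization flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the toy RPD and CLD contents, the SID value 100, the function names, and the greedy longest-match segmentation are all hypothetical, since the summary does not say how reading units are isolated.

```python
from collections import defaultdict

# Hypothetical toy Reading Pair Database: RID -> (Kanji unit, Hiragana unit).
RPD = {1: ("子", "こ"), 2: ("供", "ども")}
# Hypothetical toy Condensed Lexicon Database: RID array -> SID.
CLD = {(1, 2): 100}

# Reverse index from any surface form of a reading unit to its RID.
SURFACE_TO_RID = {form: rid for rid, pair in RPD.items() for form in pair}
MAX_UNIT_LEN = max(len(form) for form in SURFACE_TO_RID)


def to_rid_array(word):
    """Segment a word into reading units (greedy longest match, an assumption)
    and return its RID array, or None if some part matches no reading unit."""
    rids, pos = [], 0
    while pos < len(word):
        for length in range(min(MAX_UNIT_LEN, len(word) - pos), 0, -1):
            rid = SURFACE_TO_RID.get(word[pos:pos + length])
            if rid is not None:
                rids.append(rid)
                pos += length
                break
        else:
            return None  # no reading unit starts at this position
    return tuple(rids)


def normalize(words):
    """Map each word to (SID, SVID), or None if it cannot be mapped onto the CLD.
    SVIDs are numbered per SID set in order of first appearance."""
    spellings_by_sid = defaultdict(dict)  # SID -> {spelling: SVID}
    result = {}
    for word in words:
        rid_array = to_rid_array(word)
        sid = CLD.get(rid_array) if rid_array is not None else None
        if sid is None:
            result[word] = None
            continue
        spellings = spellings_by_sid[sid]
        svid = spellings.setdefault(word, len(spellings) + 1)
        result[word] = (sid, svid)
    return result


print(normalize(["子供", "子ども", "こども", "子供"]))
```

All three spellings share one SID, so they are recognized as the same word, while their distinct SVIDs record that the document spells it three different ways.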
The reverse process is utilized for the generation of Japanese spelling variants. Specifically, after a word is parsed into a RID array, all of that word's spelling variants can be generated by varying each reading unit in the RID array. The generation process provides a complete list of spelling variants which can be compiled into the CLD, for subsequent use in identifying inconsistent occurrences (i.e., spelling variants) of the same words. Because all of the generated spelling variants are assigned the same SID, the identification process is significantly simplified. Statistics are maintained on the existence and number of occurrences of spelling variants within a document by incrementing count values corresponding to each SVID.
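Variant generation and the per-SVID statistics can be sketched in a few lines. The toy RPD contents, the function names, and the numbering of SVIDs by generation order are assumptions made for illustration only. Note that a plain Cartesian product over-generates (it produces "こ供", which is not a spelling in actual use); how such candidates are filtered before compilation into the CLD is not covered by this sketch.

```python
from collections import Counter
from itertools import product

# Hypothetical toy Reading Pair Database: RID -> (Kanji unit, Hiragana unit).
RPD = {1: ("子", "こ"), 2: ("供", "ども")}


def generate_variants(rid_array):
    """Vary every reading unit in the RID array over both forms of its pair."""
    return ["".join(units) for units in product(*(RPD[rid] for rid in rid_array))]


def count_by_svid(document_words, variants):
    """Increment a count for each SVID seen in the document.
    SVIDs are assigned here by position in the generated list (an assumption)."""
    svid_of = {spelling: svid for svid, spelling in enumerate(variants, start=1)}
    return Counter(svid_of[w] for w in document_words if w in svid_of)


def lesser_used_svids(counts):
    """Return the SVIDs that occur less often than the most frequent one."""
    if not counts:
        return []
    top = max(counts.values())
    return sorted(svid for svid, n in counts.items() if n < top)


variants = generate_variants((1, 2))
counts = count_by_svid(["子供", "子供", "こども", "子供", "子ども"], variants)
print(variants, dict(counts), lesser_used_svids(counts))
```

With these counts, the spellings behind the lesser-used SVIDs are the natural candidates to flag for the user.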
In one aspect of the invention, a method is provided for checking the consistency of a plurality of words contained in a word list. By isolating reading units within each word, assigning each reading unit a RID (by reference to the RPD) and reforming each word as a RID array, the word can be mapped onto the CLD. Successfully mapping a word (in RID array form) onto the CLD generates the SID that is assigned to the word and permits the normalization of all words having the same SID. Normalization is further enhanced by assigning an SVID to each word, which identifies a particular spelling of each spelling variant having the same SID. A Reply Message is generated, reporting the success or failure of the attempt to map the word onto the CLD.
In another aspect of the invention, a data structure containing the RPD is provided. The RPD data structure contains three types of data. The first type of data is a plurality of RIDs, each RID identifying a pair of reading units. The second type of data is the set of Kanji reading units constituting the reading pairs. The third type of data is the set of Hiragana reading units constituting the reading pairs. Each RID corresponds to a Kanji reading unit and a Hiragana reading unit, which are equivalent to each other.
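One possible in-memory layout for such a structure is sketched below. The class and method names are hypothetical, and the reverse index from reading unit to RIDs is an added convenience rather than something the summary specifies. The example entries also assume that one Hiragana unit may appear in more than one pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadingPair:
    rid: int        # Reading Pair Identification Number
    kanji: str      # Kanji reading unit
    hiragana: str   # equivalent Hiragana reading unit


class ReadingPairDatabase:
    """Toy RPD holding the three kinds of data: RIDs, Kanji units, Hiragana units."""

    def __init__(self, pairs):
        self._by_rid = {p.rid: p for p in pairs}
        self._by_unit = {}  # reading unit (either script) -> list of RIDs
        for p in pairs:
            self._by_unit.setdefault(p.kanji, []).append(p.rid)
            self._by_unit.setdefault(p.hiragana, []).append(p.rid)

    def pair(self, rid):
        """Return the reading pair identified by this RID."""
        return self._by_rid[rid]

    def rids_for(self, unit):
        """Return every RID whose pair contains this reading unit."""
        return list(self._by_unit.get(unit, []))


rpd = ReadingPairDatabase([
    ReadingPair(1, "子", "こ"),
    ReadingPair(2, "供", "ども"),
    ReadingPair(3, "小", "こ"),  # same Hiragana unit, different Kanji unit
])
print(rpd.pair(2), rpd.rids_for("こ"))
```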
In yet another aspect of the invention, a method of creating the RPD is provided. By comparing lists of Japanese words, reading units from various character sets can be isolated and associated with equivalent reading units from other character sets. The associated reading units can be stored as reading pairs and assigned a RID. A multi-pass approach to generating the reading pairs and associated RIDs permits the elimination of errant or low-occurrence reading pairs, in favor of well-established and high-occurrence reading pairs.
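The multi-pass idea can be illustrated with a deliberately simplified sketch. Everything here is an assumption for illustration: the input format (a Kanji spelling paired with its Hiragana reading), the prefix-peeling alignment, the `min_count` threshold, and the function name. The summary does not describe the actual comparison procedure.

```python
from collections import Counter


def build_reading_pairs(word_list, min_count=2, max_passes=10):
    """Derive (Kanji unit, Hiragana unit) pairs from (spelling, reading) entries.

    Each pass peels already-seen pairs off the front of longer words, which
    exposes new single-character pairs and adds support to existing ones.
    Pairs seen fewer than min_count times are dropped at the end.
    """
    counts = Counter()
    for _ in range(max_passes):
        tentative = sorted(counts)  # pairs seen in the previous pass
        new_counts = Counter()
        for kanji_word, reading in word_list:
            k, r = kanji_word, reading
            peeled = True
            while len(k) > 1 and peeled:
                peeled = False
                for ku, hu in tentative:
                    if k.startswith(ku) and r.startswith(hu) and len(r) > len(hu):
                        new_counts[(ku, hu)] += 1
                        k, r = k[len(ku):], r[len(hu):]
                        peeled = True
                        break
            if len(k) == 1 and r:
                new_counts[(k, r)] += 1  # one Kanji character left: a candidate pair
        if new_counts == counts:
            break  # another pass would change nothing
        counts = new_counts
    kept = sorted(pair for pair, n in counts.items() if n >= min_count)
    return {rid: pair for rid, pair in enumerate(kept, start=1)}


# Toy word list; ("犬", "けん") stands in for a low-occurrence pair.
words = [
    ("子", "こ"),
    ("犬", "いぬ"),
    ("犬", "けん"),
    ("子供", "こども"),
    ("子犬", "こいぬ"),
]
print(build_reading_pairs(words))
```

In this tiny list the pairs for 子 and 犬 (read いぬ) gain support from the compound words and survive, while the pairs seen only once are eliminated. A realistic word list would be far larger, and the threshold would be tuned accordingly.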
The various aspects of the present invention may be more clearly understood and appreciated from a review of the following detailed description of the disclosed embodiments and by reference to the appended drawings.
REFERENCES:
patent: 5258909 (1993-11-01), Damerau et al.
patent: 5321801 (1994-06-01), Ando
patent: 5448474 (1995-09-01), Zamora
patent: 5535119 (1996-07-01), Ito et al.
patent: 5634066 (1997-05-01), Takehara et al.
patent: 5634134 (1997-05-01), Kumai et al.
patent: 5715469 (1998-02-01), Arning
patent: 5963893 (1999-10-01), Halstead, Jr. et al.
patent: 0287713A1 (1988-10-01), None
Cai Patrick Pei
Halstead Patrick H.
Alam Hosain T.
Fleurantin Jean Bolte
Kirkpatrick & Stockton LLP
Microsoft Corporation