Lattice and method for identifying and normalizing...

Image analysis – Pattern recognition – Ideographic characters

Reexamination Certificate


Details

C382S229000, C707S793000

active

06731802

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to a computer-based method for identifying text. More particularly, the present invention relates to a lattice and method for identifying and normalizing orthographic variations in Japanese text.
BACKGROUND OF THE INVENTION
Word segmentation refers to the process of identifying the individual words that make up an expression of language, such as text. Word segmentation is useful for checking spelling and grammar, synthesizing speech from text, performing natural language parsing and understanding, and searching a collection of documents for specific words or phrases, all of which benefit from an identification of individual words.
Performing word segmentation of English text is rather straightforward, since spaces and punctuation marks generally delimit the individual words in the text. In Japanese text, however, word boundaries are implicit rather than explicit. That is, Japanese text typically does not include spaces or punctuation between words. Therefore, segmentation cannot be performed in the same manner as English word segmentation. Other characteristics of Japanese text further complicate the matter. For example, potential word candidate records may overlap (causing ambiguities for the parser) or there may be gaps where no suitable record is found (causing a broken span). Also, the language includes four different scripts that are in common use: kanji, hiragana, katakana and roman. Furthermore, these different scripts can be mixed within lexical entries. Additionally, many Japanese words have a variety of acceptable spellings and certain characters are optional.
Existing segmenting methods involve adding orthographic variations to the lexicon as they are encountered (requiring a long-term maintenance commitment), or lexicalizing all possible variations (requiring a much larger lexicon). An accurate and efficient approach to automatically performing Japanese word segmentation would have significant utility.
The present invention provides a solution to this and other problems and offers other advantages over the prior art.
SUMMARY OF THE INVENTION
The present invention relates to a lattice and method for identifying and normalizing orthographic variations in Japanese text.
One embodiment of the present invention is directed to a computer-readable medium having stored thereon a data structure that includes multiple data fields collectively representing a Japanese lexical entry. The multiple data fields include a plurality of multi-form data fields. Each multi-form data field is capable of holding data representing a word element of the lexical entry. Each multi-form data field includes two subfields. The first subfield contains data representing a primary form of the corresponding word element. The second subfield contains data representing an alternate form of the corresponding word element.
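The data structure described above can be pictured with a minimal sketch. The class names `MultiFormField` and `LexicalEntry`, and the example word 子供 ("child") with its hiragana alternates, are illustrative assumptions and do not come from the patent text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MultiFormField:
    """One word element of a lexical entry (hypothetical name)."""
    primary: str    # first subfield: primary form of the word element
    alternate: str  # second subfield: alternate form of the word element


@dataclass(frozen=True)
class LexicalEntry:
    """A lexical entry held as a sequence of multi-form fields (hypothetical name)."""
    fields: Tuple[MultiFormField, ...]


# Illustrative entry: each kanji is paired with a hiragana alternate.
entry = LexicalEntry((
    MultiFormField(primary="子", alternate="こ"),
    MultiFormField(primary="供", alternate="ども"),
))
```

Holding both subfields in one record means a single stored entry can stand for every mixed spelling of the word, rather than listing each spelling separately in the lexicon.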
In an illustrative embodiment of the invention, the data structure includes a lattice of the form:
[W:ab][X:c] . . . [Y:def]
where W, X and Y each represent a primary-orthography character; a, b, c, d, e, and f each represent an alternate-orthography character; ab, c, and def are the alternate representations of W, X and Y, respectively; and the lattice as a whole represents a plurality of orthographic forms of the lexical entry.
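To make the notation concrete, the sketch below parses a string written in the bracketed form and lists every orthographic form the lattice stands for. The function names `parse_lattice` and `all_forms`, the textual encoding of the lattice as a string, and the example word are assumptions made for illustration; the patent text does not specify a parser.

```python
import itertools
import re

# One bracketed element: [primary:alternate]
ELEMENT = re.compile(r"\[([^:\[\]]+):([^:\[\]]+)\]")


def parse_lattice(text):
    """Return a list of (primary, alternate) pairs from '[W:ab][X:c]' notation."""
    elements = ELEMENT.findall(text)
    # Reject input that contains anything other than well-formed elements.
    if "".join(f"[{p}:{a}]" for p, a in elements) != text:
        raise ValueError(f"not a well-formed lattice: {text!r}")
    return elements


def all_forms(elements):
    """Enumerate every spelling obtained by picking one form per element."""
    return ["".join(choice) for choice in itertools.product(*elements)]


lattice = parse_lattice("[子:こ][供:ども]")
print(all_forms(lattice))
```

A lattice with n elements covers 2^n spellings, which is why storing the lattice is more compact than lexicalizing each variation.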
Another embodiment of the present invention is directed to a method of normalizing orthographic variations in the Japanese language. According to this method, an orthography lattice is maintained for each of multiple lexical entries. Each lattice represents a plurality of orthographic forms of the lexical entry and includes at least one word-element representation representing multiple forms of a word element of the lexical entry. Each word-element representation includes a primary form of the word element and an alternate form of the word element. Each lattice is normalized to produce a normalized form that includes the primary form of each word-element representation of the lattice and that does not include the alternate form of each word-element representation.
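Normalization as described here reduces to keeping the primary form of every word element. A minimal sketch follows, assuming a lattice is held as a list of (primary, alternate) pairs; the function name `normalize`, the lexicon keys, and the example entries are hypothetical.

```python
def normalize(lattice):
    """Join the primary form of each word element; alternates are dropped."""
    return "".join(primary for primary, _alternate in lattice)


# Hypothetical lexicon: one lattice per lexical entry.
lexicon = {
    "child": [("子", "こ"), ("供", "ども")],
    "book": [("本", "ほん")],
}

normalized = {key: normalize(lattice) for key, lattice in lexicon.items()}
print(normalized)
```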
Another embodiment of the present invention is directed to a method of segmenting Japanese text. According to the method, an orthography lattice is stored for each of multiple lexical entries. Each lattice represents a plurality of orthographic forms of the lexical entry and includes at least one word-element representation representing a plurality of forms of a word element of the lexical entry. Each word-element representation includes a primary form of the word element and an alternate form of the word element. A sequence of input characters is received and the input sequence is evaluated against the plurality of lattices. If any orthographic form of one of the lexical entries is present in the input sequence, a normalized form of that lexical entry is generated that comprises the primary form of each word-element representation of the lattice corresponding to the entry and that does not include the alternate form of each word-element representation.
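One way to evaluate an input sequence against a set of lattices is to turn each lattice into a regular expression with one alternation per word element. This is an assumed technique chosen for brevity, not the mechanism the patent describes; the names `lattice_pattern` and `find_entries` and the sample text are illustrative.

```python
import re


def lattice_pattern(lattice):
    """Build a regex matching any orthographic form of the lattice."""
    return re.compile(
        "".join(f"(?:{re.escape(p)}|{re.escape(a)})" for p, a in lattice)
    )


def find_entries(text, lattices):
    """Return (start, surface form, normalized form) for each entry found."""
    hits = []
    for lattice in lattices:
        normalized = "".join(primary for primary, _ in lattice)
        for match in lattice_pattern(lattice).finditer(text):
            hits.append((match.start(), match.group(), normalized))
    return sorted(hits)


lattices = [[("子", "こ"), ("供", "ども")], [("本", "ほん")]]
print(find_entries("子どもがほんをよむ", lattices))
```

Whatever mixed spelling appears in the input, the caller receives the same normalized form, so downstream components need to know only one spelling per entry.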
Another embodiment of the present invention is directed to another method of segmenting Japanese text. According to the method, an orthography lattice is stored for each of multiple lexical entries. Each lattice represents a plurality of orthographic forms of the lexical entry. Each lattice includes at least one word-element representation. Each word-element representation represents multiple different forms of the corresponding word element of the lexical entry. Each word-element representation can include a primary form of the word element and an alternate form of the word element. A character input that is part of an input string is received. The received character input is compared to the first word-element representation of each lattice. If the received character input matches either the primary form or the alternate form of the first word-element representation of a particular lattice, the subsequent characters in the input string are compared to further word-element representations in the particular lattice in order to ascertain whether any orthographic forms of the lexical entry corresponding to the particular lattice are present in the input string beginning with the received character input. In an illustrative aspect of this embodiment of the invention, if any orthographic forms of the lexical entry corresponding to the particular lattice are present in the input string beginning with the received character input, a normalized representation of the lexical entry is generated that includes the primary form of each word-element representation of the lattice and that does not include the alternate form of each word-element representation.
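The element-by-element comparison in this embodiment can be sketched as a walk through the lattice from a given position in the input string. The names `match_at` and `candidates_at` are hypothetical, and the sketch makes a simplifying assumption: it tries the primary form first and does not backtrack, which is adequate when the primary and alternate forms of an element are in different scripts and so cannot both match at the same position.

```python
def match_at(text, start, lattice):
    """Return the end index if a form of the lattice begins at text[start], else None."""
    pos = start
    for primary, alternate in lattice:
        if text.startswith(primary, pos):
            pos += len(primary)
        elif text.startswith(alternate, pos):
            pos += len(alternate)
        else:
            return None  # this word element matches neither form
    return pos


def candidates_at(text, start, lattices):
    """Return (surface form, normalized form) for each lattice matching at start."""
    results = []
    for lattice in lattices:
        end = match_at(text, start, lattice)
        if end is not None:
            normalized = "".join(primary for primary, _ in lattice)
            results.append((text[start:end], normalized))
    return results


lattices = [[("子", "こ"), ("供", "ども")], [("本", "ほん")]]
print(candidates_at("こ供がいる", 0, lattices))
```

A lattice whose first word element matches neither form is rejected immediately, so only lattices that agree with the received character are compared against the subsequent characters.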
Another embodiment of the present invention is directed to yet another method of segmenting Japanese text. According to this method, an orthography lattice is stored for each of a plurality of lexical entries. An all-alternate-orthography form is also stored for each lexical entry. Each all-alternate-orthography form consists exclusively of alternate orthography characters and does not contain any primary orthography characters. An input character that is part of an input string of characters is received. It is determined whether the received input character is a primary orthography character or an alternate orthography character. If the received input character is an alternate orthography character, the input character is compared to the first character of each stored all-alternate-orthography form. Then, if the input character matches the first character of a particular all-alternate-orthography form, subsequent characters in the input string are compared to further characters in the particular all-alternate-orthography form. In this way, it is ascertained whether the all-alternate-orthography form of the corresponding lexical entry is present in the input string beginning with the received input character. If, on the other hand, the received input character is a primary orthography character, the input chara

