Reexamination Certificate
1997-03-31
2001-01-23
Isen, Forester W. (Department: 2747)
Data processing: speech signal processing, linguistics, language
Linguistics
C704S009000
active
06178396
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a word/phrase classification processing method, phrase extraction method, word/phrase classification processing apparatus, speech recognition apparatus, machine translation apparatus, phrase extraction apparatus, and a word/phrase storage medium. Particularly, the present invention is suitable for extracting phrases from text data and automatically classifying words and phrases.
2. Description of the Related Art
One type of conventional word classification processing apparatus automatically classifies single words by statistically processing the words used in text data, and performs speech recognition and machine translation using the classification result, as described, for example, in Brown, P., Della Pietra, V., deSouza, P., Lai, J., and Mercer, R. (1992), “Class-Based n-gram Models of Natural Language”, Computational Linguistics, Vol. 18, No. 4, pp. 467-479.
However, conventional word classification processing apparatuses cannot automatically classify words and phrases together, and therefore cannot perform speech recognition or machine translation using the correspondence or similarity between a word and a phrase or between phrases. As a result, they cannot perform speech recognition or machine translation accurately.
SUMMARY OF THE INVENTION
A first object of the present invention is to provide a word/phrase classification processing apparatus, and a method thereof, which can automatically classify words and phrases as one block.
A second object of the present invention is to provide a phrase extraction apparatus which can extract a phrase from a large amount of text data at a high speed.
A third object of the present invention is to provide a speech recognition apparatus which can perform accurate speech recognition using the correspondence or similarity between a word and a phrase or between phrases.
A fourth object of the present invention is to provide a machine translation apparatus which can perform accurate machine translation using the correspondence or similarity between a word and a phrase or between phrases.
To attain the first object described above, according to the present invention, the words and phrases included in text data are classified together to generate classes in which words and phrases exist together.
With such a class, not only words but also a word and a phrase, or several phrases, can be classified as one block, so that the correspondence or similarity between the word and the phrase or between the phrases can be identified easily.
Furthermore, according to an embodiment of the present invention, a one-dimensional sequence of word classes is generated by mapping each word in the one-dimensional sequence of words included in text data to the word class into which it has been classified. Then, a word class sequence in which every degree of stickiness between contiguous word classes is equal to or greater than a predetermined value is extracted from the one-dimensional sequence of word classes of the text data, and a token is attached to it. After the words and tokens are classified together, the word class sequence corresponding to each token is replaced with a phrase belonging to that word class sequence.
As described above, a token is attached to a word class sequence so that the sequence is regarded as one word. Because a word included in the text data and a word class sequence with a token attached are handled equally, classification processing can be performed without distinguishing between words and phrases. Additionally, because a phrase is extracted based on the degree of stickiness between contiguous word classes in the one-dimensional sequence of word classes, the phrase can be extracted from the text data at high speed.
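The token mechanism can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the `<T0>`-style token format, and the toy data are assumptions made for illustration, and the joint classification step itself is left out (its result is supplied by hand).

```python
def attach_tokens(words, word_class, class_chains):
    """Replace every occurrence of a listed word class sequence with one token.

    word_class maps each word to its class; class_chains is a list of
    class tuples already judged "sticky". Both are assumed inputs.
    """
    token_of = {chain: f"<T{i}>" for i, chain in enumerate(class_chains)}
    phrases_of = {tok: set() for tok in token_of.values()}
    out, i = [], 0
    while i < len(words):
        for chain, tok in token_of.items():
            span = words[i:i + len(chain)]
            if len(span) == len(chain) and tuple(word_class[w] for w in span) == chain:
                out.append(tok)                    # the sequence now acts as one word
                phrases_of[tok].add(" ".join(span))
                i += len(chain)
                break
        else:
            out.append(words[i])                   # ordinary word, kept as-is
            i += 1
    return out, phrases_of


def expand_tokens(classes, phrases_of):
    """After joint classification, swap each token for the phrases it stood for."""
    return [set().union(*(phrases_of.get(m, {m}) for m in cls)) for cls in classes]


words = ["fly", "to", "new", "york", "or", "boston"]
word_class = {"fly": "V", "to": "P", "new": "A", "york": "B", "or": "C", "boston": "N"}
tokenized, phrases_of = attach_tokens(words, word_class, [("A", "B")])
# Hypothetical output of a joint word/token classification step:
joint_classes = [{"<T0>", "boston"}, {"fly"}]
print(expand_tokens(joint_classes, phrases_of))
```

Because the token occupies a single position in the tokenized text, any word clustering procedure can treat it exactly like a word, which is the point of the equal handling described above.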
Additionally, to attain the second object described above, according to the present invention, each word in the one-dimensional sequence of words included in text data is mapped to the word class into which it has been classified, generating a one-dimensional sequence of word classes. Then, a word class sequence in which every degree of stickiness between contiguous word classes is equal to or greater than a predetermined value is extracted from the one-dimensional sequence of word classes of the text data, and a phrase is extracted by taking out, from the respective word classes constituting the word class sequence, the words that exist contiguously in the text data.
With such a process, a phrase can be extracted based on a word class sequence. Since the number of word classes into which the different words in the text data are classified is smaller than the number of different words, extracting such a word class sequence from the one-dimensional sequence of word classes requires fewer operations and less memory than extracting, from the one-dimensional sequence of words, a word sequence in which every degree of stickiness between contiguous words is equal to or greater than a predetermined value. The phrase extraction process is therefore faster and saves memory resources. Note that a word class sequence may sometimes include a word sequence which does not exist in the one-dimensional sequence of words in the text data. In this case, the respective words existing contiguously in the text data are extracted from the respective word classes constituting the word class sequence, and the extracted words are recognized as a phrase.
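The extraction step can be sketched as follows. The passage above does not define the degree of stickiness, so this sketch assumes pointwise mutual information between adjacent word classes as a stand-in; the function name, the threshold value, and the toy corpus are likewise illustrative assumptions.

```python
import math
from collections import Counter


def extract_phrases(words, word_class, threshold):
    """Return word sequences whose underlying class chain is sticky throughout."""
    classes = [word_class[w] for w in words]        # one-dimensional class sequence
    unigram = Counter(classes)
    bigram = Counter(zip(classes, classes[1:]))
    n_uni = len(classes)
    n_bi = max(len(classes) - 1, 1)

    def stickiness(c1, c2):
        # Assumed measure: pointwise mutual information of adjacent classes.
        p12 = bigram[(c1, c2)] / n_bi
        return math.log2(p12 / ((unigram[c1] / n_uni) * (unigram[c2] / n_uni)))

    phrases, start = set(), 0
    for i in range(1, len(classes) + 1):
        # Close the current chain at the end of the text or at a weak link.
        if i == len(classes) or stickiness(classes[i - 1], classes[i]) < threshold:
            if i - start >= 2:
                # Read the words at these positions, so only word sequences
                # that really occur contiguously in the text are returned.
                phrases.add(tuple(words[start:i]))
            start = i
    return phrases


# Toy corpus: classes A and B always occur together, X and Y vary.
text = "new york wins today los angeles now loses beats san jose today".split()
cls = {"new": "A", "los": "A", "san": "A", "york": "B", "angeles": "B", "jose": "B",
       "wins": "X", "loses": "X", "beats": "X", "today": "Y", "now": "Y"}
print(extract_phrases(text, cls, threshold=2.0))
```

The statistics are kept per class pair rather than per word pair, so the two counters grow with the number of classes, which is the source of the savings in operations and memory described above.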
Furthermore, to attain the third object described above, according to the present invention, speech is recognized by referencing a word/phrase dictionary that classifies the words and phrases in predetermined text data into classes in which words and phrases exist together, and stores those classes.
With such a process, speech recognition can be performed using the correspondence or similarity between a word and a phrase or between phrases, thereby enabling accurate recognition.
Still further, to attain the fourth object described above, according to the present invention, an input original sentence is matched to an original sentence sample stored in a sample sentence book, based on a word/phrase dictionary that classifies the words and phrases in predetermined text data into classes in which words and phrases exist together.
Accordingly, even if the input original sentence is a variation of an original sentence sample stored in the sample sentence book, in which a phrase replaces a word of the sample, the original sentence sample can be applied to the input original sentence, so that machine translation can be performed. Therefore, accurate machine translation using the correspondence or similarity between a word and a phrase or between phrases can be realized.
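The matching of an input sentence to a stored sample can be sketched as below. This is only an illustration under stated assumptions: the word/phrase dictionary is modeled as a plain mapping from a word or phrase to a class label, segmentation uses a greedy longest match, and the function names, the `max_len` parameter, and the example sentences are hypothetical. The translation step that follows the match is not shown.

```python
def to_class_ids(tokens, class_of, max_len=3):
    """Segment tokens into dictionary entries (longest match first) and map
    each entry to its class label; unknown words stand for themselves."""
    ids, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            unit = " ".join(tokens[i:i + n])
            if unit in class_of or n == 1:
                ids.append(class_of.get(unit, unit))
                i += n
                break
    return ids


def find_sample(sentence, samples, class_of):
    """Return the first stored sample with the same class sequence, else None."""
    target = to_class_ids(sentence.split(), class_of)
    for sample in samples:
        if to_class_ids(sample.split(), class_of) == target:
            return sample
    return None


# A word and a phrase stored in the same class (assumed dictionary content).
class_of = {"boston": "CITY", "new york": "CITY"}
samples = ["i fly to boston"]
print(find_sample("i fly to new york", samples, class_of))  # i fly to boston
```

Since "boston" and "new york" share a class, the input is recognized as a variation of the stored sample even though a phrase has replaced a single word.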
REFERENCES:
patent: 5805832 (1998-09-01), Brown et al.
patent: 5809476 (1998-09-01), Ryan
patent: 5819221 (1998-10-01), Kondo et al.
patent: 5963965 (1999-10-01), Vogel
Ushioda, “Hierarchical Clustering of Words”, Fourth Workshop on Very Large Corpora, Aug. 4, 1996.
Ushioda, “Hierarchical Clustering of Words and Application to NLP Tasks”, COLING-96, Aug. 5, 1996.
Bellegarda et al., “A Novel Word Clustering Algorithm Based on Latent Semantic Analysis”, ICASSP, May 7-10, 1996, pp. 172-175.
Miller et al., “Evaluation of a Language Model Using a Clustered Model Backoff” ICASSP, May 7-10, 1996, pp.
Farhat et al., “Clustering Words For Statistical Language Models Based on Contextual Word Similarity”, ICASSP, May 7-10, 1996, pp. 180-183.
Edouard Patrick N.
Fujitsu Limited
Isen Forester W.
Staas & Halsey , LLP