Apparatus, method and storage medium for identifying a...

Data processing: speech signal processing – linguistics – language – Linguistics – Natural language

Reexamination Certificate


Details

C704S503000

Reexamination Certificate

active

06246976

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a language identifying apparatus and a language identifying method for judging the language of a character string represented by a character code string and the type of its character code (its character code system); to various apparatuses that identify the language of a text (a sentence) or of one or more words represented by fed (encoded) text data or a keyword, and that switch between various types of processing accordingly; and to a storage medium storing a computer program for controlling the apparatuses or realizing the method.
2. Description of the Background Art
Character codes for kanji (or hangeul) currently used in Japan, China (the People's Republic of China), South Korea, and Taiwan (the Republic of China) represent one character by two bytes. The character codes (systems) are independently defined for each language (Japanese, Chinese, Korean, etc.). Characters in the same language are represented by different character codes if the encoding method (the character code system, the type or kind of code, or the rule for encoding) differs. Information identifying the language is generally not attached to character code data. When a series of character codes is fed, therefore, the language that was encoded to produce those character codes cannot be judged simply.
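The ambiguity described above can be observed with a short Python sketch. The sample word and the list of codecs are assumptions chosen for illustration and do not come from the specification; the codec names are those of Python's standard library.

```python
def decode_all(raw: bytes, codecs):
    """Decode the same bytes under several character code systems.

    errors="replace" substitutes U+FFFD where a byte pair is undefined
    in a codec, so the comparison never raises.
    """
    return {codec: raw.decode(codec, errors="replace") for codec in codecs}

text = "日本語"                 # illustrative sample word (assumption)
raw = text.encode("euc_jp")     # two bytes per character in this code system

decoded = decode_all(raw, ("euc_jp", "gb2312", "euc_kr", "big5"))
for codec, result in decoded.items():
    # ascii() prints escapes, so the output is safe on any terminal
    print(codec, raw.hex(), ascii(result))
```

Nothing in the byte string itself says which of these readings is intended, which is the problem the invention addresses.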
A language information processing system such as a database search system, a translation system, or a speech synthesis system is constructed on the basis of a particular language and its character code system. Consider a language information processing system that handles a plurality of languages. Since language information processing differs depending on the language, the language represented by a fed keyword or text data must be determined. If the language represented by the fed keyword or text data and its character code system are not clear, suitable processing cannot be expected.
SUMMARY OF THE INVENTION
An object of the present invention is to make it possible to identify a language represented by a fed character code string and its character code system.
Another object of the present invention is to make it possible to perform various types of language information processing suitable for the respective languages, even when the language represented by an entered keyword or text data and its character code system are not known.
A character code identifying apparatus according to the first invention is an apparatus for identifying a combination of a language represented by encoded text data and its character code system, characterized by comprising a storage device storing, for each combination of a language and a character code system, a plurality of occurrence probability tables each describing the probability that a character code occurs in the combination, means for respectively reading out the occurrence probabilities from the plurality of occurrence probability tables with respect to one or a plurality of character codes included in the fed text data, to obtain evaluation data for each combination of the language and the character code system, and means for judging the combination of the language represented by the fed text data and the character code system on the basis of the obtained evaluation data.
The first invention also provides a method suitable for the above-mentioned apparatus. That is, the method is characterized by comprising the steps of preparing, for each combination of a language and a character code system, occurrence probability tables each describing the probability that a character code occurs in the combination, respectively reading out the occurrence probabilities from said plurality of occurrence probability tables with respect to one or a plurality of character codes included in fed text data, to obtain evaluation data for each combination of the language and the character code system, and judging the combination of the language represented by the fed text data and the character code system on the basis of the obtained evaluation data.
Furthermore, the present invention also provides a storage medium storing a program for carrying out the above-mentioned method. That is, the storage medium stores a program for identifying a combination of a language represented by encoded text data and its character code system using occurrence probability tables each describing for each combination of a language and a character code system the probability that a character code occurs in the combination, the program controlling a computer so as to respectively read out the occurrence probabilities from the plurality of occurrence probability tables with respect to one or a plurality of character codes included in fed text data, to obtain evaluation data for each combination of the language and the character code system, and to judge the combination of the language represented by the fed text data and the character code system on the basis of the obtained evaluation data. The storage medium is a magnetic disk storage device, a magneto-optic disk storage device, an optical disk storage device, a magnetic tape, a semiconductor memory, etc.
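As a rough sketch of what preparing such occurrence probability tables might look like, the Python below counts two-byte character codes in sample text and normalizes the counts. The specification does not say how the tables are built, so the counting approach, the tiny sample strings, and the helper names `two_byte_codes` and `build_table` are all assumptions made for illustration.

```python
from collections import Counter

def two_byte_codes(data: bytes):
    """Split bytes into consecutive two-byte character codes.

    Assumes purely double-byte text; a trailing odd byte is ignored.
    """
    return [data[i:i + 2] for i in range(0, len(data) - 1, 2)]

def build_table(sample_text: str, codec: str) -> dict:
    """Relative frequency of each two-byte code in one (language, code system) sample."""
    counts = Counter(two_byte_codes(sample_text.encode(codec)))
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}

# Hypothetical training samples; a real table would need a large corpus.
samples = {
    ("Japanese", "euc_jp"): "日本語の文字",
    ("Japanese", "shift_jis"): "日本語の文字",
    ("Chinese", "gb2312"): "中文的文字",
}
tables = {combo: build_table(text, combo[1]) for combo, text in samples.items()}

for combo, table in tables.items():
    print(combo, len(table), "distinct codes")
```

One table is kept per combination, so the same language appears once for each character code system it may be encoded in.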
The probability that a character code occurs depends on the combination of the language of the characters represented by the character codes and its character code system. Even for the same character code, the probability that the character code occurs differs depending on the language. Even in the same language, the probability that the same character code occurs differs depending on the character code system. The first invention is directed to judging the language represented by a character code and its character code system by paying attention to the occurrence probability of the character code, which is peculiar to each combination of a language and a character code system.
According to the first invention, the occurrence probabilities are read out from the occurrence probability tables for each character code in an entered character code string, so that the evaluation data is produced for each combination of the language and the character code system. If the evaluation data for a combination is low, the entered character code string is judged unlikely to belong to that combination of the language and the character code system. Conversely, if the evaluation data is high, the entered character code string is considered likely to belong to that combination. A combination of a language represented by fed text data (a character code string) and its character code system is thus judged on the basis of the evaluation data.
It is preferable that the product of the occurrence probabilities read out from the occurrence probability tables is calculated, and that the language represented by the text data and its encoding method are judged on the basis of the calculated product. If the probability of any one of the character codes is zero or very close to zero, the product becomes a very small value, so that the corresponding combination of the language and the character code system is clearly excluded.
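The product-based judgment can be sketched in Python as follows. The probability values are invented for illustration, the tables cover only three character codes, and the function names `split_codes`, `evaluate`, and `identify` are hypothetical; treating a code absent from a table as probability zero is also an assumption.

```python
import math

def split_codes(data: bytes):
    """Split bytes into consecutive two-byte character codes (double-byte text assumed)."""
    return [data[i:i + 2] for i in range(0, len(data) - 1, 2)]

def evaluate(codes, table) -> float:
    """Evaluation data: product of the occurrence probabilities of the codes.

    A code absent from the table is treated as probability 0 (assumption).
    """
    return math.prod(table.get(code, 0.0) for code in codes)

def identify(data: bytes, tables):
    """Return the best-scoring (language, code system) pair and all scores."""
    codes = split_codes(data)
    scores = {combo: evaluate(codes, table) for combo, table in tables.items()}
    return max(scores, key=scores.get), scores

fed = "日本語".encode("euc_jp")          # the fed character code string
c1, c2, c3 = split_codes(fed)

# Toy tables with made-up probabilities for the three codes above.
tables = {
    ("Japanese", "EUC-JP"): {c1: 0.02, c2: 0.015, c3: 0.01},
    ("Chinese", "GB2312"): {c1: 0.001, c2: 0.0005, c3: 0.0},
}

best, scores = identify(fed, tables)
print(best, scores)
```

A single zero probability drives the whole product to zero, which is how a combination is excluded outright. For long inputs a plain product can underflow; summing logarithms of the probabilities is a common alternative, although the text above does not mention it.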
A multilingual morphological analysis system according to the second invention is characterized by comprising language identification means for identifying a language represented by fed text data, a plurality of morphological analysis means respectively provided with respect to a plurality of languages, and control means for feeding the fed text data to the morphological analysis means suitable for the language identified by the language identification means.
The second invention also provides a method suitable for the above-mentioned apparatus. That is, the method is characterized by comprising the steps of providing a plurality of morphological analysis devices with respect to a plurality of languages, identifying a language represented by fed text data, and feeding the fed text data to the morphological analysis device suitable for the identified language.
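The control flow of the second invention, identifying the language and then feeding the text to the matching analyzer, might be sketched as below. The two analyzers and the identification rule are crude placeholders invented for this example; they stand in for real morphological analysis and for the language identification means, and none of the names come from the specification.

```python
from typing import Callable, Dict, List, Tuple

def analyze_japanese(text: str) -> List[str]:
    """Placeholder analyzer: treats every character as one morpheme."""
    return list(text)

def analyze_english(text: str) -> List[str]:
    """Placeholder analyzer: splits on whitespace."""
    return text.split()

# One morphological analysis routine per supported language.
ANALYZERS: Dict[str, Callable[[str], List[str]]] = {
    "Japanese": analyze_japanese,
    "English": analyze_english,
}

def identify_language(text: str) -> str:
    """Hypothetical stand-in for the language identification step."""
    return "Japanese" if any(ord(ch) >= 0x3040 for ch in text) else "English"

def analyze(text: str) -> Tuple[str, List[str]]:
    """Control step: route the fed text to the analyzer for its language."""
    language = identify_language(text)
    return language, ANALYZERS[language](text)

print(analyze("language identification"))
```

Keeping the analyzers in a lookup table means a further language can be supported by registering one more entry, without changing the control step.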
The second invention also provides a storag


Profile ID: LFUS-PAI-O-2495169
