Reexamination Certificate
1996-09-30
2001-04-10
Isen, Forester W. (Department: 2747)
Data processing: speech signal processing, linguistics, language
Linguistics
Natural language
C704S001000, C707S793000
active
06216102
ABSTRACT:
BACKGROUND OF THE INVENTION
The subject invention relates generally to human language recognition technology. More particularly, the invention relates to a technique for identifying the language used in a computerized document.
Computers and computer networks have intensified the transmission of coded documents between people who speak and write in different natural languages. The internet has recently accelerated this process. This results in several problems. In the prior art, for example, when an electronic document was sent across national boundaries, computer system operations were interrupted so that a human being could determine the natural language of the received document before performing a given operation, such as selecting, displaying, or printing, which may depend upon the peculiarities of a given natural language. In the context of an internet search, unless the user is multilingual, he is likely to be interested only in the retrieved documents in his native language or, at any rate, only in those languages he reads.
The invention described herein eliminates the need for such human intervention by automatically determining the correct natural language of the computer recorded document.
Prior to the applicants' own contributions to the art, the general problem was recognized in the prior art. In the area of automated language identification of coded text, the prior art used n-gram character based systems, which handle each character multiple times, a process which consumes a great deal of system resources when compared to the applicants' word-based technique described below. In speech recognition systems, language recognition uses language and speech characteristics, e.g., trigrams or emphasis, which require large amounts of text to be parsed and measured, and large amounts of time for processing. These techniques rely on some form of matching algorithm based on language statistics that are not meaningful in a linguistic context.
Prior systems using trigrams, n-grams, and other artificial divisions in a computerized text are not considered reliable, and they are very slow and consume considerable computer time, as they handle each character of a document multiple times; e.g., each document character appears in three different trigrams. Characteristics which are measured in, or derived from, written languages but which are not actual components of them, such as trigrams or letter sequences, have limited success in identifying the correct language, and require large amounts of text to be parsed and measured. Similarly, prior systems which depend on the attributes of individual characters and their local contexts are also limited when applied to the problem of identifying a language.
In the invention described herein, none of the prior art techniques, e.g., classifying language by signal waveform characteristics, trigrams, n-grams, or artificial divisions of written language, are used. In both the parent application and the present invention, words are read from a computer document and compared to predetermined lists of words selected from a plurality of languages of interest. The word lists comprise relatively few of the most commonly used words in each language; statistically, a significant percentage of all words in any document will be the most common words used in its language. The language or genre of the document is identified by a process that determines which language's word list most closely matches the words in the document.
In the parent application, the applicants have taught that the closeness of match can be determined by the sum of the normalized frequency of occurrence of listed words in each language or genre of interest. Each language's word-list and the associated frequency of occurrence for each word in the list is kept in a word table. The word table is linked with a respective accumulator whose value is increased each time a word from an inputted document matches one of the common words in one of the tables. The process adds the word's normalized frequency of occurrence, as found in the word table, to the current sum in the accumulator associated with the respective language. When processing stops, the identified language is the language associated with the highest-valued accumulator. Processing may stop either by reaching the end of the document or by achieving a predetermined confidence in the accumulated discrimination.
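The accumulation process of the parent application, as summarized above, can be sketched as follows. This is a minimal illustration only: the table contents and frequency values are made up for the example (they are not measured language statistics), and the names `WORD_TABLES` and `identify_language` are hypothetical. Tokenization by whitespace is likewise an assumption, and the sketch always reads to the end of the text rather than stopping early at a confidence threshold.

```python
# Hypothetical word tables: word -> normalized frequency of occurrence.
# The values are illustrative placeholders, not real language statistics.
WORD_TABLES = {
    "english": {"the": 0.07, "of": 0.04, "and": 0.03, "to": 0.03},
    "german": {"der": 0.04, "die": 0.04, "und": 0.03, "in": 0.02},
}


def identify_language(text, word_tables=WORD_TABLES):
    """Return (best_language, accumulators) for the given text."""
    # One accumulator per candidate language, all starting at zero.
    accumulators = {lang: 0.0 for lang in word_tables}
    for word in text.lower().split():
        for lang, table in word_tables.items():
            # On a match, add the word's normalized frequency to the
            # accumulator for that language; non-matches add nothing.
            accumulators[lang] += table.get(word, 0.0)
    # The identified language is the one with the highest accumulator.
    best = max(accumulators, key=accumulators.get)
    return best, accumulators


if __name__ == "__main__":
    print(identify_language("the cat and the dog went to the park"))
    print(identify_language("der hund und die katze"))
```

In this sketch a word that appears in several tables contributes to every matching accumulator, which is consistent with the description of one accumulator per word table.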
However, the applicants have taught that weighting in the accumulation process is less preferred and that it can be eliminated if the actual frequency of occurrence of words in each of the candidate natural languages can be established and the word tables assembled for the respective candidate languages have substantially equivalent coverage.
The present application is an improvement of the basic invention of word counting for natural language determination to allow the language identification in the most efficient and expeditious manner.
SUMMARY OF THE INVENTION
It is therefore an object of the invention to identify the natural language in which a computer stored document is written from a plurality of candidate languages in a most efficient manner.
It is another object of the invention to identify in which of several pre-specified natural languages a given body of text is written.
It is another object of the invention to provide a mechanism which is very, very fast.
It is another object of the invention to minimize the memory requirements.
It is another object of the invention that the memory requirements are fixed regardless of the number of words stored.
These objects and others are accomplished by comparing the short and truncated words of a document to word tables of the most frequently used words in each of the respective candidate languages to identify the language in which the document is written. First, a plurality of words from a document is read into a computer memory. Then, words within the plurality of words which exceed a predetermined length are truncated to produce a set of short and truncated words. The set of short and truncated words is compared to words in a plurality of word tables. Each word table is associated with a respective candidate language and contains a selection of the most frequently used words in that language. Although the most frequently used words in most languages tend to be short, those which exceed the predetermined length may be truncated in the word tables. A respective count for each candidate language is incremented each time one of the set of short and truncated words from the document matches a word in a word table associated with the candidate language. In some embodiments, the count may be weighted by factors related to the frequency of occurrence of the words in the respective candidate languages. The language of the document is identified as the language associated with the count having the highest value.
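The steps summarized above (truncate, compare, count, pick the highest count) can be sketched as follows. The truncation length of four letters, the sample word lists, the punctuation handling, and the names `truncate`, `build_table`, `count_matches` and `identify` are all assumptions made for illustration; the summary does not fix any of them. The sketch uses unweighted counts.

```python
MAX_LEN = 4  # hypothetical "predetermined length"; the summary gives no value


def truncate(word, max_len=MAX_LEN):
    """Keep short words whole; cut longer words to max_len letters."""
    return word[:max_len]


def build_table(words, max_len=MAX_LEN):
    """Word table holding the same truncated form used for document words."""
    return {truncate(w.lower(), max_len) for w in words}


# Illustrative word lists only; not the tables of any actual embodiment.
TABLES = {
    "english": build_table(["the", "of", "and", "to", "that", "which", "would", "there"]),
    "german": build_table(["der", "die", "und", "in", "nicht", "werden", "einen"]),
}


def count_matches(text, tables=TABLES, max_len=MAX_LEN):
    """Increment a per-language count for each matching truncated word."""
    counts = {lang: 0 for lang in tables}
    for raw in text.lower().split():
        key = truncate(raw.strip(".,;:!?\"'()"), max_len)
        if not key:
            continue
        for lang, table in tables.items():
            if key in table:
                counts[lang] += 1
    return counts


def identify(text, tables=TABLES):
    """Return the language whose count is highest."""
    counts = count_matches(text, tables)
    return max(counts, key=counts.get)


if __name__ == "__main__":
    sample = "There would be nothing which the others wanted."
    print(count_matches(sample), identify(sample))
```

Because both the document words and the table entries are truncated to the same length, a long common word such as "which" still matches through its stored prefix.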
The speed of language determination by this invention is very fast, because only a relatively small number of words need to be read from any document to reliably determine its language or genre. All languages can be processed in parallel in hardware at commensurate hardware speeds, as opposed to software speeds. The basic operations required within the hardware are far simpler, hence intrinsically faster, than their software equivalents.
Further, an advantage of the present invention is that only a few words, e.g., 25-200, need be contained in the word table for each candidate language of interest, so that in practice each word is tested against only a relatively small number of words for reliable language recognition. In the hardware embodiment, no comparison between words is performed. Each word is used as an address into a bit table (set of tables) and the address either contains all “1” bits or not. There is no comparison operation. As discussed below, it is important that the words selected for the word frequency tables for each language cover a commensurate percentage of the frequency of occurrences in their respective languages.
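A software analogue of the bit-table lookup described above might look like the following sketch. The text above does not say how a word is turned into an address, so the scheme used here (one table per adjacent letter-pair position, each addressed by the two letters at that position) is purely an assumption, as are the four-letter word length and the names `addresses`, `build_tables` and `is_member`. One byte per bit is used for readability.

```python
MAX_LEN = 4                      # hypothetical truncation length
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
PAD = len(ALPHABET)              # index for padding and unknown characters
SIDE = len(ALPHABET) + 1         # 26 letters plus the pad symbol
TABLE_BITS = SIDE * SIDE         # one bit per possible letter pair


def char_index(ch):
    """Map a character to 0..25, or to PAD if it is not a plain letter."""
    i = ALPHABET.find(ch)
    return i if i >= 0 else PAD


def addresses(word):
    """Assumed addressing: one (table, address) per adjacent letter pair."""
    w = word.lower()[:MAX_LEN].ljust(MAX_LEN, "_")
    return [
        (pos, char_index(w[pos]) * SIDE + char_index(w[pos + 1]))
        for pos in range(MAX_LEN - 1)
    ]


def build_tables(words):
    """Set the addressed bit in every table for each listed word."""
    tables = [bytearray(TABLE_BITS) for _ in range(MAX_LEN - 1)]
    for word in words:
        for table, address in addresses(word):
            tables[table][address] = 1
    return tables


def is_member(tables, word):
    """A word is accepted only if every addressed bit is 1; no word
    is ever compared against another word."""
    return all(tables[table][address] for table, address in addresses(word))


if __name__ == "__main__":
    english = build_tables(["the", "of", "and", "to"])
    print(is_member(english, "the"), is_member(english, "der"))
```

Under this assumed scheme the storage per language is fixed at `(MAX_LEN - 1) * TABLE_BITS` bits however many words are loaded. The trade-off is that a word not in the list can be accepted if all of its letter pairs happen to be set by other words.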
Martino Michael John
Paulsen, Jr. Robert Charles
Edouard Patrick N.
International Business Machines - Corporation
Isen Forester W.
LaBaw Jeffrey S.