Method for identifying the language of individual words

Data processing: speech signal processing – linguistics – language – Linguistics – Natural language

Reexamination Certificate

Details

C704S001000, C382S230000

active

06292772

ABSTRACT:

BACKGROUND OF THE INVENTION
The state of the art for identifying the language of text documents involves the statistical analysis of the words and characters used in the entire document or sizable portions of the document. As such, the state of the art cannot identify the language of individual words in isolation, nor is it effective in identifying the language of documents that contain multiple languages, such as dual-language documents (e.g., Canadian parliamentary proceedings are printed in both English and French on the same page), or documents which contain short quotes of a foreign language or which occasionally use an isolated foreign language term.
PRIOR ART
U.S. Pat. No. 5,689,616 entitled “Automatic Language Identification/Verification System” relates to processing spoken text to extract phonetic speech features that are syllabic nuclei of languages to be identified using an artificial neural network. The method involves a comparison of the features of input speech with trained models for each language, where the models were trained using well-articulated reference speakers. The present invention is different in that it involves text, not speech, and uses a highly efficient and accurate regular expression instead of neural networks.
U.S. Pat. No. 5,189,727 entitled “Method and Apparatus for Language and Speaker Recognition” is also specific to speech and uses short frequency histograms to find the closest fit between the input speech spectra and several known languages. The present invention is different in that it involves text, not speech, and frequency spectra are irrelevant for text applications.
U.S. Pat. No. 5,548,507 entitled “Language Identification Process Using Coded Language Words” uses word frequency tables of the most common words in each language and their normalized frequency of occurrence to identify the most likely language in the document.
U.S. Pat. No. 5,701,497 entitled “Telecommunication Apparatus Having a Capability of Translation” requires the transmission of a protocol message that identifies the source language and so requires the sender to identify the language. The present invention is different in that the machine, rather than the sender, identifies the language used by the sender.
U.S. Pat. No. 5,440,615 entitled “Language Selection for Voice Messaging System” uses source information from the call (e.g., the area and country code of the caller's telephone number) to identify the most likely language used by the caller based on a stored list of the most common languages spoken at each location. The present invention is different in that it works in any textual environment and does not need the extra cues provided by a telephone caller ID system.
U.S. Pat. No. 5,392,419 entitled “Language Identification System and Method for a Peripheral Unit” tabulates syntactic cues present in the language to be identified. Each cue is assigned a positive or negative score for each language and the overall score for the document is the sum of the scores for the syntactic cues detected in the document. The language with the highest score is selected as the most likely language used in the document.
U.S. Pat. No. 5,062,143 entitled “Trigram-Based Method of Language Identification” uses letter trigrams to identify the language used in the document. For each language, it tabulates the trigrams that are most distinctive for the language (i.e., those that appear above a given frequency). It counts the number of such trigrams that appear in the document, comparing it to the total number of trigrams in the text. If the ratio is above a predetermined threshold, the document is identified as possibly using the associated language. The language with the highest ratio is selected as the language in which the document is written. The present invention is, however, not limited to letter trigrams, but uses letter n-grams of any length. Moreover, U.S. Pat. No. 5,062,143 allows the trigrams to overlap, whereas the present invention prevents the n-grams from overlapping and requires each word to be split into a sequence of language-specific n-grams without gaps or leftover letters. The present invention also allows some n-grams to be restricted to occurring in certain positions of the word, such as at the beginning, middle or end of the word. These differences are the keys to the higher accuracy of the present invention.
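For contrast with the present invention, the trigram-ratio approach summarized above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the tiny trigram sets in `PROFILES`, and the threshold value are hypothetical and are not taken from U.S. Pat. No. 5,062,143.

```python
def trigrams(text):
    """Return the overlapping letter trigrams of every word in the text."""
    grams = []
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        grams.extend(letters[i:i + 3] for i in range(len(letters) - 2))
    return grams


def trigram_ratio(text, distinctive):
    """Fraction of the text's trigrams that belong to one language's distinctive set."""
    grams = trigrams(text)
    if not grams:
        return 0.0
    hits = sum(1 for gram in grams if gram in distinctive)
    return hits / len(grams)


def identify(text, profiles, threshold=0.2):
    """Return the language with the highest ratio at or above the threshold, else None."""
    ratios = {lang: trigram_ratio(text, grams) for lang, grams in profiles.items()}
    candidates = {lang: ratio for lang, ratio in ratios.items() if ratio >= threshold}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Hypothetical toy profiles; a real system would tabulate these from training text.
PROFILES = {
    "english": {"the", "and", "ing", "ion", "hat"},
    "french": {"les", "ent", "que", "ait", "eau"},
}
```

Note that the trigrams here overlap and that unmatched letters are simply ignored, which is exactly the behavior the present invention avoids.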
U.S. Pat. No. 5,425,110 entitled “Method and Apparatus for Automatic Language Determination of Asian Language Documents” distinguishes different Asian languages in printed documents containing Asian characters by comparing histograms of optical pixel density of the connected components of the document image with profiles for each Asian language.
The present invention is different from these systems in that it identifies the language of individual words with very high accuracy, not entire documents. This allows the present invention to operate on a word-by-word basis, correctly identifying the language of words even when the document contains multiple languages (e.g., Canadian parliamentary proceedings contain both English and French) or includes short quotes of one language within a document that is mostly another language. This allows language-specific functionality, such as language-specific spelling correction and transliteration (e.g., ASCII-to-Kanji conversion of Japanese Romaji to Kanji letters) to occur on a word-by-word basis. The language identification statistics for the individual words of a document can be combined to identify the overall language of a document with much higher cumulative accuracy than the state of the art. It can also identify the number of languages present in mixed-language documents, the identity of the language and the relative frequency of occurrence of the language's lexicon. The present invention is also much more efficient in operation than the state-of-the-art methods.
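One way the per-word results might be combined into document-level statistics is sketched below. It is a minimal illustration, not the patented implementation: `classify_word` and `TOY_LEXICON` are hypothetical stand-ins for the word-level language test, and a simple lookup table is used here only so the example is self-contained.

```python
from collections import Counter

# Hypothetical stand-in for the per-word language test; not part of the source.
TOY_LEXICON = {
    "the": "english",
    "house": "english",
    "la": "french",
    "maison": "french",
}


def classify_word(word):
    """Return the language label for a word, or None if it is not recognized."""
    return TOY_LEXICON.get(word.lower())


def language_profile(words, classify=classify_word):
    """Relative frequency of each language among the recognized words."""
    counts = Counter()
    for word in words:
        lang = classify(word)
        if lang is not None:
            counts[lang] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}


def dominant_language(words, classify=classify_word):
    """Language with the largest share of recognized words, or None."""
    profile = language_profile(words, classify)
    return max(profile, key=profile.get) if profile else None
```

The number of keys in the returned profile gives the number of languages detected, and the values give their relative frequencies, so a mixed-language document is reported as a mixture rather than forced into a single label.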
SUMMARY OF THE INVENTION
Briefly, according to this invention, there is provided a computer implemented method of determining if a word is from a target language comprising the steps of decomposing the word into a plurality of n-grams and determining if a first n-gram, one or more following n-grams, if present, and a last n-gram match non-overlapping n-gram patterns characteristic of words in the target language. There is further provided a computer method for using regular expressions or finite state automata to identify the language of individual words. This method uses character n-grams of any length (e.g., unigrams, bigrams, trigrams, and so on, not just trigrams) to identify the language of individual words in isolation with high accuracy. Preferably, the method according to this invention uses regular expressions (e.g., from the Perl language) or finite state automata that recognize words as a sequence of non-overlapping n-grams without gaps. Preferably, the method recognizes words by testing a word for a sequence of n-grams without ignoring n-grams at the start or end of the word, preferably, without ignoring n-gram gaps or considering overlaps of n-grams anywhere in the word and, more preferably, testing the word for a sequence of n-grams using character n-grams with position restrictions (e.g., does an n-gram appear at the beginning, middle or end of the word).
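The matching scheme summarized above can be sketched as a single anchored regular expression. The sketch uses Python's `re` module rather than Perl, and the n-gram inventories below are hypothetical toy lists loosely modeled on romanized Japanese; they are assumptions for illustration, not the n-gram sets used by the invention.

```python
import re


def build_word_pattern(start, middle, end):
    """Compile a pattern: one start n-gram, zero or more middle n-grams, one end n-gram.

    The pattern is applied with fullmatch, so the whole word must be consumed.
    Every letter therefore belongs to exactly one n-gram: no gaps, no overlaps,
    and no leftover letters.
    """
    def alternation(ngrams):
        # Longest first for readability; backtracking explores all splits anyway.
        ordered = sorted(set(ngrams), key=len, reverse=True)
        return "(?:" + "|".join(re.escape(gram) for gram in ordered) + ")"

    return re.compile(alternation(start) + alternation(middle) + "*" + alternation(end))


# Hypothetical position-restricted n-gram inventories (toy data).
START = ["ka", "ko", "sa", "su", "to", "ta", "na", "ha", "mi", "yo", "a", "o"]
MIDDLE = ["ka", "ki", "ku", "shi", "ta", "to", "na", "ra", "ri", "mi", "yo", "n"]
END = ["ka", "ki", "shi", "ta", "to", "ra", "ri", "mi", "n", "o", "a", "u"]

TOY_PATTERN = build_word_pattern(START, MIDDLE, END)


def is_target_language(word, pattern=TOY_PATTERN):
    """True if the word splits completely into allowed n-grams in allowed positions."""
    return pattern.fullmatch(word.lower()) is not None
```

With these toy lists, `is_target_language("sakura")` succeeds through the split sa + ku + ra, while `is_target_language("street")` fails because no start n-gram matches. Because the expression is anchored at both ends, a word that matches only in part is rejected rather than scored.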
DESCRIPTION OF THE PREFERRED EMBODIMENTS
This invention consists of a computer method for identifying the words of a particular language. As used herein, the term “word” is used in its normal sense to mean a string of characters that as ordered have meaning in a given language. The method has been implemented in the Perl language as described, for example, in Learning Perl by Randal L. Schwartz & Tom Christiansen (O'Reilly & Associates, Inc. 1997) with a matching expression. The matching expression tests a string of characters for an n-gram match at the beginning of the word, followed by one or more of a small set of n-grams within the word, followed by a match at the end of the word. This matching expression, also known as a regular expression in Perl, attempts to split the word int
