System and method for identifying language using...

Data processing: speech signal processing – linguistics – language – Linguistics – Natural language

Reexamination Certificate

active

06415250

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention is directed to human language recognition technology and, more particularly, to a system and method that automatically identifies the language in which a text is written.
2. Related Art
Knowledge of the language of a document, referred to as the source language, can enable text-oriented applications such as word processors, presentation managers, search engines and other applications which process, store or manipulate text, to automatically select appropriate linguistic tools (hyphenation, spell checking, grammar and style checking, thesaurus, etc.). Knowledge of the source language also provides the ability to decide whether a text may need translation prior to display and enables documents to be classified and stored automatically and efficiently according to their source language. With respect to electronic communications, language identification enables communications applications to categorize, filter and prioritize messages, query hits and electronically-mailed messages and documents according to a preferred source language.
Generally, text-oriented applications are incapable of identifying the source language in which a given text is written. Instead, these applications typically assume that the text is written in a default source language, usually English, unless the language has been explicitly specified. In the event that the text was not written in the default language, any type of linguistic analysis or classification will fail or lead to indeterminate results.
There are several conventional methods that have been used to identify the source language in which an unknown text is written. One such conventional language identification method which is based on what is commonly referred to as trigram analysis is disclosed in U.S. Pat. No. 5,062,143 to Schmitt. A trigram is a sequence of three characters occurring anywhere in a body of text, and may contain blanks (spaces). Schmitt appears to disclose a system that determines a “key set” of trigrams for each language by parsing a sample of text (approximately 5,000 characters) written in each language. The “key set” for a language includes trigrams for which the frequency of occurrence within the unknown text accounts for approximately one third of the total number of trigrams present in the text.
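The key-set construction described above can be sketched as follows. This is only one reading of the description, not Schmitt's actual procedure: it assumes the key set consists of the most frequent trigrams of the sample text, added in descending order of frequency until they account for roughly one third of all trigram occurrences. The function name `build_key_set` and the `coverage` parameter are hypothetical.

```python
from collections import Counter


def build_key_set(sample_text, coverage=1 / 3):
    """Collect the most frequent trigrams of a language sample.

    Assumption: trigrams are added in descending order of frequency until
    they cover about `coverage` of all trigram occurrences in the sample.
    """
    # Every successive three-character window, blanks included.
    grams = [sample_text[i:i + 3] for i in range(len(sample_text) - 2)]
    counts = Counter(grams)
    target = coverage * len(grams)

    key_set, covered = set(), 0
    for gram, count in counts.most_common():
        if covered >= target:
            break
        key_set.add(gram)
        covered += count
    return key_set
```

In practice the sample would be a text of roughly 5,000 characters per language, as stated above; the function works on any string.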
To determine the language of an unknown text, Schmitt parses the text into successive trigrams and then iteratively compares each parsed trigram to the “key set” associated with each language. The number of times each parsed trigram matches a trigram in a key set is counted. Schmitt then calculates a ratio (“hit percent”) of the number of matches to the number of trigrams in the unknown text, and compares this hit percent to a predetermined threshold. If the hit percent for the particular language key set is greater than the predetermined threshold, Schmitt records the hit percent. After all language key sets have been processed, the language corresponding to the key set yielding the highest recorded hit percent is identified to be the language of the text. If there is no hit percent that exceeds the predetermined threshold, then no language identification is made.
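A minimal sketch of the hit-percent comparison just described, under stated assumptions: the key sets are given as plain Python sets, the threshold is a fraction between 0 and 1, and the names `trigrams` and `identify_by_trigrams` are hypothetical. It is meant to illustrate the procedure as summarized here, not to reproduce Schmitt's implementation.

```python
def trigrams(text):
    """Return every successive three-character window, blanks included."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def identify_by_trigrams(text, key_sets, threshold):
    """Return the language with the highest hit percent above `threshold`.

    `key_sets` maps a language name to its set of key trigrams.
    Returns None when no language exceeds the threshold.
    """
    grams = trigrams(text)
    if not grams:
        return None  # text too short to contain a single trigram

    best_language, best_hit = None, threshold
    for language, key_set in key_sets.items():
        # Count the parsed trigrams that match this language's key set.
        hits = sum(1 for gram in grams if gram in key_set)
        hit_percent = hits / len(grams)
        # Record only hit percents that exceed the threshold (or a prior best).
        if hit_percent > best_hit:
            best_language, best_hit = language, hit_percent
    return best_language
```

Note that every trigram of the unknown text is checked against every key set, so the work grows with both the text length and the number of represented languages.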
There are several disadvantages to such trigram-based language identification methods. First, a significant number of samples of the unknown text need to be obtained. Take, for example, an unknown text containing the word “monitor” (the average word length in the English language is approximately seven characters) which is separated from other words by spaces (denoted “_”). A trigram-based sampling of this word alone requires seven samples: _mo, mon, oni, nit, ito, tor, or_. The iterative comparison of each sampled trigram to the key set of each language is computationally expensive in both time and resources. In addition, because the identification of the unknown text requires a hit percent which exceeds a predetermined threshold, this approach is inherently based upon an assumption that a sufficient amount of sampled data from the unknown text is available to generate enough hits to exceed the threshold. As a result, this method is often found to be ineffective when used on small samples of unknown text, such as a title or header, that contain so few trigrams as to be incapable of exceeding this predetermined threshold.
Another conventional method used to identify the source language in which an unknown text is written is described in U.S. Pat. No. 5,548,507 to Martino et al. Martino appears to teach a language identification process using coded language words. Martino generates Word Frequency Tables (WFTs) each of which is associated with a language of interest. The WFT for a particular language contains relatively few words that are statistically determined to be the most frequent in the given language, based on a very large number of sample documents in that language. The sample documents for the represented language form a training corpus. The WFT also contains a Normalized Frequency Occurrence (NFO) value representing the frequency of occurrence of each word in the language. Associated with each WFT is an accumulator that stores and accumulates NFO values.
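The “monitor” example can be reproduced with a few lines of code. The helper name `word_trigrams` is hypothetical, and the underscore stands for the space on either side of the word, as in the text.

```python
def word_trigrams(word, pad="_"):
    """Return the trigrams of a word delimited by one blank on each side."""
    padded = pad + word + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


print(word_trigrams("monitor"))
# ['_mo', 'mon', 'oni', 'nit', 'ito', 'tor', 'or_']
```

A seven-letter word thus yields seven trigrams, each of which must be compared against the key set of every represented language.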
To determine the identity of an unknown text, a word from the text is compared to all the words in all of the WFTs. Whenever a word is found in any WFT, that word's associated NFO value is added to a current total in the associated accumulator. In this manner, the totals in the associated accumulators increase as additional words are successively sampled from the unknown text. Processing stops either at the end of the unknown text file, or after a predetermined number of sampled words have been processed. The language corresponding to the accumulator with the highest total NFO value is identified as the language of the text.
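The accumulator scheme described above can be sketched as follows. This is an illustration under stated assumptions, not Martino's implementation: words are obtained by lowercasing and splitting on whitespace, each WFT is a dictionary mapping a word to its NFO value, and `None` is returned when no word of the text appears in any table. The function name and the NFO values in the usage example are hypothetical, not real frequency data.

```python
def identify_by_word_frequency(text, word_frequency_tables, max_words=None):
    """Return the language whose accumulator holds the highest NFO total.

    `word_frequency_tables` maps a language name to a dict of word -> NFO.
    `max_words` optionally stops processing after that many sampled words.
    """
    # One accumulator per WFT, starting at zero.
    accumulators = {language: 0.0 for language in word_frequency_tables}

    words = text.lower().split()  # assumption: simple whitespace tokenizer
    if max_words is not None:
        words = words[:max_words]

    for word in words:
        # Compare the word to all the words in all of the WFTs.
        for language, table in word_frequency_tables.items():
            if word in table:
                accumulators[language] += table[word]

    best = max(accumulators, key=accumulators.get, default=None)
    if best is None or accumulators[best] == 0.0:
        return None  # assumption: no hits means no identification
    return best


# Illustrative tables only; the NFO values are made up.
tables = {
    "english": {"the": 1.0, "of": 0.5, "and": 0.4},
    "french": {"le": 1.0, "de": 0.9, "et": 0.5},
}
print(identify_by_word_frequency("The history of the world", tables))
# english
```

A text with no frequent function words, such as a two-word technical title, leaves every accumulator at zero in this sketch.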
The method described in Martino appears to provide advantages over Schmitt's trigram-based language identification. However, according to Martino, the method requires approximately 100 words to be read from the unknown text to identify the language in which it is written, and several hundred words are preferred. In addition to the large number of samples which must be taken, the success of the Martino device is dependent upon the type of unknown text. Because the most frequently occurring words in most languages are predominantly function words such as pronouns, articles and prepositions, this method has limited success when the unknown text does not contain such words. For example, the Martino device appears to have limited success when applied to highly technical documents and small texts such as a title, header or query. Furthermore, significant time and expense are required to generate the word frequency tables.
What is needed, therefore, is a system that efficiently and accurately identifies the language in which a text is written when provided with a relatively small number of samples of the text. System performance should not be adversely affected by a limited training corpus and should not depend on the content or length of the unknown text.
SUMMARY OF THE INVENTION
The above and other drawbacks of conventional language identification systems are overcome by the language identification system of the present invention. One aspect of the invention is a language identification system for automatically identifying a language in which an unknown input text is written based upon a probabilistic analysis of predetermined portions of words sampled from the input text which reflect morphological characteristics of natural languages.
In another aspect of the invention an automatic language identification system is disclosed. The automatic language identification system determines in which language of a plurality of represented languages a given text is written. This determination is based upon a value representing the relative likelihood that the text is a particular one of the represented languages due to the presence in the text of a predetermined character string that contains morphological features of th
