Reexamination Certificate
1998-03-19
2001-08-07
Thomas, Joseph (Department: 2747)
Data processing: speech signal processing, linguistics, language
Linguistics
Multilingual or national language support
C704S009000
active
06272456
ABSTRACT:
TECHNICAL FIELD
This invention generally relates to identifying the language of written text and, more particularly described, relates to identifying a language of a document from a small sample input of the document by using n-gram profiles.
BACKGROUND OF THE INVENTION
As large data networks span the globe to make the online world truly a multinational community, there is still no single human language in which to communicate. Electronic messages and documents remain written in a particular human language, such as German, Spanish, Portuguese, Greek, or English. In many situations, there is a need to quickly identify the human language of a particular document in order to further process the document. For example, identification of the document's human language may help when a user or system attempts to index or classify the document. In another situation, a word processor may need to determine the language of the document in order to use the appropriate spell checking, grammar checking, and language translation tools and libraries.
There are a variety of known methods for identifying the human language of text within an electronic document. In one method, a table is maintained having frequent function words in a variety of human languages. Examples of such frequent function words in the English language may include the words “the,” “a,” “which,” and “you.” For a particular document, a count is performed to determine how many of the frequent function words were found for each language. The language having the most frequent function words is identified as the language of the document. Unfortunately, this method typically requires that the section of the document read when determining the language be very long, because a large amount of input is required before an accurate determination of the document's language can be made. Furthermore, this method becomes problematic as the number of possible languages increases, because it becomes more difficult to distinguish between languages.
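The function-word counting method described above can be sketched roughly as follows. This is a minimal illustration, not the implementation of any particular system: the `FUNCTION_WORDS` table, its tiny word lists, and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word tables (assumption);
# a practical table would hold many more words per language.
FUNCTION_WORDS = {
    "english": {"the", "a", "which", "you"},
    "spanish": {"el", "la", "que", "de"},
    "german": {"der", "die", "und", "das"},
}


def count_function_words(text):
    """Count, per language, how many tokens of the text are function words."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter()
    for language, words in FUNCTION_WORDS.items():
        counts[language] = sum(1 for token in tokens if token in words)
    return counts


def guess_language(text):
    """Return the language with the most hits, or None if nothing matched."""
    counts = count_function_words(text)
    language, hits = counts.most_common(1)[0]
    return language if hits > 0 else None
```

With a short input, very few tokens hit the table, which illustrates why this approach tends to need a long section of the document before the counts separate the languages.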
Another method for identifying a document's language uses a set of predetermined rules regarding the occurrence of particular letters or sequences of particular letters that are unique to a specific human language. For example, the letter “å” is unique to the Swedish language. Thus, any document having the letter “å” is determined to be in the Swedish language. In another example, words ending in the letter sequence “çao” are unique to the Brazilian language. However, as with the previous method, the use of a set of predetermined rules can become problematic with a large number of potential languages. Additionally, this method does not perform well with only a limited or small amount of input text.
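A rule-based identifier of the kind described above might look like the following sketch. The two rules are copied from the examples in the text purely for illustration; the `RULES` list, the language labels, and the function name are assumptions, and a real rule set would be much larger.

```python
import re
import unicodedata

# Illustrative rules taken from the two examples in the text (assumption:
# each rule is a predicate over the list of lowercase words plus a label).
RULES = [
    (lambda words: any("å" in word for word in words), "swedish"),
    (lambda words: any(word.endswith("çao") for word in words), "brazilian"),
]


def identify_by_rules(text):
    """Return the label of the first rule that fires, or None."""
    # Normalize so that composed and decomposed accents compare equal.
    normalized = unicodedata.normalize("NFC", text.lower())
    words = re.findall(r"[^\W\d_]+", normalized)
    for predicate, language in RULES:
        if predicate(words):
            return language
    return None
```

A short input frequently triggers no rule at all, in which case the sketch returns `None`; this matches the limitation the text notes for small amounts of input.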
A third and popular method for identifying a document's language is known as a “tri-gram” method. In the tri-gram method, training documents representing each language are used to create a table or profile for each language, called a tri-gram language profile. More particularly stated, a three-letter window is slid over a training document in a particular language. As the three-letter window is slid over the training document, the method counts the occurrences of each three-letter sequence appearing in the window. This yields a language profile, called a tri-gram language profile, for the particular language that characterizes the appearance of specific three-letter sequences. This is repeated for all of the languages to provide a set of tri-gram profiles, one for each language. When attempting to determine the language of an unknown document, a similar three-letter window is slid over the unknown document. For each three-letter sequence within the unknown document, the method seeks to find matching three-letter sequences in each of the tri-gram profiles. If a match is found for a particular language, the frequency information within that language's tri-gram profile for the matched three-letter sequence is added to a cumulative score for the particular language. In this manner, cumulative scores for each language are incremented as the window is slid over the whole unknown document. The language having the highest cumulative score is then deemed the language of the unknown document.
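The tri-gram method described above can be sketched as follows. This is a minimal illustration under stated assumptions: raw occurrence counts stand in for the "frequency information", text is lowercased before windowing, and the function names are hypothetical.

```python
from collections import Counter


def build_trigram_profile(training_text):
    """Slide a three-letter window over the text and count each sequence."""
    text = training_text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def score_languages(sample, profiles):
    """Add each matched tri-gram's count to that language's cumulative score."""
    text = sample.lower()
    scores = {language: 0 for language in profiles}
    for i in range(len(text) - 2):
        trigram = text[i:i + 3]
        for language, profile in profiles.items():
            scores[language] += profile.get(trigram, 0)
    return scores


def identify_language(sample, profiles):
    """Return the highest-scoring language, or None if nothing matched."""
    scores = score_languages(sample, profiles)
    if not scores or max(scores.values()) == 0:
        return None
    return max(scores, key=scores.get)
```

A possible usage, with toy training strings standing in for real training documents:

```python
profiles = {
    "english": build_trigram_profile("the quick brown fox jumps over the lazy dog"),
    "spanish": build_trigram_profile("el rápido zorro marrón salta sobre el perro perezoso"),
}
identify_language("el perro", profiles)
```

Because every window position is compared against every profile, the work grows with both the document length and the number of languages. Raw counts also favor languages with larger training documents, so a practical version would likely normalize them.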
Unfortunately, the tri-gram method is typically computationally intensive when compared to the other two illustrative methods described above. Furthermore, the tri-gram method may have problems accurately identifying the language of a document based on only a small amount of input text from the document.
Some commentators have suggested variations to improve the tri-gram method, such as using longer training documents. Another known improvement uses a similar method to look for either two letters (bi-gram), four letters (quad-gram), or some other predefined number of letters within the window. However, all of the above-described methods for identifying a document's language typically require relatively large amounts of input text from the unknown document in order to accurately determine the correct language.
Accurately identifying the language of input text can be problematic when the input text is merely a sentence or some other short length of text. For example, it may be desirable to recognize the language of a search query to a World Wide Web search engine on the Internet. The ability to quickly identify the language of the search query allows such a search engine to limit the search to documents that match the language of the search query. Therefore, there is a need for a system for identifying the language of a sample input of text that (1) is quick, (2) remains accurate, and (3) can accurately identify the language when the sample input is very short.
SUMMARY OF THE PRESENT INVENTION
The present invention satisfies the above-described needs by providing a system and method for using multiple sets of language profiles, such as n-gram language profiles, to more accurately identify the language of a sample input of written text. The present invention is especially useful and advantageous when attempting to identify the language of a very short sample input, such as a single sentence or even a word. In general, a reference letter sequence is a sequence of letters, also referred to as an n-gram, that appears with a certain frequency in a particular language. An n-gram profile, more generally referred to as a language profile, maintains a variety of different reference letter sequences (n-grams) of a particular length (such as 3-grams or 4-grams) along with their associated frequencies of occurrence. The frequency at which a reference letter sequence occurs is more generally referred to as a frequency parameter. As such, n-gram profiles include n-gram profiles for each language being considered and may include n-gram profiles having different associated lengths, such as a set of 3-gram profiles and a set of 4-gram profiles for a variety of languages.
In general, the present invention provides a method for using a plurality of n-gram profiles to more accurately identify a language for a sample input of text. Typically, a length of the largest n-gram profile is determined, such as four letters or five letters in length. A window of letters is identified within the sample input, preferably being the length of the largest n-gram profile. If the window contains one or more matches when compared to the reference letter sequences in the n-gram profiles for each language, then the longest match is kept. Typically, the matches contain the first letter within the window.
The longest match is scored as a score for each language based upon a frequency parameter maintained in the n-gram profiles, preferably the n-gram profiles having a length matching the length of the longest match. The frequency parameter is related to the longest match and typically identifies how many times the longest match appears in training data for the particular n-gram profile. Once the longest match has been scored, the window is shifted within the sample inpu
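The windowing and longest-match scoring summarized above can be sketched as follows. This is one possible reading of the summary, not the patented implementation. Several details are assumptions made for the example: the nested `profiles` layout, raw counts as the frequency parameter, a single longest match shared across languages, and shifting the window by one letter (the source text is cut off before it says how the window moves).

```python
from collections import Counter


def build_profiles(training_texts, lengths=(3, 4)):
    """Map language -> {n: Counter of n-grams} from per-language training text."""
    profiles = {}
    for language, text in training_texts.items():
        text = text.lower()
        profiles[language] = {
            n: Counter(text[i:i + n] for i in range(len(text) - n + 1))
            for n in lengths
        }
    return profiles


def score_sample(sample, profiles):
    """Score each language using the longest n-gram match at each window."""
    text = sample.lower()
    # Profile lengths, longest first; the window is as long as the largest.
    lengths = sorted({n for by_len in profiles.values() for n in by_len},
                     reverse=True)
    scores = {language: 0 for language in profiles}
    if not lengths:
        return scores
    max_n = lengths[0]
    for start in range(len(text)):  # assumption: shift by one letter
        window = text[start:start + max_n]
        for n in lengths:
            if len(window) < n:
                continue
            # Candidate matches always begin at the window's first letter.
            candidate = window[:n]
            if any(candidate in by_len.get(n, ()) for by_len in profiles.values()):
                # Keep only this longest match; score it in every language
                # from the profile whose length equals the match length.
                for language, by_len in profiles.items():
                    scores[language] += by_len.get(n, {}).get(candidate, 0)
                break
    return scores
```

In this sketch, a shorter n-gram is consulted only when no language's profile contains the longer prefix of the window, so longer and more distinctive sequences take precedence on short inputs.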
Kilpatrick & Stockton LLP
Microsoft Corporation
Thomas Joseph