Data processing: speech signal processing – linguistics – language – Linguistics – Natural language
Reexamination Certificate
1999-01-26
2002-05-28
Edouard, Patrick N. (Department: 2747)
active
06397174
ABSTRACT:
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a method of and an apparatus for processing an input text. The present invention also relates to a method of and an apparatus for performing an approximate translation. The invention further relates to a storage medium. Such methods and apparatuses may be used in natural language processing, document processing and text processing. For instance, such methods and apparatuses may be used as a glossing system which provides translations of words or groups of words in an input text into corresponding words or symbols or groups thereof in a different natural language.
DISCUSSION OF THE RELATED ART
Text in natural languages generally contains words or symbols which are associated with each other to have a meaning which is different from the individual meanings of the words or symbols. Such groups are referred to as “collocations” and must be identified as such if the text is to be processed correctly, for instance to access an index of a dictionary (monolingual, bilingual or multilingual), thesaurus or encyclopaedia.
There are known systems for analysing input text by parsing, i.e. analysing a sentence to determine the relationships between its words. Parsing is effective in optimally labelling a sentence with its collocations. However, this technique generally involves superfluous processing and is computationally complex. It also requires a vast amount of knowledge to drive it, e.g. grammar rules and the semantic constraints that related words exert upon each other.
Another known technique finds the biggest continuous collocation, where “continuous” in this context means that the words of the collocation are adjacent to each other in the input text. However, such techniques cannot distinguish between collocations of the same length. For instance, in the sentence “Air passes out of the furnace through a pipe.”, there are two collocations each having two words, namely “passes out” and “out of”. This technique cannot decide which of these collocations should be chosen.
A known technique for finding discontinuous collocations is disclosed in EP 0 637 805. This technique uses a part-of-speech tagger to attempt to select the best collocations from input text. Such a technique helps to distinguish between “bus stops” where “stops” is a noun, and “stops at” where “stops” is a verb in the sentence “the bus stops at Grenoble”. However, this technique is not capable of indicating which of these possible collocations is optimal. Further, the technique does not provide a means for finding a consistent labelling of collocations for a sentence.
Although these techniques can consistently identify collocations in an input text which share no words with each other, they cannot identify the optimal collocation where two or more possible collocations have one or more words in common. As the above examples illustrate, it is essential to select the correct collocation with a high degree of reliability if the collocation is to be used, for instance, to access an index such as a dictionary.
SUMMARY OF THE INVENTION
According to a first aspect of the invention, there is provided a method of processing an input text comprising a plurality of words, the method comprising the steps of:
deriving from the input text a plurality of sets such that each set comprises at least one of the words of the input text, all of the words of each set are present in the input text, and the words of each set, if any, containing more than one word constitute a collocation;
assigning to each set a unique relative rank;
comparing each set in order of decreasing relative rank with the input text; and
selecting each set, all of whose words are present in the input text and none of whose words are present in a previously selected set of higher relative rank.
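The comparing and selecting steps above can be sketched as a single pass over the sets in order of decreasing rank. This is a minimal illustration and not the patented implementation: it assumes each set is represented by the positions of its words in the input text, and the function name `select_sets` and the example ranking are hypothetical.

```python
from typing import FrozenSet, List, Sequence


def select_sets(n_words: int,
                ranked_sets: Sequence[FrozenSet[int]]) -> List[FrozenSet[int]]:
    """Select sets in order of decreasing rank.

    ranked_sets holds word-position sets, highest rank first (assumed
    representation). A set is selected if all of its positions lie in the
    input text and none was used by a previously selected set.
    """
    used: set = set()
    selected: List[FrozenSet[int]] = []
    for candidate in ranked_sets:
        in_text = all(0 <= pos < n_words for pos in candidate)
        if in_text and used.isdisjoint(candidate):
            selected.append(candidate)
            used |= candidate
    return selected


# Base forms of "Air passes out of the furnace through a pipe."
words = ["air", "pass", "out", "of", "the", "furnace", "through", "a", "pipe"]
# Assumed ranking: "out of" above "pass out", then the single-word sets.
ranked = [frozenset({2, 3}), frozenset({1, 2})]
ranked += [frozenset({i}) for i in range(len(words))]
selected = select_sets(len(words), ranked)
```

With this assumed ranking, "out of" is selected first, "pass out" is rejected because it shares the word "out", and the remaining words are covered by single-word sets.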
Each of the words of the input text may be present in at least one of the sets.
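One way to sketch the deriving step is to match the input text against a list of known collocations and to add a single-word set for every word, so that each word is present in at least one set. The function `derive_sets`, the toy collocation list and the requirement that collocation words appear in listed order are assumptions made for illustration only.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Sequence, Tuple


def derive_sets(words: Sequence[str],
                collocations: Sequence[Tuple[str, ...]]) -> List[FrozenSet[int]]:
    """Return word-position sets: one per word, plus one per collocation match.

    Each collocation is assumed to list two or more base forms in text order.
    """
    positions: Dict[str, List[int]] = {}
    for i, word in enumerate(words):
        positions.setdefault(word, []).append(i)

    # Single-word sets guarantee that every word is in at least one set.
    sets = [frozenset({i}) for i in range(len(words))]

    for colloc in collocations:
        if not all(w in positions for w in colloc):
            continue  # some word of the collocation is absent from the text
        for combo in product(*(positions[w] for w in colloc)):
            # Keep only occurrences in strictly increasing text order
            # (an assumption; the words need not be adjacent).
            if all(a < b for a, b in zip(combo, combo[1:])):
                sets.append(frozenset(combo))
    return sets


words = ["air", "pass", "out", "of", "the", "furnace", "through", "a", "pipe"]
derived = derive_sets(words, [("pass", "out"), ("out", "of")])
```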
All of the words of the input text may be present in the union of the selected sets. The term “union” is used in its conventional mathematical sense and means a set containing all the words of the selected sets.
The input text may comprise a grammatically complete sample of text.
The words may comprise basic word forms derived from an original text by linguistic (e.g. morphological) analysis in a preliminary step.
The assigning step may comprise a first step of assigning a priority value which increases with increasing number of words in the set.
The assigning step may comprise a second step of assigning a priority value which decreases with increasing span of the words of the set in the input text. The term “span” means the number of words, including the words of the set themselves, between the word of the set which occurs first in the input text and the word of the set which occurs last in the input text.
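The definition of span can be illustrated with a short function. It assumes, for illustration, that a set is given as the positions of its words in the input text; the name `span` is taken from the text, the representation is not.

```python
from typing import Iterable


def span(positions: Iterable[int]) -> int:
    """Number of words from the first to the last word of the set, inclusive."""
    pos = list(positions)
    return max(pos) - min(pos) + 1


# "the bus stops at Grenoble": "stops at" occupies positions 2 and 3.
adjacent_span = span({2, 3})
# A hypothetical discontinuous set at positions 1 and 3 spans three words.
discontinuous_span = span({1, 3})
```

A continuous collocation therefore has a span equal to its number of words, while any intervening word increases the span and lowers the priority assigned in the second step.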
The second step may be performed only if the first step results in more than one set having the same priority value.
The assigning step may comprise a third step of assigning a priority value which is dependent on the linguistic relationship between at least one word of the set and at least one word of the input text not in the set.
The third step may be performed only if the second step results in more than one set having the same priority value.
The assigning step may comprise a fourth step of assigning a priority value which increases with position to the right in the input text of the right-most word of the set. This is appropriate for languages such as English which tend to be right-branching.
The fourth step may be performed only if the third step results in more than one set having the same priority value.
The assigning step may comprise a fifth step of assigning a priority value by default.
The fifth step may be performed only if the fourth step results in more than one set having the same priority value.
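The cascade of the first to fifth steps, each applied only when the previous one leaves a tie, behaves like a lexicographic comparison. A minimal sketch, assuming sets are word-position sets, is to sort on a tuple key. The `linguistic_score` callback stands in for the third step, whose details are not given here, and the use of derivation order as the default fifth step is likewise an assumption.

```python
from typing import Callable, FrozenSet, List, Sequence


def rank_sets(sets: Sequence[FrozenSet[int]],
              linguistic_score: Callable[[FrozenSet[int]], float] = lambda s: 0
              ) -> List[FrozenSet[int]]:
    """Return the sets ordered from highest to lowest rank."""
    def key(item):
        index, s = item
        return (
            -len(s),               # step 1: more words ranks higher
            max(s) - min(s) + 1,   # step 2: smaller span ranks higher
            -linguistic_score(s),  # step 3: placeholder linguistic preference
            -max(s),               # step 4: right-most word further right
            index,                 # step 5: default (derivation order, assumed)
        )
    return [s for _, s in sorted(enumerate(sets), key=key)]


candidates = [frozenset({1, 2}), frozenset({2, 3}),
              frozenset({0}), frozenset({1, 3, 4})]
ranked = rank_sets(candidates)
```

Because tuples compare element by element, a later element matters only when all earlier ones are equal, and the final index makes every rank unique. Here the sets at positions {1, 2} and {2, 3} tie on the first three steps, and the fourth step places {2, 3} higher.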
The assigning step may comprise assigning a priority value based on a measure of probability of each set.
The method may comprise accessing an index of word sets with at least one of the selected sets.
According to a second aspect of the invention, there is provided a method of performing an approximate translation of an input text in a first natural language to a second natural language, comprising performing a method in accordance with the first aspect of the invention, in which the index is a dictionary, such as a bilingual dictionary, and outputting dictionary entries in the second language corresponding to the selected sets.
The first and second languages may be the same language but more usually are different languages.
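The approximate translation of the second aspect can be sketched as a dictionary lookup on each selected set. The function `gloss` and the toy English-to-French entries below are illustrative assumptions, not data from the patent.

```python
from typing import Dict, FrozenSet, List, Optional, Sequence, Tuple


def gloss(words: Sequence[str],
          selected: Sequence[FrozenSet[int]],
          dictionary: Dict[Tuple[str, ...], str]) -> List[Optional[str]]:
    """Look up each selected set in the dictionary, in text order.

    Sets without an entry yield None.
    """
    output = []
    for s in sorted(selected, key=min):
        entry_key = tuple(words[i] for i in sorted(s))
        output.append(dictionary.get(entry_key))
    return output


# Base forms of "the bus stops at Grenoble" with assumed selected sets.
words = ["the", "bus", "stop", "at", "grenoble"]
selected = [frozenset({2, 3}), frozenset({0}), frozenset({1}), frozenset({4})]
toy_dictionary = {  # hypothetical entries
    ("the",): "le",
    ("bus",): "bus",
    ("stop", "at"): "s'arrêter à",
    ("grenoble",): "Grenoble",
}
glosses = gloss(words, selected, toy_dictionary)
```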
According to a third aspect of the invention, there is provided an apparatus for processing an input text comprising a plurality of words, the apparatus comprising:
means for deriving from the input text a plurality of sets such that each set comprises at least one of the words of the input text, all of the words of each set are present in the input text, and the words of each set, if any, containing more than one word constitute a collocation;
means for assigning to each set a unique relative rank;
means for comparing each set in order of decreasing relative rank with the input text; and
means for selecting each set, all of whose words are present in the input text and none of whose words is present in a previously selected set of higher relative rank.
The deriving means may be arranged such that each of the words of the input text is present in at least one of the sets.
The selecting means may be arranged such that all of the words of the input text are present in the union of the selected sets.
The input text may comprise a grammatically complete sample of text delimited by punctuation, such as full stops, semi-colons or colons. Examples of such samples are phrases, clauses and sentences.
The words may comprise basic word forms and the apparatus may comprise a linguistic analyser for analysing an original text and providing the basic word forms.
The assigning means may comprise first means for assigning a priority value which increases with increasing number
Ijdens Jan Jaap
Poznanski Victor
Whitelock Peter John
Edouard Patrick N.
Renner Otto Boisselle & Sklar
Sharp Kabushiki Kaisha