Method and system for extracting pairs of multilingual...

Data processing: speech signal processing – linguistics – language – Linguistics – Multilingual or national language support

: 1998-05-15
: 2001-05-22
: Thomas, Joseph (Department: 2747)
: Data processing: speech signal processing, linguistics, language
: Linguistics
: Multilingual or national language support

: C707S793000
: Reexamination Certificate
: active
: 06236958
: ABSTRACT:

TECHNICAL FIELD
The present invention relates to a method and apparatus for creating bilingual terminology. Specifically, the invention relates to machine translation systems, terminology management systems, and any other systems which make use of multilingual terminology.
BACKGROUND ART
Identification of multilingual terminology can be seen as a process whereby a unit of text U1 (a word or sequence of words) in a source text T1 is put in correspondence with a related unit U2 in a target text T2 that is the translation of T1, such that U2 is the translation of U1. In the past, this process was a manual operation performed by human terminologists in order to build terminology databases. The automation of such a process is commonly referred to as alignment.
Alignment is usually performed through statistical methods. The article by Brown et al. (June 1991) titled “Aligning sentences in parallel corpora”, Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, Calif., discloses a method wherein association scores are computed between the text units in different languages, and then the optimal combination of multilingual text units based on these scores is selected.
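As a rough illustration of what an association score between text units can look like, the following sketch computes the Dice coefficient for word pairs that co-occur in aligned sentences. The Dice coefficient is chosen here only as a familiar example and is not stated in the text to be the score used by Brown et al.; the function name `dice_scores` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def dice_scores(aligned_pairs):
    """Return a Dice association score for every co-occurring (source, target) word pair.

    Illustrative only: the Dice coefficient is an assumed stand-in for the
    unspecified association score mentioned in the text.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in aligned_pairs:
        # Count each word at most once per sentence.
        src_words = set(src_sentence.split())
        tgt_words = set(tgt_sentence.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


# Toy aligned corpus, invented for this example.
corpus = [
    ("the dog barks", "le chien aboie"),
    ("the dog sleeps", "le chien dort"),
    ("the cat sleeps", "le chat dort"),
]
scores = dice_scores(corpus)
```

In this toy corpus the pair (dog, chien) scores higher than (dog, aboie), but the latter still receives a non-zero score simply because the two words co-occur.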
The drawbacks of such methods are that noise and silence are generated. Noise relates to multilingual associations which are found but are either wrong or not relevant, such as (dog,aboyer), where “aboyer” (to bark) is indeed related to dogs but is not a translation of the word “dog”, while silence relates to some otherwise relevant multilingual associations which are present in the text but not found.
Furthermore alignment can be processed at different levels of the text depending on the size of the text units that are to be aligned, e.g. it can be done at the level of files, paragraphs, sentences, phrases, multiword terms or even single words.
Known systems that perform alignment of words or multiword terms generally rely upon the existence of texts that are already aligned at sentence level.
UK Patent Application 2,279,164 discloses a system for processing a bilingual database wherein aligned corpora (i.e. collections of texts) are generated or received from an external source. Each corpus comprises a set of text portions aligned with corresponding portions of the other corpus so that aligned portions are nominally translations of one another in two natural languages. A statistical database is compiled. An evaluation module calculates correlation scores for pairs of words chosen one from each corpus. Given a pair of text portions (one in each language) the evaluation module combines word pair correlation scores to obtain an alignment score for the text portions. These alignment scores can be used to verify a translation and/or to modify the aligned corpora to remove improbable alignments. The invention employs statistical techniques, and in particular embodiments allows a probability-based score to be derived to measure the correlation of bilingual word pairs.
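The idea of combining word-pair correlation scores into a score for a pair of text portions can be sketched as follows. The combination rule used here (averaging, over the source words, the best score each one reaches against any target word) is an assumption chosen for simplicity; the text does not say how the cited application combines the scores. The name `alignment_score` and the score values are hypothetical.

```python
def alignment_score(src_portion, tgt_portion, word_scores):
    """Combine word-pair scores into one score for a pair of text portions.

    Assumed rule (not taken from the cited application): average, over the
    source words, of the best score each source word has with any target word.
    """
    src_words = src_portion.split()
    tgt_words = tgt_portion.split()
    if not src_words or not tgt_words:
        return 0.0
    best = [
        max(word_scores.get((s, t), 0.0) for t in tgt_words)
        for s in src_words
    ]
    return sum(best) / len(best)


# Hypothetical word-pair correlation scores.
word_scores = {("dog", "chien"): 0.9, ("black", "noir"): 0.8}

good = alignment_score("black dog", "chien noir", word_scores)
bad = alignment_score("black dog", "chat blanc", word_scores)
```

A low score such as `bad` is the kind of signal that could be used to flag an improbable alignment for removal.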
However, this technique is limited to the alignment of single words, one word in the source language and one word in the target language. It also suffers from the aforementioned problem of noise and silence related to the use of certain statistical scores.
Different methods have been proposed for alignment at the level of multiword terms. Gaussier et al., in “Some methods for the extraction of bilingual terminology”, Proceedings of New Methods in Language Processing, Manchester, 1994, describe several alignment methods based on a monolingual identification of the multiword terms (e.g. by identifying words that have a high likelihood of being associated together), followed by the identification of bilingual correspondences between these multiword terms through statistical scores. However, use of these methods is limited to terms composed of exactly two words in the source and target languages.
Some systems eliminating the aforementioned limitation use simple grammars in order to identify multiword terms in each language. For example, the paper of Gaussier et al. (1994) describes a system using linguistic patterns such as “adjective+noun” or “noun+preposition+noun” that characterize the structure of nominal terms in English and French.
While this addresses the previous problem, such systems are not fully efficient, and noise is generated because only a small portion of the noun phrases thus identified turn out to be terms, i.e. units which express a concept of the domain. For example, the expression “following page” could be extracted as a term by an “adjective+noun” grammar, although it is clearly a pervasive phrase in any technical text rather than a term.
Furthermore, some silence is also generated, since the scope of the linguistic patterns is limited to a certain number of expressions and will ignore certain structures that can yield terms, either because they are nonstandard word combinations (such as antenne parabolique de réception in French, where the adjective parabolique masks the original noun+preposition+noun, antenne de réception) or because the grammar fails to identify the part of speech of certain words due to their ambiguity (for example, microphone gain could be missed should the grammar consider gain to be a verb instead of a noun).
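The pattern-based extraction described above, together with its two failure modes, can be sketched in a few lines. This is a minimal illustration that assumes the input has already been part-of-speech tagged; the tag names (`ADJ`, `NOUN`, `PREP`, and so on), the function name `candidate_terms`, and the sample sentences are assumptions and do not come from the cited paper.

```python
# The two patterns quoted in the text, written with assumed tag names.
PATTERNS = [("ADJ", "NOUN"), ("NOUN", "PREP", "NOUN")]


def candidate_terms(tagged_tokens, patterns=PATTERNS):
    """Return every word sequence whose tag sequence equals one of the patterns."""
    words = [word for word, _ in tagged_tokens]
    tags = [tag for _, tag in tagged_tokens]
    found = []
    for start in range(len(tags)):
        for pattern in patterns:
            end = start + len(pattern)
            if tuple(tags[start:end]) == pattern:
                found.append(" ".join(words[start:end]))
    return found


# Hand-tagged toy inputs, invented for this example.
english = [
    ("the", "DET"), ("following", "ADJ"), ("page", "NOUN"),
    ("shows", "VERB"), ("the", "DET"),
    ("parabolic", "ADJ"), ("antenna", "NOUN"),
]
french = [
    ("antenne", "NOUN"), ("parabolique", "ADJ"),
    ("de", "PREP"), ("réception", "NOUN"),
]

english_terms = candidate_terms(english)
french_terms = candidate_terms(french)
```

Here `english_terms` contains both “following page” (noise) and “parabolic antenna”, while `french_terms` is empty because the inserted adjective breaks the noun+preposition+noun pattern (silence).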
Finally, in addition to the problems cited for each method, none of the previous systems allows for the extraction of one-to-many term alignments, such as the English term “baseband” corresponding to the French term “bande de base”.
Accordingly, it would be desirable to be able to provide a new system for automatically extracting multilingual terminology which eliminates the aforementioned problems.
SUMMARY OF THE INVENTION
It is an object of the invention to improve over existing bilingual word or term extraction methods and systems by taking into account different term lengths and by improving the accuracy of the extraction.
It is another object of the invention to provide a system and a method for automatically creating multilingual terminology. The above objects are achieved by employing a computer based terminology extraction system for creating bilingual terminology from a source text aligned with a target text.
The source text comprises at least one sequence of source terms, a term being composed of at least one word, and the target text comprises at least one sequence of target terms. The system comprises a term extractor means which operates on at least one pair extracted from the aligned texts and consisting of a source sequence aligned with a target sequence. The system is characterized in that the term extractor means comprises means for building a network wherein each node of the network comprises at least one term from the pair of aligned source/target sequences, and such that any source term is included within one source node, whereas each target term is included within one target node. The term extractor also comprises means for linking each node consisting of at least one source term with each node consisting of at least one target term. A term statistics means is coupled to the term extractor means for computing an association score for each pair of linked source/target terms, and a memory means is coupled to the term statistics means for storing the scored pairs of linked source/target terms that are considered relevant bilingual terms.
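One possible reading of the network described above is sketched below: each node holds a candidate term, every source node is linked to every target node, and each link carries an association score. How candidate terms are formed is not detailed in this passage, so the sketch assumes they are contiguous word sequences of up to `max_len` words; that assumption, the helper names `ngrams` and `build_network`, and the toy score are all hypothetical.

```python
from itertools import product


def ngrams(words, max_len):
    """All contiguous word sequences of 1..max_len words (assumed candidate terms)."""
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    ]


def build_network(src_sequence, tgt_sequence, score, max_len=3):
    """Build source nodes, target nodes and scored links for one aligned pair.

    `score` is any callable (source_term, target_term) -> float; it stands in
    for the term statistics component described in the text.
    """
    src_nodes = ngrams(src_sequence.split(), max_len)
    tgt_nodes = ngrams(tgt_sequence.split(), max_len)
    # Link each source node with each target node.
    links = {(s, t): score(s, t) for s, t in product(src_nodes, tgt_nodes)}
    return src_nodes, tgt_nodes, links


# Invented score table for illustration.
toy_scores = {("baseband", "bande de base"): 0.9}

src_nodes, tgt_nodes, links = build_network(
    "baseband signal",
    "signal en bande de base",
    score=lambda s, t: toy_scores.get((s, t), 0.0),
)
```

Because nodes may hold terms of different lengths, a one-word source term such as “baseband” can be linked to a three-word target term such as “bande de base”.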
In order to select some links which correspond to potential bilingual terms, in a preferred embodiment the system further comprises means for operating a flow optimization algorithm, such that each link between a source node and a target node is characterized by a capacity and a flow, and such that these means allow for the selection of preferred links having maximum flow at minimum cost.
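The selection of links by flow optimization can be illustrated with a generic minimum-cost maximum-flow routine (successive shortest paths with Bellman-Ford). This is a sketch under stated assumptions, not the patented procedure: every capacity is set to 1, the cost of a link is taken as 1 minus its association score, and the names `min_cost_max_flow` and `select_links` as well as the scores are hypothetical.

```python
def min_cost_max_flow(n, edges, s, t):
    """edges: list of (u, v, capacity, cost). Returns (flow, cost, flow per edge)."""
    graph = [[] for _ in range(n)]          # adjacency lists of edge indices
    to, cap, cost = [], [], []
    for u, v, c, w in edges:                # forward edge 2i, residual edge 2i+1
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)
    total_flow, total_cost = 0, 0.0
    while True:
        dist = [float("inf")] * n
        prev_edge = [-1] * n
        dist[s] = 0.0
        for _ in range(n - 1):              # Bellman-Ford on the residual graph
            updated = False
            for u in range(n):
                if dist[u] == float("inf"):
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]] - 1e-12:
                        dist[to[e]] = dist[u] + cost[e]
                        prev_edge[to[e]] = e
                        updated = True
            if not updated:
                break
        if dist[t] == float("inf"):         # no augmenting path left
            break
        push, v = float("inf"), t
        while v != s:                       # bottleneck capacity along the path
            e = prev_edge[v]
            push = min(push, cap[e])
            v = to[e ^ 1]
        v = t
        while v != s:                       # augment along the path
            e = prev_edge[v]
            cap[e] -= push
            cap[e ^ 1] += push
            v = to[e ^ 1]
        total_flow += push
        total_cost += push * dist[t]
    flows = [cap[2 * i + 1] for i in range(len(edges))]
    return total_flow, total_cost, flows


def select_links(src_terms, tgt_terms, scores):
    """Keep the source/target links that carry flow (assumed unit capacities)."""
    s, t = 0, 1 + len(src_terms) + len(tgt_terms)
    src_id = {term: 1 + i for i, term in enumerate(src_terms)}
    tgt_id = {term: 1 + len(src_terms) + j for j, term in enumerate(tgt_terms)}
    edges = [(s, src_id[a], 1, 0.0) for a in src_terms]
    edges += [(tgt_id[b], t, 1, 0.0) for b in tgt_terms]
    link_index = {}
    for (a, b), score in scores.items():
        link_index[len(edges)] = (a, b)
        edges.append((src_id[a], tgt_id[b], 1, 1.0 - score))  # assumed cost
    _, _, flows = min_cost_max_flow(t + 1, edges, s, t)
    return sorted(pair for i, pair in link_index.items() if flows[i] > 0)


# Invented association scores for illustration.
toy_scores = {
    ("dog", "chien"): 0.9, ("dog", "aboie"): 0.6,
    ("barks", "chien"): 0.3, ("barks", "aboie"): 0.8,
}
selected = select_links(["dog", "barks"], ["chien", "aboie"], toy_scores)
```

With these toy scores the routine keeps (dog, chien) and (barks, aboie) and discards the noisy link (dog, aboie), since the selected combination has the lowest total cost among those that carry the maximum flow.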
For a text having a plurality of sequences of source terms aligned with sequences of target terms, the method is applied successively to each pair of sequences, with the following steps:
a) reading a first pair of aligned sequences of source and target terms;
b) building a network wherein eac

Inventors: Gaussier, Eric; Lange, Jean-Marc
Corporate Assignee: International Business Machines Corporation
Attorney: Schecter, Manny W.
Examiner: Thomas, Joseph
