Patent
1995-02-16
1999-02-02
Weinhardt, Robert A.
Data processing: speech signal processing, linguistics, language
Linguistics
704/2, G06F 17/28
active
058678115
DESCRIPTION:
BRIEF SUMMARY
The invention relates to methods and apparatuses for processing a bilingual or multi-lingual database comprising aligned corpora, and to methods and apparatuses for automatic translation using such databases.
Aligned corpora are two (or more) bodies of text divided into aligned portions, such that each portion in a first language corpus is mapped onto a corresponding portion in a second language corpus. Each portion may typically comprise a single sentence or phrase, but can also comprise one word or perhaps a whole paragraph. The aligned corpora can be used as a database in automated translation systems in which, given a word, phrase or sentence in a first language, a corresponding translation in a second language may be obtained automatically, provided it matches or in some way resembles a portion already present in the database. This principle may be extended, such that more than two corpora are aligned to allow translation into several languages.
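The lookup described above can be sketched as follows. This is a minimal illustration, not the patented method: the sample sentence pairs, the `translate` function, and the 0.8 similarity threshold are all assumptions, and "resembles" is approximated here with the character-level similarity ratio from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher

# Hypothetical aligned corpora: each tuple is one aligned portion
# (first-language text, second-language text).
ALIGNED = [
    ("Press the power button.", "Appuyez sur le bouton d'alimentation."),
    ("Open the paper tray.", "Ouvrez le bac à papier."),
    ("Close the cover.", "Fermez le capot."),
]


def translate(query, pairs=ALIGNED, threshold=0.8):
    """Return the stored translation of the source portion most similar
    to `query`, or None if no portion reaches the (assumed) threshold."""
    best_score, best_target = 0.0, None
    for source, target in pairs:
        # ratio() is 1.0 for identical strings and falls towards 0.0
        # as the strings share fewer characters in sequence.
        score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if score > best_score:
            best_score, best_target = score, target
    return best_target if best_score >= threshold else None
```

An exact match such as `translate("Press the power button.")` returns the stored French portion, a near match such as `translate("Close the cover")` still finds its counterpart, and an input unlike anything in the database returns `None`.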
In the 1950s and 60s it was commonly believed that an all-purpose translating system would be developed in the near future. It was later realised, however, that such a system lay much further in the future, and might never be implemented, because of the vast quantity of background information and "intelligence" required. It was also appreciated, however, that aligned corpora could be used to automate translation within small, specialised fields. One reason is that "problem words", which have many different meanings in general use, tend to have a much more limited range of meanings within the confines of a specialist field of activity.
In creating such specialised translation systems, however, the problem remains of generating high-quality aligned corpora in the first place, particularly since the database generated for one field of activity should ideally be based upon a large volume of previously translated documents, and would probably not be suitable for application in another field. Users working in each field would therefore have to generate their own databases, and this has tended to discourage the use of such automated systems, so that reliance continues to be placed upon human translators. U.S. Pat. No. 5,140,522, for example, describes a machine translation system in which a database of previously translated sentences is built up during use, but does not disclose any method of obtaining such a database without the initial effort of a human translator.
To address the above problem, a copending United Kingdom patent application now published as GB-A-2272091 describes an automated system for generating aligned corpora. The contents of GB-A-2272091 are incorporated herein by reference. The automated system responds to the formatting codes that are inserted by word processing apparatus in most documents, for example to indicate a new chapter heading or new entries in a table. For many types of text, including for example instruction manuals for electronic apparatus, the portions of text between such formatting codes are small enough to be used as the aligned portions in aligned corpora. Thus, the system described in the prior application is relatively simple, in that it is not required to judge the meanings of the words, nor parse the text into sentences or smaller units. On the other hand, for a variety of reasons, the resulting alignment will not be perfect, such that the database includes "noise" in the form of incorrect alignments.
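The idea of splitting text at formatting codes can be illustrated roughly as below. This is a sketch under stated assumptions, not the system of GB-A-2272091: the angle-bracket codes such as `<H1>` and `<ROW>`, the functions `split_at_codes` and `align`, and the position-by-position pairing rule are all hypothetical. No word meanings are consulted and no sentence parsing is done, which is also why a missing or extra code in one document would produce the kind of misalignment "noise" mentioned above.

```python
import re

# Hypothetical markup: codes such as <H1>, <P> or <ROW> stand in for
# the formatting codes a word processor would insert.
FORMAT_CODE = re.compile(r"<[A-Z0-9]+>")


def split_at_codes(document):
    """Return a list of (code, text) portions, one per formatting code."""
    codes = FORMAT_CODE.findall(document)
    # split() yields the text before the first code as element 0; drop it
    # so that each remaining text follows the code at the same index.
    texts = FORMAT_CODE.split(document)[1:]
    return [(code, text.strip()) for code, text in zip(codes, texts)]


def align(doc_a, doc_b):
    """Pair portions position by position, keeping a pair only when both
    sides were introduced by the same code and both are non-empty."""
    pairs = []
    for (code_a, text_a), (code_b, text_b) in zip(
        split_at_codes(doc_a), split_at_codes(doc_b)
    ):
        if code_a == code_b and text_a and text_b:
            pairs.append((text_a, text_b))
    return pairs
```

For example, aligning `"<H1>Safety<P>Unplug the unit."` with `"<H1>Sicherheit<P>Netzstecker ziehen."` yields the pairs `("Safety", "Sicherheit")` and `("Unplug the unit.", "Netzstecker ziehen.")`.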
Alternative methods of automating the generation of aligned corpora have been described, for example, by W A Gale and K W Church in "A Program for Aligning Sentences in Bilingual Corpora", and by P F Brown et al in "Aligning Sentences in Parallel Corpora", both in the Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, Calif. The system proposed by Brown et al is described more fully in European patent application EP-A-0525470. In these systems, the portions used correspond to sentences, and alignment is performed by comparing the lengths of sentences, either
REFERENCES:
patent: 5140522 (1992-08-01), Ito et al.
patent: 5510981 (1996-04-01), Berger et al.
patent: 5541836 (1996-07-01), Church et al.
"Probabilistic Method of Aligning Sentences with their Translations using Word Cognates"; IBM Technical Disclosure Bulletin, vol. 37, No. 02B, Feb. 1994; p. 509.
"A Program for Aligning Sentences in Bilingual Corpora" by W.A. Gale et al.; Computational Linguistics, vol. 19, No. 1, Mar. 1993, Cambridge, MA; pp. 75-102.
"La comparaison de grands corpus multilingues comme instrument lexicographique: exemple d'un dictionnaire hebreu-anglais/anglais-hebreu etabli semi-automatique" by J. Bajard; Sprache und Datenverarbeitung, vol. 12, No. 2, 1988, West Germany; pp. 69-73.
"Aligning Sentences in Parallel Corpora" by P.F. Brown et al.; Proceedings of the 29th Annual Mtg. of the Assn. for Computational Linguistics, Berkeley, Jun. 18, 1991, N.Y.; pp. 169-176.
Canon Europa N.V.
Canon Research Centre Europe Ltd.