Cross-lingual retrieval system and method that utilizes...

Data processing: speech signal processing – linguistics – language – Linguistics – Translation machine

Reexamination Certificate


Details

C704S008000, C707S793000, C707S793000

active

06321189

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a cross-lingual retrieval system for executing retrieval between a first language and a second language. In particular, it relates to a cross-lingual retrieval system that uses a set of pairs, each having a first language sentence and a second language sentence having the same meaning (hereinafter, each pair is referred to as pair data), to retrieve first language sentences according to a query written in the first language, and that then performs similar sentence retrieval of second language sentences which are similar to the second language sentences paired with the retrieved first language sentences.
2. Discussion of the Related Art
With the improvement of the performance of computers, development of electronic dictionaries and progress of technology in natural language processing, many machine translation techniques have been proposed.
However, a machine translation system with a translation capability of sufficient accuracy has not yet been realized.
[Related Art 1]
In a proposed system, a large number of sentence pairs, each having an original language (first language) sentence and a sentence translated from the original language into another language (second language), are prepared. A first language sentence is input to the system and similar sentences are retrieved from the first language sentences in the sentence pairs. Based on the retrieved first language sentences, the corresponding second language sentences are then retrieved from the sentence pairs. A user can refer to the second language sentences output from the system and can thereby improve the quality of translation from the first language sentence into the second language sentence.
For obtaining sentences similar to the first language sentence input to the system from the set of first language sentences in the sentence pairs, a method of determining a sentence of high similarity based on the number of words commonly included in the input sentence and sentences to be retrieved has been suggested. Also, Japanese Patent Application Laid-Open No. 9-50435 (1997) discloses a method of determining a first language sentence having a vector close to the vector corresponding to the input first language sentence as the sentence of high similarity based on the vector space model, one of the similar document retrieving methods.
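The word-count approach mentioned above can be sketched as follows. This is a minimal illustration, not the method of any cited system: the function names, the whitespace tokenizer, and the toy pair data (with placeholder strings standing in for second language sentences) are all assumptions made for the example.

```python
# Hypothetical sketch: rank first language sentences by the number of words
# shared with the query, then return the paired second language sentences.

PUNCTUATION = ".,!?\"'"

def tokenize(sentence):
    """Lowercase and split on whitespace (a stand-in for real word segmentation)."""
    words = (w.strip(PUNCTUATION).lower() for w in sentence.split())
    return [w for w in words if w]

def overlap_score(query, sentence):
    """Number of distinct words that appear in both the query and the sentence."""
    return len(set(tokenize(query)) & set(tokenize(sentence)))

def retrieve_translations(query, pairs, top_k=3):
    """Return (second language sentence, score) for the best-matching pairs."""
    scored = [(overlap_score(query, first), second) for first, second in pairs]
    scored = [item for item in scored if item[0] > 0]  # drop pairs with no shared word
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(second, score) for score, second in scored[:top_k]]

# Toy pair data (assumed): first language sentence, placeholder for its translation.
PAIRS = [
    ("The pipe is gradually tapered.", "<second-language sentence 1>"),
    ("The pipe is straight.", "<second-language sentence 2>"),
    ("The tip tapers to a point.", "<second-language sentence 3>"),
]

results = retrieve_translations("It is gradually tapered.", PAIRS)
```

In this toy data the third pair shares no word with the query, so it is never returned even though its meaning is close to that of the query.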
A method of obtaining a sentence having high similarity to an input sentence according to the vector space model, described in “Information Retrieval (a Japanese translation of “New Horizons in Information Retrieval”)”, David Ellis, 1990, pp. 53-57, is now explained.
In the vector space model, each sentence that is an object of retrieval, as well as the sentence input as a query, is represented as a vector. Suppose that there are N sentences to be the object of retrieval and M kinds of words (W1, W2, . . . , WM) in the N sentences. Then the vectors corresponding to the N sentences (S1, S2, . . . , SN) are defined as M-dimensional vectors as shown in the following expression (1). If a word Wj exists in a sentence Si, Tij is 1. If the word Wj does not exist in the sentence Si, Tij is 0.
$$
\begin{aligned}
S_1 &= (T_{11}, T_{12}, \ldots, T_{1M}), \\
S_2 &= (T_{21}, T_{22}, \ldots, T_{2M}), \\
&\;\;\vdots \\
S_N &= (T_{N1}, T_{N2}, \ldots, T_{NM})
\end{aligned}
\tag{1}
$$
In a similar way, the vector corresponding to a query Q is defined as shown in the following expression (2). If a word Wi exists in the query Q, Ti is 1. If the word Wi does not exist in the query Q, Ti is 0. Here, it is assumed that each element of the vector takes 1 or 0, namely, a binary value. However, it may be possible to allot a real numeric value to each element in accordance with a degree of importance of the word in the sentence.
$$
Q = (T_1, T_2, \ldots, T_M) \tag{2}
$$
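As a rough illustration of expressions (1) and (2), the binary vectors can be built as in the sketch below. The helper names, the whitespace tokenization, and the sample sentences are assumptions for the example and are not taken from the cited reference.

```python
# Hypothetical sketch: build the M-dimensional binary vectors of
# expressions (1) and (2) from a small set of sentences.

def build_vocabulary(sentences):
    """Collect the M distinct words W1..WM in order of first appearance."""
    vocabulary, seen = [], set()
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary

def binary_vector(text, vocabulary):
    """Element j is 1 if word Wj occurs in the text, otherwise 0."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

# Toy data (assumed): N = 2 sentences to be searched.
sentences = ["the pipe is tapered", "the pipe is straight"]
vocabulary = build_vocabulary(sentences)  # M = 5 words

sentence_vectors = [binary_vector(s, vocabulary) for s in sentences]  # S1, S2
query_vector = binary_vector("tapered pipe", vocabulary)              # Q
```

A real-valued weighting, as the text notes, would replace the 1 and 0 entries with a measure of each word's importance.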
In the vector space model, a sentence Si corresponding to the vector Si which has a close distance to the vector Q is determined to be the sentence having a high similarity to the query Q. Sentences are output in order of descending degree of importance as a result of retrieval. The distance D (Q, Si) between the vector Q and the vector Si is calculated in accordance with the following expression (3). Here, an expression (V, U) represents an inner product of a vector V and a vector U.
$$
D(Q, S_i) = \frac{(Q, S_i)}{\left((Q, Q)\,(S_i, S_i)\right)^{1/2}} \tag{3}
$$
In the vector space model, ordinarily, the words W1, W2, . . . , WM used for calculation are limited to content words. Function words such as postpositional particles (a part of speech in Japanese grammar) and auxiliary verbs are not taken into account. Moreover, a general word such as the verb “be” in English (namely, a stop word) is not taken into account even though it is a content word.
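The ranking by expression (3) can be sketched as below. This is an illustrative sketch under stated assumptions: the stop word list, the function names, and the sample sentences are hypothetical, and query words that do not occur in any searched sentence are simply ignored because they have no dimension in the vocabulary.

```python
import math

# Hypothetical stop word list; a real system would use a curated one.
STOP_WORDS = {"a", "be", "is", "it", "the"}

def content_words(text):
    """Lowercased words of the text with punctuation and stop words removed."""
    words = {w.strip(".,").lower() for w in text.split()}
    return words - STOP_WORDS - {""}

def to_vector(words, vocabulary):
    """Binary vector: 1 where the vocabulary word occurs, else 0."""
    return [1 if w in words else 0 for w in vocabulary]

def inner(u, v):
    """Inner product (U, V)."""
    return sum(a * b for a, b in zip(u, v))

def similarity(q, s):
    """Expression (3): (Q, S) / ((Q, Q)(S, S))^(1/2); 0.0 for a zero vector."""
    denom = math.sqrt(inner(q, q) * inner(s, s))
    return inner(q, s) / denom if denom else 0.0

def rank(query, sentences):
    """Return (sentence, D) pairs sorted by descending D."""
    word_sets = [content_words(s) for s in sentences]
    vocabulary = sorted(set().union(*word_sets))
    q = to_vector(content_words(query), vocabulary)
    scored = [(s, similarity(q, to_vector(ws, vocabulary)))
              for s, ws in zip(sentences, word_sets)]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy data (assumed).
SENTENCES = [
    "The pipe is gradually tapered.",
    "The pipe is straight.",
    "It tapers down to a point.",
]
ranked = rank("It is gradually tapered.", SENTENCES)
```

With these inputs the first sentence shares both content words of the query and scores 2/√6, while the other two sentences share none and score 0.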
[Related Art 2]
To obtain the same effect as the above-described [Related Art 1], a method of improving the translation quality has been suggested. In the method, each word in a query written in a first language is automatically converted into a word or a phrase of a second language by using a dictionary, and then a corresponding sentence(s) is retrieved from a set of the second language sentences utilizing the set of converted words or phrases of the second language. Thereby the user can refer to the corresponding second language sentence(s).
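A dictionary-based retrieval of this kind might look like the following sketch. Everything here is assumed for illustration: the tiny dictionary (French is used as a stand-in first language), the function names, and the sample sentences. The sketch also handles only single-word dictionary entries, whereas the text allows phrases.

```python
# Hypothetical sketch: convert each query word with a bilingual dictionary,
# then retrieve second language sentences containing the converted words.

# Toy dictionary (assumed): first language word -> candidate second language words.
DICTIONARY = {
    "chat": ["cat"],
    "noir": ["black", "dark"],
}

def convert_query(query, dictionary):
    """Union of all second language candidates for the words in the query."""
    converted = set()
    for word in query.lower().split():
        converted.update(dictionary.get(word, []))
    return converted

def retrieve(query, second_language_sentences, dictionary):
    """Return (sentence, number of matched converted words), best first."""
    targets = convert_query(query, dictionary)
    scored = []
    for sentence in second_language_sentences:
        score = len(targets & set(sentence.lower().split()))
        if score > 0:
            scored.append((sentence, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)

SECOND_LANGUAGE_SENTENCES = [
    "the black cat sleeps",
    "a dark room",
    "the dog barks",
]
hits = retrieve("chat noir", SECOND_LANGUAGE_SENTENCES, DICTIONARY)
```

Because every dictionary candidate is used, the candidate "dark" also pulls in a sentence unrelated to the query, which hints at how word-by-word conversion can introduce noise.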
However, the above-described [Related Art 1] and [Related Art 2] have problems as follows.
The above-described [Related Art 1] obtains a similar first language sentence(s) based only on the words contained in the query of the first language. Therefore, although a second language sentence adequate as a translation of the query of the first language is present in the set of second language sentences, it cannot be obtained as a result of the retrieval if the expression of the corresponding first language sentence in the sentence pairs differs from that of the query. The [Related Art 1] is effective only if the sentence pairs contain a sentence composed of a set of words which are the same as those contained in the query of the first language.
The inadequacy becomes more pronounced as the number of the words contained in the query becomes smaller. Consequently, in the case where a document including a large number of sentences is input, non-zero elements of the corresponding document vector are increased (dimension of the vector is substantially raised), and accordingly, a highly reliable retrieving result is available. However, in most cases, actual translation data consists of short sentences, and therefore it is practically impossible to obtain adequate translations by the [Related Art 1].
As an example, consider a case in which the first language is Japanese, the second language is English, and the Japanese sentence “[...]” (having much the same sense as “It is gradually tapered.”) is input. The content words extracted from the sentence are “[...]” (gradually) and “[...]” (tapered). The verb “[...]” (roughly “be”, though not an exact equivalent) is a stop word and is excluded from the following explanation.
According to [Related Art 1], Japanese sentences containing both “[...]” and “[...]” are obtained as sentences similar to the above query. However, it is impossible to obtain sentences that would be acceptable as adequate translations because they have the same meaning as the query but use different expressions (different words), such as the following examples (a) and (b).
“It tapers down to a point”.  (a)
“It tapers into a sharp point”.  (b)
The above-described [Related Art 2] obtains second language sentences to be referred to by converting each word in the query of the first language into a word or phrase of the second language by utilizing the dictionary.
However, a word of the first language can be expressed by a variety of words or phrases of the second language. Further, selection of the second language word adequate to substitute for the first language word depends on the context of the query of the first language and it is practically impossible to determine the words to be selected for subs
