Reexamination Certificate
2000-01-17
2003-06-17
Banks-Harold, Marsha D. (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S010000
active
06581034
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates in general to a phonetic distance calculation method for similarity comparison between phonetic transcriptions of foreign words, and more particularly to an improved phonetic distance calculation method which is capable of applying an edit distance measure, generally used for word spelling comparison, to phonetic transcriptions of foreign words, so that the phonetic transcriptions can effectively be retrieved in a document retrieval system.
2. Description of the Prior Art
With advances in computer technology, documents are now commonly converted into data and stored in computers rather than kept on paper, so that document storage space can be used efficiently.
To this end, document retrieval systems have been proposed for rapidly retrieving a desired document from those stored. Such a system uses keywords to present all documents with similar contents, which increases convenience for the user.
On the other hand, as exchanges with foreign countries have increased, phonetic transcriptions of many foreign words have come into use in Korean documents. Most of these phonetic transcriptions are proper nouns or technical terms originally expressed in English. In scientific and technological fields in particular, there is often no choice but to use phonetic transcriptions, because no Korean translation exists for the English technical terms. However, the phonetic transcription of a given foreign word varies widely from one writer to another, which makes it difficult to retrieve Korean document texts on the basis of such phonetic transcriptions.
For example, three Korean phonetic transcriptions such as “[Hangul image 1]”, “[Hangul image 2]” and “[Hangul image 3]” may be used together for the English technical term “digital”. Among these Korean phonetic transcriptions, “[Hangul image 4]” has been proposed as a standard, but “[Hangul image 2]” has actually been used more frequently and, occasionally, “[Hangul image 3]” has been used according to personal preference.
For this reason, documents containing variant phonetic transcriptions may often fail to be retrieved unless the diversity of the phonetic transcriptions is considered in the document retrieval.
In order to overcome such a problem, there has been proposed a method for grouping various Korean phonetic transcriptions derived from the same foreign word into an equivalence class and automatically expanding them upon document retrieval [see: Jeong, K. S., Kwon, Y. H., and Myaeng, S. H., “The Effect of a Proper Handling of Foreign and English Words in Retrieving Korean Text”, In Proceedings of the 2nd International Workshop on Information Retrieval with Asian Languages (IRAL '97), 1997].
The creation of such a phonetic transcription equivalence class requires a method for determining whether two given phonetic transcriptions are derived from the same foreign word, namely, for comparing a similarity between the two phonetic transcriptions.
The above phonetic transcription similarity comparison method is also basically necessary for an approximate search of a database of phonetic transcriptions (words of foreign origin). For example, the similarity comparison method may be usefully applied to searching for firm names or trademarks that are words of foreign origin.
Unfortunately, no method has been developed until now specifically for similarity comparison between Korean phonetic transcriptions. Instead, an edit distance measure (see: Hall, P. and Dowling, G., “Approximate string matching”, Computing Surveys, Vol. 12, No. 4, pp. 381-402, 1980) or an N-gram metric (see: Zamora, E., Pollock, J., and Zamora, A., “The use of trigram analysis for spelling error detection”, Information Processing & Management, Vol. 17, No. 6, pp. 305-316, 1981) has merely been utilized as an approach to the similarity comparison. Both the edit distance measure and the N-gram metric are character string similarity comparison methods which are independently applicable to words.
A character string similarity comparison method detects whether two given character strings are similar in spelling. Because Korean words are spelled using phonetic symbols, words that are similar in spelling are likely to be pronounced similarly. In this connection, the character string similarity comparison method may be utilized relatively effectively for similarity comparison between Korean phonetic transcriptions.
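As a rough illustration of an N-gram style spelling comparison, the following Python sketch scores two strings by the overlap of their character n-grams. The Dice coefficient, the `#` padding character and the names `char_ngrams` and `ngram_similarity` are assumptions made for this example; the text above does not specify which N-gram metric is meant.

```python
from collections import Counter


def char_ngrams(word: str, n: int = 2) -> list[str]:
    """Return the overlapping character n-grams of a word, padded at both ends."""
    padded = "#" * (n - 1) + word + "#" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over n-gram multisets (an assumed choice of metric)."""
    grams_a = Counter(char_ngrams(a, n))
    grams_b = Counter(char_ngrams(b, n))
    # Multiset intersection keeps the smaller count of each shared n-gram.
    shared = sum((grams_a & grams_b).values())
    total = sum(grams_a.values()) + sum(grams_b.values())
    return 2 * shared / total if total else 1.0
```

Identical strings score 1.0 and strings with no n-gram in common score 0.0, so a retrieval system could treat two spellings as candidates for the same word when the score exceeds some chosen threshold.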
Now, a description will be given of a conventional method for similarity comparison between phonetic transcriptions of foreign words.
Fred J. Damerau has proposed a method which assumes that typing errors result from only four cases: (1) insertion of one character, (2) deletion of one character, (3) substitution of one character with a different one and (4) transposition of two adjacent characters, and which measures the similarity between two given words on the basis of the minimum number of typing errors between the two words (see: Damerau, F., “A technique for computer detection and correction of spelling errors”, Communications of the ACM, 7, pp. 171-176, 1964). This metric is typically called a Damerau-Levenshtein metric or an edit distance measure. The minimum number of typing errors between two words s and t can be calculated on the basis of the following recurrence relation (see: Wagner, R. A., “Order-n correction for regular languages”, Communications of the ACM, vol. 17, No. 5, pp. 265-268, 1974):
$$
f(0, 0) = 0
$$

$$
f(i, j) = \min
\begin{cases}
f(i-1, j) + 1 & \text{(insertion)} \\
f(i, j-1) + 1 & \text{(deletion)} \\
f(i-1, j-1) + d(s_i, t_j) & \text{(substitution)} \\
f(i-2, j-2) + d(s_{i-1}, t_j) + d(s_i, t_{j-1}) + 1 & \text{(transposition)}
\end{cases}
$$
Here, the function d is a distance between two characters and can simply be expressed by the following equation:
$$
d(s_i, t_j) =
\begin{cases}
0 & \text{if } s_i = t_j \\
1 & \text{if } s_i \neq t_j
\end{cases}
$$
It should be noted that the distance function d may be expressed by a more complex equation according to a desired purpose.
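The recurrence can be evaluated bottom-up with a table, as in the following minimal Python sketch. It is an illustration under stated assumptions rather than the patented method: the names `unit_cost` and `edit_distance` are hypothetical, terms whose indices would fall below zero are simply skipped (which gives f(i, 0) = i and f(0, j) = j), and the `allow_transposition` flag is an addition that lets the fourth case be switched off.

```python
from typing import Callable


def unit_cost(a: str, b: str) -> int:
    """The simple character distance d: 0 for equal characters, 1 otherwise."""
    return 0 if a == b else 1


def edit_distance(s: str, t: str,
                  d: Callable[[str, str], int] = unit_cost,
                  allow_transposition: bool = True) -> int:
    """Evaluate f(len(s), len(t)) from the recurrence, filling a table bottom-up."""
    m, n = len(s), len(t)
    # f[i][j] compares the first i characters of s with the first j of t.
    f = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue  # base case f(0, 0) = 0
            candidates = []
            if i >= 1:
                candidates.append(f[i - 1][j] + 1)
            if j >= 1:
                candidates.append(f[i][j - 1] + 1)
            if i >= 1 and j >= 1:
                # Substitution: s_i is s[i - 1] and t_j is t[j - 1] in 0-based indexing.
                candidates.append(f[i - 1][j - 1] + d(s[i - 1], t[j - 1]))
            if allow_transposition and i >= 2 and j >= 2:
                # Transposition: d(s_{i-1}, t_j) + d(s_i, t_{j-1}) + 1.
                candidates.append(f[i - 2][j - 2]
                                  + d(s[i - 2], t[j - 1])
                                  + d(s[i - 1], t[j - 2]) + 1)
            f[i][j] = min(candidates)
    return f[m][n]
```

Passing a different function as `d` is one way to plug in a more complex character distance; with the default `unit_cost`, swapping two adjacent characters costs 1 when transposition is allowed and 2 when it is not.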
In the case where the above edit distance measure is applied to similarity comparison between Korean phonetic transcriptions, it is effective to consider only the insertion, deletion and substitution because the transposition is valid with respect to only the typing error cases. It is further effective to perform the similarity comparison after the removal of an initial phoneme ‘[Hangul image omitted]’ in the Korean phonetic transcriptions because it has no phonetic value.
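A minimal Python sketch of this restricted comparison follows. It rests on several assumptions that the text does not confirm: that comparison is done on individual jamo obtained by Unicode canonical decomposition (NFD), that the initial phoneme without phonetic value is the silent initial ieung (U+110B), and the hypothetical names `to_phonemes` and `phoneme_edit_distance`.

```python
import unicodedata

# Assumed: the silent initial consonant is the conjoining initial ieung.
# The final ieung (U+11BC) is a different code point and is kept.
SILENT_INITIAL = "\u110b"


def to_phonemes(word: str) -> list[str]:
    """Split Hangul syllables into conjoining jamo and drop the silent initial."""
    jamo = unicodedata.normalize("NFD", word)
    return [ch for ch in jamo if ch != SILENT_INITIAL]


def phoneme_edit_distance(a: str, b: str) -> int:
    """Edit distance over jamo using insertion, deletion and substitution only."""
    s, t = to_phonemes(a), to_phonemes(b)
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, sc in enumerate(s, start=1):
        current = [i]
        for j, tc in enumerate(t, start=1):
            current.append(min(previous[j] + 1,               # skip a jamo of s
                               current[j - 1] + 1,            # skip a jamo of t
                               previous[j - 1] + (sc != tc)))  # substitution
        previous = current
    return previous[-1]
```

For instance, `phoneme_edit_distance("디지털", "디지탈")` compares two spellings chosen here purely for illustration; they differ in a single vowel jamo, so the distance is 1.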
The above edit distance measure or N-gram metric is a word spelling comparison method which is independently applicable to words and can relatively effectively be utilized for similarity comparison between Korean phonetic transcriptions. However, this edit distance measure or N-gram metric is not the best for pronunciation similarity comparison. For example, Korean phonetic transcriptions “[Hangul image omitted]” and “[Hangul image omitted]” are very similar in spelling, but come from different English technical terms “digital” and “digit”, respectively. For this reason, the conventional word spelling comparison method has difficulty in performing similarity comparison between such Korean phonetic transcriptions.
Accordingly, the phonological structure of the foreign language of origin should be considered for effective similarity comparison between Korean phonetic transcriptions. For example, a Korean phonetic transcription “[Hangul image omitted]” of the English word “robot” is closer in English-style pronunciation to a Korean phonetic transcription “[Hangul image omitted]” that differs in two character elements than to a Korean phonetic transcription “[Hangul image omitted]” that differs in only one character element. This results from the fact that a final phoneme /t/ of the English word is usually changed to a Korean phoneme /[Hangul image omitted]*/ or “[Hangul image omitted]”, where the symbol * indicates that /[Hangul image omitted]/ is a final consonant.
Consequently, the above-mentioned conventional method is effective in performing the word spelling comp
Choi Key-Sun
Kang Byung-ju
Bachman & LaPointe P.C.
Banks-Harold Marsha D.
Harper V. Paul
Korea Advanced Institute of Science and Technology