Method and configuration for forming classes for a language...

Data processing: speech signal processing – linguistics – language – Linguistics – Natural language


Details

Type: Reexamination Certificate
U.S. Classification: C704S257000
Status: active
Patent Number: 06640207

ABSTRACT:

BACKGROUND OF THE INVENTION
FIELD OF THE INVENTION
The invention relates to a method and a configuration for forming classes for a language model based on linguistic classes using a computer.
A method for speech recognition is known from the reference by G. Ruske, titled “Automatische Spracherkennung—Methoden der Klassifikation und Merkmalsextraktion” [“Automatic Speech Recognition—Methods of Classification and Feature Extraction”], Oldenbourg Verlag, Munich 1988, ISBN 3-486-20877-2, pages 1-10. It is customary in this case to specify the usability of a sequence of at least one word as a component of word recognition. A probability is one measure of this usability.
A statistical language model is known from the reference by L. Rabiner, B.-H. Juang, titled "Fundamentals of Speech Recognition", Prentice Hall 1993, pages 447-450. Thus, the probability P(W) of a word sequence W within the framework of speech recognition, preferably for large vocabularies, generally characterizes a (statistical) language model. The probability P(W) (known as the word sequence probability) is approximated by an N-gram language model P_N(W):
P_N(W) = \prod_{i=1}^{n} P(w_i | w_{i-1}, w_{i-2}, \ldots, w_{i-N+1}),   (0-1)
where
w_i denotes the ith word of the sequence W with (i=1 . . . n), and
n denotes the number of words w_i in the sequence W.
What are called bigrams result from equation (0-1) for N=2.
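Purely as an illustration of equation (0-1) for the bigram case N=2, the following Python sketch estimates bigram probabilities by maximum likelihood from a tokenized toy corpus and multiplies them to approximate P(W). The function names, the sentence-start symbol "<s>" and the omission of any smoothing are assumptions of this sketch and are not taken from the cited reference.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Maximum-likelihood bigram probabilities P(w_i | w_{i-1}) from
    tokenized sentences (illustrative sketch, no smoothing)."""
    history_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        padded = ["<s>"] + words  # assumed sentence-start symbol as history for the first word
        for prev, curr in zip(padded[:-1], padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, curr)] += 1
    return {bg: c / history_counts[bg[0]] for bg, c in bigram_counts.items()}

def sentence_probability(model, words):
    """Approximate P(W) as the product of bigram probabilities, i.e. equation (0-1) with N=2."""
    p = 1.0
    padded = ["<s>"] + words
    for prev, curr in zip(padded[:-1], padded[1:]):
        p *= model.get((prev, curr), 0.0)  # unseen bigrams simply get probability 0 here
    return p

# Invented toy corpus of one tokenized sentence, following the example sentence used later.
corpus = [["the", "bundestag", "is", "continuing", "its", "debate"]]
model = train_bigram_model(corpus)
print(sentence_probability(model, ["the", "bundestag", "is", "continuing", "its", "debate"]))
```

With a real text body the counts would be accumulated over many sentences, and unseen bigrams would be handled by a smoothing scheme rather than by assigning them probability zero.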
It is also known in speech recognition, preferably in the commercial field, to use an application field (domain) of limited vocabulary. Texts from various domains differ from one another not only with regard to their respective vocabulary, but also with regard to their respective syntax. Training a language model for a specific domain requires a correspondingly large set of texts (text material, text body), which is, however, only rarely present in practice, or can be obtained only with an immense outlay.
A linguistic lexicon is known from the reference by F. Guethner, P. Maier, titled "Das CISLEX-Wörterbuchsystem" ["The CISLEX Dictionary System"], CIS-Bericht [CIS report] 94-76-CIS, University of Munich, 1994. Such a lexicon is a collection, available on a computer, of as many words as possible of a language, whose linguistic properties can be looked up with the aid of a search program. For each word entry ("word full form"), it is possible to extract the linguistic features relevant to this word full form and the appropriate assignments, that is to say the linguistic values.
The use of linguistic classes is known from the reference by P. Witschel, titled “Constructing Linguistic Oriented Language Models for Large Vocabulary Speech Recognition”, 3rd EUROSPEECH 1993, pages 1199-1202. Words in a sentence can be assigned in different ways to linguistic features and linguistic values. Various linguistic features and the associated values are illustrated by way of example in Table 1 (further examples are specified in this reference).
TABLE 1
Examples of linguistic features and linguistic values

Linguistic feature     Linguistic values
Category               substantive, verb, adjective, article, pronoun, adverb, conjunction, preposition, etc.
Type of substantive    abstract, animal, as part of the body, concrete, human, spatial, material, as a measure, plant, temporal, etc.
Type of pronoun        demonstrative, indefinite, interrogative, possessive, etc.
On the basis of linguistic features
(f_1, \ldots, f_m)   (0-2)
and linguistic values
(v_{11} \ldots v_{1j}) \ldots (v_{m1} \ldots v_{mj})   (0-3)
each word is allocated at least one linguistic class, the following mapping rule F being applied:
(C_1, \ldots, C_k) = F((f_1, v_{11}, \ldots, v_{1j}) \ldots (f_m, v_{m1}, \ldots, v_{mj}))   (0-4)
where
f_m denotes a linguistic feature,
m denotes the number of linguistic features,
v_{m1} \ldots v_{mj} denote the linguistic values of the linguistic feature f_m,
j denotes the number of linguistic values,
C_i denotes the linguistic class with i=1 . . . k,
k denotes the number of linguistic classes, and
F denotes a mapping rule (classifier) of linguistic features and linguistic values onto linguistic classes.
Words whose linguistic properties are unknown or cannot otherwise be mapped are assigned to a separate, specific linguistic class in this case.
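Purely to illustrate how a mapping rule F as in equation (0-4) might look in practice, the following Python sketch assigns word full forms to linguistic classes via a small, invented lexicon of feature/value readings. The lexicon entries, the class numbering and the reserved class 0 for unmappable words are assumptions of this sketch, not part of the cited references.

```python
# Illustrative sketch of a mapping rule F: each word full form is looked up in a
# (hypothetical) lexicon and every feature/value reading is mapped onto a linguistic class.

LEXICON = {
    # word full form -> list of (category, number, gender, case) readings (invented excerpt)
    "der": [
        ("Article", "singular", "Masculine", "nominative"),
        ("Article", "singular", "Feminine", "genitive"),
        ("Article", "singular", "Feminine", "dative"),
        ("Article", "plural", "Feminine", "genitive"),
        ("Article", "plural", "Masculine", "genitive"),
        ("Article", "plural", "Neutral", "genitive"),
    ],
    "Bundestag": [
        ("Substantive", "singular", "Masculine", "nominative"),
        ("Substantive", "singular", "Masculine", "accusative"),
        ("Substantive", "singular", "Masculine", "dative"),
    ],
}

class_index = {}   # feature/value reading -> class number C_1 ... C_k
UNKNOWN_CLASS = 0  # dedicated class for words that cannot be mapped

def F(word):
    """Map a word full form onto the set of linguistic classes it belongs to."""
    readings = LEXICON.get(word)
    if readings is None:
        return {UNKNOWN_CLASS}
    classes = set()
    for reading in readings:
        if reading not in class_index:
            class_index[reading] = len(class_index) + 1  # assign the next free class number
        classes.add(class_index[reading])
    return classes

print(F("der"))        # six classes, e.g. {1, 2, 3, 4, 5, 6}
print(F("Bundestag"))  # three classes, e.g. {7, 8, 9}
print(F("Debatte"))    # not in the toy lexicon -> {0}, the class for unmappable words
```

The readings listed for "der" and "Bundestag" correspond to the classes C_1 to C_9 of Tables 2 and 3 below.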
An example is explained below for the purpose of illustrating the linguistic class, the linguistic feature, the linguistic value and the class bigram probability.
The starting point is a German sentence whose English rendering is:
"the Bundestag is continuing its debate".
The article "der" (in English: "the"), that is to say the first word, can be subdivided in German into six linguistic classes (from now on simply: classes), the classes being distinguished by number, gender and case. The following Table 2 illustrates this correlation:
TABLE 2
Classes C_i for the German word "der" (in English the word is "the")

C_i    Category    Number      Gender       Case
C_1    Article     singular    Masculine    nominative
C_2    Article     singular    Feminine     genitive
C_3    Article     singular    Feminine     dative
C_4    Article     plural      Feminine     genitive
C_5    Article     plural      Masculine    genitive
C_6    Article     plural      Neutral      genitive
Table 3 follows similarly for the German substantive “Bundestag” (second word in the above example sentence):
TABLE 3
Classes C_i for the word "Bundestag"

C_i    Category       Number      Gender       Case
C_7    Substantive    singular    Masculine    nominative
C_8    Substantive    singular    Masculine    accusative
C_9    Substantive    singular    Masculine    dative
It now follows in this example with regard to class bigrams, that is bigrams applied to linguistic classes, that the class C_1 followed by the class C_7 constitutes the correct combination of category, number, case and gender with reference to the example sentence. If frequencies of actually occurring class bigrams are determined from prescribed texts, it follows that the above class bigram C_1-C_7 occurs repeatedly, since this combination is present frequently in the German language, whereas other class bigrams, for example the combination C_2-C_8, are not permissible in the German language because of differing genders. The class bigram probabilities resulting from the frequencies found in this way are correspondingly high (in the event of frequent occurrence) or low (if not permissible).
The reference by S. Martin, J. Liermann, H. Ney, titled "Algorithms for Bigram and Trigram Word Clustering", Speech Communication 24, 1998, pages 19-37, proceeds from purely statistical properties when forming classes. Such classes have no specific linguistic properties which can be appropriately used in the language model.
The conventional formation of classes is performed manually by employing linguists who sort a language model in accordance with linguistic properties. Such a process is extremely lengthy and, because experts are required, very expensive.
SUMMARY OF THE INVENTION
It is accordingly an object of the invention to provide a method and a configuration for forming classes for a language model based on linguistic classes which overcome the above-mentioned disadvantages of the prior art methods and devices of this general type, permitting classes to be formed automatically and without the use of expert knowledge for a language model based on linguistic classes.
With the foregoing and other objects in view there is provided, in accordance with the invention, a method for forming classes for a language model based on linguistic classes using a computer. The method includes the steps of using a first mapping rule to determine N classes using a prescribed vocabulary with associated linguistic properties, determining K classes from the N classes by minimizing a language model entropy, and using the K classes to represent a second mapping rule for forming the classes of the language model based on the linguistic classes.
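A greatly simplified sketch of the entropy-minimizing reduction from N classes to K classes is given below: starting from the class sequence observed on a training text, pairs of classes are merged greedily, at each step choosing the merge that yields the lowest bigram language-model entropy, until only K classes remain. The greedy strategy, the entropy estimate and the toy data are assumptions of this sketch and are not claimed to reproduce the patented method.

```python
from collections import Counter
from itertools import combinations
import math

def bigram_entropy(class_sequence):
    """Empirical per-bigram entropy (in bits) of a maximum-likelihood
    class bigram model evaluated on its own training sequence."""
    bigrams = list(zip(class_sequence[:-1], class_sequence[1:]))
    bigram_counts = Counter(bigrams)
    history_counts = Counter(class_sequence[:-1])
    h = 0.0
    for (prev, curr), count in bigram_counts.items():
        p = count / history_counts[prev]            # P(curr | prev)
        h -= (count / len(bigrams)) * math.log2(p)  # average of -log2 P over the data
    return h

def merge(sequence, keep, drop):
    """Relabel every occurrence of class 'drop' as class 'keep'."""
    return [keep if c == drop else c for c in sequence]

def reduce_classes(class_sequence, k):
    """Greedily merge classes until only k remain, always taking the merge
    that gives the lowest bigram entropy on the training sequence."""
    sequence = list(class_sequence)
    classes = set(sequence)
    while len(classes) > k:
        best = None
        for a, b in combinations(sorted(classes), 2):
            h = bigram_entropy(merge(sequence, a, b))
            if best is None or h < best[0]:
                best = (h, a, b)
        _, a, b = best
        sequence = merge(sequence, a, b)
        classes = set(sequence)
    return classes, sequence

# Toy run: reduce five initial classes observed on a short class sequence to three.
initial = ["C1", "C7", "C3", "C7", "C1", "C9", "C4", "C7", "C1", "C9"]
remaining, relabelled = reduce_classes(initial, 3)
print(remaining)
```

In a realistic setting the N initial classes would come from the linguistic lexicon via the first mapping rule, the entropy would be computed on a large text body, and the exhaustive pairwise search would be replaced by a more efficient clustering procedure.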
In order to achieve the object, a method is specified for forming classes for a language model based on linguistic classes using a computer, in which a first mapping rule is used to determine a number N o
