Vocabulary and/or language model training

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate


Details

Classifications: C707S793000, C704S270100, C704S251000

Status: active

Patent number: 06430551

BACKGROUND OF THE INVENTION
The invention relates to a method for creating a vocabulary and/or statistical language model from a textual training corpus for subsequent use by a pattern recognition system.
The invention further relates to a system for creating a vocabulary and/or a statistical language model for subsequent use by a pattern recognition system; the system comprising means for creating the vocabulary and/or statistical language model from a textual training corpus.
The invention also relates to a pattern recognition system for recognising a time-sequential input pattern using a vocabulary and/or statistical language model; the pattern recognition system comprising the system for creating a vocabulary and/or statistical language model from a textual training corpus.
Pattern recognition systems, such as large vocabulary continuous speech recognition systems or handwriting recognition systems, typically use a vocabulary to recognise words and a language model to improve the basic recognition result.
FIG. 1 illustrates a typical large vocabulary continuous speech recognition system 100 [refer L. Rabiner, B-H. Juang, “Fundamentals of speech recognition”, Prentice Hall 1993, pages 434 to 454]. The system 100 comprises a spectral analysis subsystem 110 and a unit matching subsystem 120. In the spectral analysis subsystem 110 the speech input signal (SIS) is spectrally and/or temporally analysed to calculate a representative vector of features (observation vector, OV). Typically, the speech signal is digitised (e.g. sampled at a rate of 6.67 kHz) and pre-processed, for instance by applying pre-emphasis. Consecutive samples are grouped (blocked) into frames corresponding to, for instance, 32 msec. of speech signal. Successive frames partially overlap, for instance by 16 msec. Often the Linear Predictive Coding (LPC) spectral analysis method is used to calculate for each frame a representative vector of features (observation vector). The feature vector may, for instance, have 24, 32 or 63 components. In the unit matching subsystem 120, the observation vectors are matched against an inventory of speech recognition units. A speech recognition unit is represented by a sequence of acoustic references. Various forms of speech recognition units may be used. As an example, a whole word or even a group of words may be represented by one speech recognition unit. A word model (WM) provides for each word of a given vocabulary a transcription in a sequence of acoustic references. For systems wherein a whole word is represented by a speech recognition unit, a direct relationship exists between the word model and the speech recognition unit. Other systems, in particular large vocabulary systems, may use for the speech recognition unit linguistically based sub-word units, such as phones, diphones or syllables, as well as derivative units, such as fenenes and fenones. For such systems, a word model is given by a lexicon 134, describing the sequence of sub-word units relating to a word of the vocabulary, and the sub-word models 132, describing sequences of acoustic references of the involved speech recognition unit. A word model composer 136 composes the word model based on the sub-word model 132 and the lexicon 134.
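As an illustration of the front end just described, the following minimal Python sketch blocks a digitised signal into overlapping frames and computes one LPC observation vector per frame. The sampling rate, frame length and overlap follow the example figures in the text; the function names, the LPC order and the direct solve of the normal equations are assumptions of the sketch, not details of the patent.

```python
import numpy as np

def block_into_frames(signal, rate=6670, frame_ms=32, shift_ms=16, pre_emph=0.97):
    """Pre-emphasise the digitised speech signal and group consecutive
    samples into partially overlapping frames."""
    emphasised = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    frame_len = int(rate * frame_ms / 1000)   # ~213 samples per 32 msec frame
    shift = int(rate * shift_ms / 1000)       # ~106 samples -> 16 msec overlap
    count = 1 + max(0, (len(emphasised) - frame_len) // shift)
    return np.stack([emphasised[i * shift : i * shift + frame_len]
                     for i in range(count)])

def lpc_observation_vector(frame, order=12):
    """One observation vector per frame: LPC coefficients obtained by
    solving the autocorrelation normal equations (Levinson-Durbin is the
    usual, faster choice; a direct solve keeps the sketch short)."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]
    toeplitz = np.array([[r[abs(i - j)] for j in range(order)]
                         for i in range(order)])
    return np.linalg.solve(toeplitz + 1e-8 * np.eye(order), r[1 : order + 1])
```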
FIG. 2A illustrates a word model 200 for a system based on whole-word speech recognition units, where the speech recognition unit of the shown word is modelled using a sequence of ten acoustic references (201 to 210). FIG. 2B illustrates a word model 220 for a system based on sub-word units, where the shown word is modelled by a sequence of three sub-word models (250, 260 and 270), each with a sequence of four acoustic references (251, 252, 253, 254; 261 to 264; 271 to 274). The word models shown in FIG. 2 are based on Hidden Markov Models (HMMs), which are widely used to stochastically model speech and handwriting signals. Using this model, each recognition unit (word model or sub-word model) is typically characterised by an HMM, whose parameters are estimated from a training set of data. For large vocabulary speech recognition systems involving, for instance, 10,000 to 60,000 words, usually a limited set of, for instance 40, sub-word units is used, since it would require a lot of training data to adequately train an HMM for larger units. An HMM state corresponds to an acoustic reference (for speech recognition) or an allographic reference (for handwriting recognition). Various techniques are known for modelling a reference, including discrete or continuous probability densities.
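The composition of a word model from the lexicon and the sub-word models can be sketched as follows; the lexicon entry, unit names and reference labels are invented for illustration and mirror the four-references-per-unit layout of FIG. 2B.

```python
# Hypothetical lexicon (cf. 134) and sub-word models (cf. 132).
lexicon = {"hello": ["h@", "l@U"]}                    # word -> sub-word units
subword_models = {
    "h@":  ["ref251", "ref252", "ref253", "ref254"],  # acoustic references
    "l@U": ["ref261", "ref262", "ref263", "ref264"],
}

def compose_word_model(word):
    """Word model composer (cf. 136): concatenate the acoustic-reference
    sequences of the sub-word units that the lexicon lists for the word."""
    return [ref for unit in lexicon[word] for ref in subword_models[unit]]

print(compose_word_model("hello"))   # eight acoustic references in order
```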
A word level matching system 130 matches the observation vectors against all sequences of speech recognition units and provides the likelihoods of a match between the vector and a sequence. If sub-word units are used, constraints are placed on the matching by using the lexicon 134 to limit the possible sequences of sub-word units to sequences in the lexicon 134. This reduces the outcome to possible sequences of words. A sentence level matching system 140 uses a language model (LM) to place further constraints on the matching, so that the paths investigated are those corresponding to word sequences which are proper sequences as specified by the language model. In this way, the outcome of the unit matching subsystem 120 is a recognised sentence (RS). The language model used in pattern recognition may include syntactical and/or semantical constraints 142 of the language and the recognition task. A language model based on syntactical constraints is usually referred to as a grammar 144.
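How the two matching levels combine can be sketched as below, assuming per-word acoustic log-likelihoods are already available; the two callables stand in for the word level (130) and sentence level (140) systems and are assumptions of the sketch.

```python
import math

def sentence_score(words, acoustic_loglik, lm_prob):
    """Score one candidate word sequence: acoustic evidence per word plus
    language-model constraints over the growing word history."""
    score = 0.0
    for i, word in enumerate(words):
        p = lm_prob(word, tuple(words[:i]))   # sentence level constraint
        if p == 0.0:
            return float("-inf")              # not a proper sequence -> pruned
        score += acoustic_loglik(word) + math.log(p)
    return score
```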
Similar systems are known for recognising handwriting. The language model used for a handwriting recognition system may specify character sequences in addition to, or as an alternative to, word sequences.
The grammar 144 used by the language model provides the probability of a word sequence W=w1w2w3 . . . wq, which in principle is given by:

P(W)=P(w1)P(w2|w1)P(w3|w1w2) . . . P(wq|w1w2w3 . . . wq−1).
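A small worked example of this factorisation for a three-word sequence, with made-up probabilities:

```python
# Chain rule: P(W) = P(w1) * P(w2|w1) * P(w3|w1 w2); numbers are illustrative.
p_w1 = 0.1           # P(w1)
p_w2_given = 0.3     # P(w2 | w1)
p_w3_given = 0.5     # P(w3 | w1 w2)
print(p_w1 * p_w2_given * p_w3_given)   # P(W) ≈ 0.015
```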
Since in practice it is infeasible to reliably estimate the conditional word probabilities for all words and all sequence lengths in a given language, N-gram word models are widely used. In an N-gram model, the term P(wj|w1w2w3 . . . wj−1) is approximated by P(wj|wj−N+1 . . . wj−1). In practice, bigrams or trigrams are used. In a trigram, the term P(wj|w1w2w3 . . . wj−1) is approximated by P(wj|wj−2wj−1).
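The trigram approximation amounts to truncating the history to the last two words; a sketch follows, with an invented probability table and a small floor for unseen trigrams (both assumptions of the sketch, not part of the patent):

```python
# Illustrative trigram table: (w_{j-2}, w_{j-1}, w_j) -> P(w_j | w_{j-2} w_{j-1}).
trigram_prob = {("the", "quick", "fox"): 0.2}

def p_approx(word, history, floor=1e-6):
    """Approximate P(word | w1 ... wj-1) by P(word | wj-2 wj-1)."""
    w1, w2 = (("<s>", "<s>") + tuple(history))[-2:]   # pad short histories
    return trigram_prob.get((w1, w2, word), floor)    # floor for unseen trigrams
```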
The invention relates to recognition systems which use a vocabulary and/or a language model which can, preferably automatically, be built from a textual training corpus. A vocabulary can be retrieved from a document simply by collecting all the different words in the document. The set of words may be reduced, for instance, to words which occur frequently in the document (in absolute or relative terms, e.g. relative to other words in the document or relative to a frequency of occurrence in default documents).
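A minimal sketch of such vocabulary extraction, assuming whitespace tokenisation and an absolute frequency threshold (a relative threshold would instead compare against other words or against counts in default documents):

```python
from collections import Counter

def build_vocabulary(document, min_count=2):
    """Collect the different words of a document, reduced to those that
    occur at least min_count times."""
    counts = Counter(document.lower().split())
    return {word for word, c in counts.items() if c >= min_count}
```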
A way of automatically building an N-gram language model is to estimate the conditional probabilities P(wj|wj−N+1 . . . wj−1) by a simple relative frequency: F(wj−N+1 . . . wj−1wj)/F(wj−N+1 . . . wj−1), in which F is the number of occurrences of the string in its argument in the given textual training corpus. For the estimate to be reliable, F(wj−N+1 . . . wj−1wj) has to be substantial in the given corpus. One way of achieving this is to use an extremely large training corpus, which covers most relevant word sequences. This is not a practical solution for most systems, since the language model becomes very large (resulting in slow or degraded recognition and high storage requirements). Another approach is to ensure that the training corpus is representative of many words and word sequences used for a specific recognition task. This can be achieved by
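The relative-frequency estimate F(wj−N+1 . . . wj−1wj)/F(wj−N+1 . . . wj−1) itself is straightforward to sketch; the counting scheme below recomputes the counts on each call purely for brevity:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """F(.): number of occurrences of each n-word string in the corpus."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def conditional_prob(tokens, history, word):
    """Estimate P(word | history) as F(history word) / F(history)."""
    n = len(history) + 1
    numerator = ngram_counts(tokens, n)[tuple(history) + (word,)]
    denominator = ngram_counts(tokens, n - 1)[tuple(history)]
    return numerator / denominator if denominator else 0.0
```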
