Method and array for introducing temporal correlation in...

Reexamination Certificate

C704S256000, C704S241000, C704S240000, C704S251000

active

06832190

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention is directed to a method and to an arrangement for the recognition of spoken language by a computer.
2. Description of the Related Art
A method and an arrangement for the recognition of spoken language are known from the publication by N. Haberland et al., “Sprachunterricht – Wie funktioniert die computerbasierte Spracherkennung?” [Language lessons: how does computer-based speech recognition work?], c't – Magazin für Computertechnik, May 1998, Heinz Heise Verlag, Hannover, 1998. In the recognition of spoken language, a signal analysis and a global search, which accesses an acoustic model and a linguistic model of the language to be recognized, are carried out in order to obtain a recognized word sequence from a digitized voice signal. The acoustic model is based on a phoneme inventory, implemented with the aid of hidden Markov models (HMMs), and on a pronunciation lexicon, implemented as a tree lexicon. The linguistic model contains tri-gram statistics, i.e. statistics over sequences of three words. During the global search, the most probable word sequences for the feature vectors produced by the signal analysis are determined with the aid of the acoustic model and output as the recognized word sequence. These relationships are explained in detail in the same publication by N. Haberland et al.
To make the following discussion easier to follow, the terms employed are briefly explained here.
As a phase of computer-based speech recognition, the signal analysis comprises in particular a Fourier transformation of the digitized voice signal followed by a feature extraction. According to the publication by N. Haberland et al., “Sprachunterricht – Wie funktioniert die computerbasierte Spracherkennung?”, the signal analysis is carried out every ten milliseconds. From overlapping time segments each lasting, for example, 25 milliseconds, approximately 30 features are determined by the signal analysis and combined to form a feature vector. For example, at a sampling frequency of 16 kHz, 400 signal amplitude values (16,000 samples/s × 0.025 s) enter into the calculation of one feature vector. In particular, the components of the feature vector describe the spectral energy distribution of the corresponding signal excerpt. To arrive at this energy distribution, a Fourier transformation is carried out on every signal excerpt (25 ms excerpt). This yields the representation of the signal in the frequency domain and, from it, the components of the feature vector. After the signal analysis, the digitized voice signal is thus present in the form of feature vectors.
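The framing arithmetic above (a 25 ms window every 10 ms at 16 kHz gives 400 samples per frame) can be illustrated with a minimal Python sketch. It assumes a plain rectangular window (practical systems usually apply a tapered window such as Hamming) and a naive discrete Fourier transform. The helper names `frame_signal`, `power_spectrum` and `band_energies` are hypothetical, and reducing the spectrum to 30 equal-width log band energies is a simplification standing in for the unspecified feature extraction, not the method of the cited publication.

```python
import cmath
import math

SAMPLE_RATE_HZ = 16_000  # sampling frequency given in the text
FRAME_MS = 25            # analysis window length (example value from the text)
HOP_MS = 10              # a new feature vector every 10 ms


def frame_signal(samples, sample_rate=SAMPLE_RATE_HZ, frame_ms=FRAME_MS, hop_ms=HOP_MS):
    """Split a digitized signal into overlapping frames (rectangular window)."""
    frame_len = sample_rate * frame_ms // 1000  # 400 samples at 16 kHz
    hop_len = sample_rate * hop_ms // 1000      # 160 samples at 16 kHz
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop_len)]


def power_spectrum(frame):
    """Naive DFT: spectral energy |X[k]|^2 for bins k = 0 .. N/2."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]


def band_energies(spectrum, num_bands=30):
    """Simplified features (assumption): log energy in equal-width frequency bands."""
    width = len(spectrum) // num_bands
    return [math.log(sum(spectrum[b * width:(b + 1) * width]) + 1e-10)
            for b in range(num_bands)]
```

For a 1000 Hz test tone, exactly 25 periods fit into one 400-sample frame, so the spectral energy peaks in DFT bin 25 (bin spacing 16,000 / 400 = 40 Hz).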
These feature vectors are supplied to the global search, a further phase of speech recognition. As already mentioned, the global search makes use of the acoustic model and, potentially, of the linguistic model in order to map the sequence of feature vectors onto individual parts of the language (the vocabulary) that are present in modelled form. A language is composed of a given number of sounds, referred to as phonemes, whose totality is referred to as the phoneme inventory. The vocabulary is modelled by phoneme sequences and stored in a pronunciation lexicon. Each phoneme is modelled by at least one HMM. A plurality of HMMs yields a stochastic automaton that comprises states and state transitions. HMMs can model the temporal course of the occurrence of specific feature vectors (even within a phoneme). A corresponding phoneme model thereby comprises a given number of states arranged in linear succession. A state of an HMM represents a part of a phoneme (for example an excerpt of 10 ms length). Each state is associated with an emission probability distribution for the feature vectors, in particular a Gaussian distribution, and with transition probabilities for the possible transitions. The emission distribution assigns to a feature vector the probability with which it is observed in the corresponding state. The possible transitions are a direct transition from one state into the next state, a repetition of the state, and a skip over the following state.
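A linear phoneme model with the three permitted transitions (repeat, next, skip) and a Gaussian emission density can be sketched as follows. The transition probabilities are placeholder values chosen for illustration rather than trained values, the diagonal covariance is an assumption, and the function names are hypothetical.

```python
import math


def left_to_right_transitions(num_states, p_loop=0.6, p_next=0.3, p_skip=0.1):
    """Transition matrix of a linear phoneme HMM. Allowed moves: repeat the
    state, go to the next state, or skip one state. Probabilities are
    placeholders; rows are renormalized where moves run past the last state,
    so the last state simply loops (the exit to the next model is omitted)."""
    matrix = [[0.0] * num_states for _ in range(num_states)]
    for i in range(num_states):
        moves = {i: p_loop}
        if i + 1 < num_states:
            moves[i + 1] = p_next
        if i + 2 < num_states:
            moves[i + 2] = p_skip
        total = sum(moves.values())
        for j, p in moves.items():
            matrix[i][j] = p / total
    return matrix


def log_gaussian_diag(x, mean, var):
    """Log emission density of feature vector x in a state modelled by a
    diagonal-covariance Gaussian (one mean and variance per component)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))
```

Because transitions only ever point to the same or a later state, the matrix is upper triangular, which is what makes the phoneme model "linear".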
Unrolling the HMM states and their transitions over time yields what is referred to as a trellis. The principle of dynamic programming is employed to determine the acoustic probability of a word: the path through the trellis is sought that exhibits the fewest errors or, equivalently, the highest probability for the word to be recognized.
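The dynamic-programming search through the trellis corresponds to the Viterbi algorithm. The following sketch works in the log domain to avoid numerical underflow; the argument layout (`log_emissions[t][j]`, `log_trans[i][j]`, `log_init[j]`) and the function names are assumptions made for this example.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """log(p), with log(0) mapped to -inf for forbidden transitions."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi(log_emissions, log_trans, log_init):
    """Best path through the trellis.
    log_emissions[t][j]: log emission probability of frame t in state j
    log_trans[i][j]:     log transition probability from state i to j
    log_init[j]:         log probability of starting in state j
    Returns (best log probability, best state sequence)."""
    num_frames, num_states = len(log_emissions), len(log_init)
    delta = [log_init[j] + log_emissions[0][j] for j in range(num_states)]
    backpointers = []
    for t in range(1, num_frames):
        new_delta, pointers = [], []
        for j in range(num_states):
            # Best predecessor state for reaching state j at frame t.
            best_i = max(range(num_states), key=lambda i: delta[i] + log_trans[i][j])
            new_delta.append(delta[best_i] + log_trans[best_i][j] + log_emissions[t][j])
            pointers.append(best_i)
        delta = new_delta
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(range(num_states), key=lambda j: delta[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return delta[last], path
```

With a two-state left-to-right model whose first frame favours state 0 and whose remaining frames favour state 1, the sketch returns the path `[0, 1, 1]`.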
The parameters of the emission distributions are determined from example sentences in a training phase.
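In the simplest case, once the training frames have been assigned to HMM states (for example by a Viterbi alignment; this is an assumption, since the text does not say how training proceeds), the parameters of a diagonal Gaussian emission distribution are the maximum-likelihood sample mean and variance of the frames assigned to that state. The variance floor and the function name `estimate_diag_gaussian` are additions for this sketch.

```python
def estimate_diag_gaussian(vectors, var_floor=1e-3):
    """Maximum-likelihood mean and per-component variance of the training
    feature vectors assigned to one HMM state. The variance floor keeps a
    component that never varies in the training data from collapsing to 0."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, var_floor)
           for d in range(dim)]
    return mean, var
```

Full training procedures repeat alignment and re-estimation until the parameters converge; the sketch covers only a single re-estimation step.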
In addition to the acoustic model described above, the language model (also: linguistic model) may also be taken into consideration in the global search. The language model has the task of determining the linguistic probability of a sentence hypothesis. If a sequence of words makes no sense, it has a correspondingly low linguistic probability in the language model. In particular, sequences of two words (bigrams) or of three words (trigrams) are used to calculate linguistic probabilities for these bigrams or trigrams. Because words of the vocabulary can be combined into bigrams, trigrams, or in general n-grams almost arbitrarily, storing all n-grams is ultimately a question of the available memory.
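Bigram and trigram probabilities are commonly estimated by relative frequency. The sketch below assumes unsmoothed maximum-likelihood estimates with hypothetical sentence-boundary tokens `<s>` and `</s>`; practical language models add smoothing so that unseen n-grams do not receive probability zero.

```python
from collections import Counter


def ngram_probabilities(sentences, n=2):
    """Relative-frequency estimates P(w_n | w_1 .. w_{n-1}) for all observed
    n-grams. No smoothing: n-grams absent from the training text are simply
    not stored (probability 0)."""
    ngram_counts, history_counts = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + words + ["</s>"]
        for i in range(len(padded) - n + 1):
            gram = tuple(padded[i:i + n])
            ngram_counts[gram] += 1
            history_counts[gram[:-1]] += 1
    return {gram: count / history_counts[gram[:-1]]
            for gram, count in ngram_counts.items()}
```

The memory problem follows from the number of possible n-grams, V^n for a vocabulary of V words: with an assumed vocabulary of 20,000 words there are 20,000³ = 8 × 10¹² possible trigrams, which is why only observed n-grams are stored.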
The result of the global search is the output of a recognized word sequence, obtained by taking both the acoustic model (phoneme inventory) and the language model into consideration.
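The way the two models are combined is usually expressed by the Bayes decision rule, shown here as the standard formulation rather than as a detail taken from the cited publication. With X = x_1, ..., x_T the sequence of feature vectors and W a candidate word sequence:

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \arg\max_{W} P(X \mid W)\, P(W)
```

P(X | W) is supplied by the acoustic model and P(W) by the language model; P(X) does not depend on W and can be dropped. In practice the product is evaluated as a sum of log probabilities.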
In an HMM, it is assumed that the emission probability of a feature vector depends on only one state. This assumption gives rise to modelling errors that have a significant influence on the recognized word sequence.
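The assumption can be written out explicitly. With s_t the HMM state at time t, a standard HMM treats each feature vector as conditionally independent of its predecessors given the current state, so the likelihood of a state sequence factorizes over frames:

```latex
p(x_t \mid x_1, \ldots, x_{t-1}, s_t) \approx p(x_t \mid s_t),
\qquad
p(x_1, \ldots, x_T \mid s_1, \ldots, s_T) = \prod_{t=1}^{T} p(x_t \mid s_t)
```

A model that retains temporal correlation instead conditions on a prescribed number N of preceding vectors (N is a symbol introduced here), i.e. it uses p(x_t | x_{t-1}, ..., x_{t-N}, s_t).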
The publication by Kenny et al., “Linear Predictive HMM for Vector-Valued Observations with Applications to Speech Recognition”, IEEE Transactions on ASSP, Volume 38, 1990, pages 220-225, discloses a method for recognizing spoken language with a computer wherein feature vectors describing a digitized voice signal are calculated dependent on a plurality of preceding feature vectors.
SUMMARY OF THE INVENTION
An object of the present invention is to provide an arrangement and a method for speech recognition that enable improved recognition, compared to the prior art, on the basis of modified hidden Markov models.
This object is achieved by a method for recognizing spoken language with a computer, wherein a digitized voice signal is determined from the spoken language; a signal analysis, from which feature vectors describing the digitized voice signal result, is carried out on the digitized voice signal; a global search for mapping the feature vectors onto a language present in modelled form is carried out, whereby phonemes of the language are described by a modified hidden Markov model; the modified hidden Markov model comprises a conditional probability of a feature vector given a prescribed plurality of preceding feature vectors; the conditional probability is approximated by a combination of two separately modelled probabilities, the first of which ignores a correlation of the feature vectors, whereas the second takes the correlation of the feature vectors into consideration; and the spoken language is recognized in that a recognized word sequence is output by the global search.
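The text states that the conditional probability is approximated by combining two separately modelled probabilities, but not how they are modelled or combined. The sketch below is therefore hypothetical throughout: the first probability is a diagonal Gaussian p(x_t | s) that ignores correlation, the second is a linear-predictive Gaussian in the spirit of the Kenny et al. approach cited above, and the two are merged by a weighted log-linear combination with an assumed weight. The state fields `mean`, `var`, `pred_coeffs` and `res_var` are invented names. A log-linear combination is not normalized in general, so the result is a score rather than a proper density.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def log_p_independent(x_t, state):
    """First probability: p(x_t | s), ignores correlation between feature vectors."""
    return log_gauss(x_t, state["mean"], state["var"])


def log_p_correlated(x_t, history, state):
    """Second probability (hypothetical form): p(x_t | x_{t-1}, ..., x_{t-N}, s)
    as a Gaussian whose mean is linearly predicted from the preceding vectors.
    history[0] is x_{t-1}, history[1] is x_{t-2}, and so on."""
    mean = state["mean"]
    predicted = list(mean)
    for coeff, prev in zip(state["pred_coeffs"], history):
        predicted = [p + coeff * (xp - m) for p, xp, m in zip(predicted, prev, mean)]
    return log_gauss(x_t, predicted, state["res_var"])


def log_p_combined(x_t, history, state, weight=0.5):
    """Hypothetical log-linear combination of the two separately modelled
    probabilities: weight=0 ignores correlation, weight=1 uses only the
    correlated model. The result is a score, not a normalized density."""
    return ((1.0 - weight) * log_p_independent(x_t, state)
            + weight * log_p_correlated(x_t, history, state))
```

For a slowly varying trajectory such as x_{t-1} = 1.0 followed by x_t = 0.9 in a state with mean 0, the correlated term scores the observation much higher than the independent term, because the prediction from the preceding vector lies close to the observed value.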
To achieve the object, a method for recognizing spoken language with a computer is provided, wherein a digitized speech signal is determined from the spoken language and a signal analysis is carried out on the digitized voice signal, whereby feature vectors describing the digitized voice signal are determined. In a global search for mapping the feature vectors onto a language present in modelled form, each phoneme of the lan
