Method and configuration for determining a descriptive...

Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition

Reexamination Certificate


Details

C704S255000

active

06523005

ABSTRACT:

BACKGROUND OF THE INVENTION
Field of the Invention
The invention relates to a method and to a configuration for determining a descriptive feature of a speech signal. Such a method and such a configuration are known from E. G. Schukat-Talamazzini: Automatische Spracherkennung-Grundlagen, statistische Modelle und effiziente Algorithmen [Automatic speech recognition—fundamentals, statistical models and efficient algorithms], Vieweg & Sohn Verlagsgesellschaft mbH, Braunschweig/Wiesbaden 1995, pages 45-74. There, the extraction of a discrete time sequence of feature vectors from the speech signal pursues several goals: the digital representation of the speech sound; the reduction of the data volume; the emphasizing of variable attributes which are helpful in identifying the utterance content (of the spoken sounds and words); and the removal of variable attributes which characterize the speaker, the accent, environmental influences and acoustic and electric transmission properties.
In general, feature vectors of relevant pattern classes of the field of application are to occupy compact zones of the feature space, and it is to be possible to separate the zones of different pattern classes from one another as sharply as possible. Known techniques of obtaining features are predominantly based on the combination of methods from digital signal processing, in particular series expansions, with functional models for the production or perception of speech.
After being picked up, the sound wave is present in the form of an electric signal which is described by a real, continuous target function f̃(t). The range of definition and the range of values of the signal must be discretized for the purpose of further processing on a computer. Sampling the target function f̃(t) at discrete interpolation points leads to loss of information. However, if f̃(t) satisfies a spectral band limitation, the function can be reconstructed from its samples if the sampling frequency is selected to be sufficiently high.
Sound waves are nonstationary signals; their spectral properties vary from sound to sound. Even intraphonetically, the dynamics of the articulation gestures effect continuous (in the case of diphthongs) and abrupt (in the case of plosives and affricates) variations in the sound structure. The speech signal can be regarded as approximately stationary only over very short time intervals lasting approximately 5-30 ms.
It is not necessary to calculate short-term features of the speech signal at each sampling instant m. A windowed segment of the speech signal, on the order of 25 ms long, is moved through the speech signal with an advance of 10 ms. A feature vector is thus produced every 10 ms: at each 10 ms instant, the values of the data window (25 ms) are analyzed for their spectral and periodic properties and are stored in the form of the feature vector.
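The windowing described above can be illustrated with a minimal sketch. The 25 ms window and 10 ms advance come from the text; the 16 kHz sampling rate, the 440 Hz test tone, and the function names frame_signal and mean_energy are assumptions made only for illustration. The mean energy is a simple stand-in for the spectral and periodic analysis, which the text does not specify in detail.

```python
import math


def frame_signal(samples, sample_rate_hz, window_ms=25.0, shift_ms=10.0):
    """Cut a sampled signal into overlapping analysis windows.

    window_ms and shift_ms follow the 25 ms / 10 ms values in the text.
    """
    window_len = int(round(sample_rate_hz * window_ms / 1000.0))
    shift_len = int(round(sample_rate_hz * shift_ms / 1000.0))
    frames = []
    start = 0
    # Only complete windows are kept; a trailing partial window is dropped.
    while start + window_len <= len(samples):
        frames.append(samples[start:start + window_len])
        start += shift_len
    return frames


def mean_energy(frame):
    """Placeholder feature (assumption): mean squared amplitude of a window."""
    return sum(x * x for x in frame) / len(frame)


# Assumed test input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate_hz = 16000
signal = [math.sin(2.0 * math.pi * 440.0 * n / sample_rate_hz)
          for n in range(sample_rate_hz)]

frames = frame_signal(signal, sample_rate_hz)
features = [mean_energy(frame) for frame in frames]
```

Each entry of features corresponds to one 10 ms step, so the feature sequence is far shorter than the sample sequence while still following the short-term behavior of the signal.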
Hidden Markov models (HMM) for modeling sounds are also known from E. G. Schukat-Talamazzini: Automatische Spracherkennung-Grundlagen, statistische Modelle und effiziente Algorithmen [Automatic speech recognition: fundamentals, statistical models and efficient algorithms], Vieweg & Sohn Verlagsgesellschaft mbH, Braunschweig/Wiesbaden 1995, pages 125-139. As a word is being produced in speech, the constituent sounds are realized with a variable duration and in a different spectral composition. An unpredictable number of feature vectors occurs for each individual phonetic segment of the utterance, depending on the rate and rhythm of speech. In addition to its phonetic content, each vector also includes information components conditioned by the speaker, environment and slurring, and these substantially complicate phonetic identification.
These relationships can be modeled in a simplified fashion by a two-stage process as is shown in FIG. 1 using, as an example, the German word “haben”. Reserved in the model for the phonemes of the word is a corresponding number of states 102 to 106 which are run through in the direction of the arrow 101 to produce speech. With each time cycle, it is possible to remain in the current state or to make a transition to the successor state. The system behaves statistically and is determined by the transition probabilities 107 to 111 illustrated. Thus, the state 103 belonging to the phoneme /a/ is adopted over a plurality (on average over ten) of successive short-term analysis intervals, whereas realizations of the plosive /b/ require less time.
Whereas the described first stage of the random process models the temporal distortion of different variant pronunciations, a second stage serves to detect spectral variations. Each state of the word model is associated with a statistical output function which weights the phonetic alternative realizations. In the example of FIG. 1, in addition to the actual matching phonetic class 113, the phonetic class 114 with a positive probability (here: 0.1) is permitted for producing the phoneme /a/. The phonetic class 118 for producing the phoneme / with a probability of 0.3 is also permitted. The described formalism also allows for description of an optional sound elimination, expressed by the “bridging” 119 of the state 105 by a direct transition between the states 104 and 106. The bridge is assigned a probability of 0.2, for example. The transition probabilities of the hidden Markov model can be determined using training data. The finally trained HMM then constitutes a rule for producing sound sequences (compare E. G. Schukat-Talamazzini: Automatische Spracherkennung-Grundlagen, statistische Modelle und effiziente Algorithmen [Automatic speech recognition: fundamentals, statistical models and efficient algorithms], Vieweg & Sohn Verlagsgesellschaft mbH, Braunschweig/Wiesbaden 1995, pages 127-139). One method for training the HMM is to use the Baum-Welch algorithm.
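The two-stage process can be illustrated by a small left-to-right model with output probabilities and one bridging transition, evaluated with the standard forward algorithm. This is a sketch under stated assumptions, not the model of FIG. 1: the five states, the class labels, and all probabilities are hypothetical, except that the alternative output probability 0.1 for the /a/ state and the bridge probability 0.2 are taken from the text.

```python
def forward_probability(observations, trans, emit, start_state, final_state):
    """Probability that the model produces the observed class sequence.

    trans[s] maps successor states to transition probabilities;
    emit[s] maps phonetic class labels to output probabilities.
    The path must start in start_state and end in final_state.
    """
    alpha = {start_state: emit[start_state].get(observations[0], 0.0)}
    for symbol in observations[1:]:
        next_alpha = {}
        for state, prob in alpha.items():
            for succ, p_trans in trans[state].items():
                p = prob * p_trans * emit[succ].get(symbol, 0.0)
                if p > 0.0:
                    next_alpha[succ] = next_alpha.get(succ, 0.0) + p
        alpha = next_alpha
    return alpha.get(final_state, 0.0)


# Hypothetical five-state word model; state 2 can bridge over state 3.
trans = {
    0: {0: 0.5, 1: 0.5},
    1: {1: 0.9, 2: 0.1},
    2: {2: 0.3, 3: 0.5, 4: 0.2},  # 0.2: bridge probability from the text
    3: {3: 0.5, 4: 0.5},
    4: {4: 1.0},                  # simplification: final state never left
}
emit = {
    0: {"h": 1.0},
    1: {"a": 0.9, "a_alt": 0.1},  # 0.1: alternative class from the text
    2: {"b": 1.0},
    3: {"e": 1.0},
    4: {"n": 1.0},
}

p_full = forward_probability(["h", "a", "b", "e", "n"], trans, emit, 0, 4)
p_bridged = forward_probability(["h", "a", "a", "b", "n"], trans, emit, 0, 4)
```

The second sequence omits the fourth sound entirely, and it still receives a nonzero probability only because of the bridging transition. Training with the Baum-Welch algorithm would re-estimate the entries of trans and emit from data. It is not shown here.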
It may be remarked in this regard that a trained HMM can be used both for speech recognition, that is to say to compare a natural-speech utterance with the model, and for speech synthesis, that is to say to produce a sound with the aid of the training data.
The 10 ms segments for feature vectors mentioned at the beginning are not sufficient, in particular, for speech synthesis. However, with the known mechanisms, a much finer temporal subdivision leads to a lack of convergence in the HMM training.
SUMMARY OF THE INVENTION
It is accordingly an object of the invention to provide a configuration and a method for determining a descriptive feature of a speech signal which overcome the above-mentioned disadvantages of the prior art apparatus and methods of this general type. In particular, it is an object of the invention to obtain a descriptive feature of a speech signal which still supplies meaningful features at a high sampling rate.
With the foregoing and other objects in view there is provided, in accordance with the invention, a method for determining a descriptive feature of a speech signal, that includes steps of: training a first speech model with a first time pattern; training a second speech model with a second time pattern; and initializing the second speech model with the first speech model.
In accordance with an added feature of the invention, the second time pattern is smaller than the first time pattern.
One advantage is that, because of the initialization with the knowledge gained from the first speech model, the second speech model also converges for a very small second time pattern, and thus correspondingly high-resolution information about the speech signal is available.
This information is useful precisely for speech synthesis, since the transition, which is difficult to synthesize, between the sounds is more accurately modeled by the higher (temporal) resolution.
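The passage above does not spell out how the knowledge of the first speech model is carried over. Purely as an illustration, one hypothetical possibility is to convert the self-transition probabilities learned at the coarse time pattern so that the mean duration of each state, measured in milliseconds, is preserved at the finer time pattern. The function names, the 2 ms fine time pattern, and the example probabilities are all assumptions; this sketch is not taken from the patent.

```python
def rescale_self_loop(p_coarse, coarse_shift_ms, fine_shift_ms):
    """Self-loop probability at the fine time pattern (assumed mapping).

    Keeps the mean state duration in milliseconds,
    shift / (1 - p), unchanged between the two time patterns.
    """
    ratio = fine_shift_ms / coarse_shift_ms
    return 1.0 - ratio * (1.0 - p_coarse)


def mean_duration_ms(self_loop_prob, shift_ms):
    """Mean time spent in a state with the given self-loop probability."""
    return shift_ms / (1.0 - self_loop_prob)


def init_fine_self_loops(coarse_self_loops, coarse_shift_ms, fine_shift_ms):
    """Initial self-loop values for the second model, one per state."""
    return [rescale_self_loop(p, coarse_shift_ms, fine_shift_ms)
            for p in coarse_self_loops]


# Assumed self-loop values of a first model trained at a 10 ms time pattern.
coarse_self_loops = [0.5, 0.9, 0.3, 0.5]
fine_self_loops = init_fine_self_loops(coarse_self_loops, 10.0, 2.0)
```

Starting the second training from values such as these, rather than from arbitrary ones, is one way the finer model could begin close to a plausible solution; the subsequent training would still adjust them to the data.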
Generally, in this case time pattern is understood as the repetition rate at which sampling of the speech signal is performed or at which the time window (specified at the beginning as having a wi
