Method and apparatus for using formant models in speech systems

Data processing: speech signal processing – linguistics – language – Speech signal processing – For storage or transmission


Details

Type: Reexamination Certificate
Subclass: C704S201000
Status: active
Patent number: 06505152

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention relates to speech recognition and synthesis systems and in particular to speech systems that exploit formants in speech.
In human speech, a great deal of information is contained in the first three resonant frequencies or formants of the speech signal. In particular, when a speaker is pronouncing a vowel, the frequencies and bandwidths of the formants indicate which vowel is being spoken.
To detect formants, some prior art systems examine the speech signal's frequency spectrum, where formants appear as peaks. In theory, simply selecting the first three peaks in the spectrum should provide the first three formants. However, due to noise in the speech signal, non-formant peaks can be mistaken for formant peaks and true formant peaks can be obscured. To account for this, prior art systems qualify each peak by examining its bandwidth: if the bandwidth is too large, the peak is eliminated as a candidate formant. The lowest three peaks that meet the bandwidth threshold are then selected as the first three formants.
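The peak-picking scheme described above can be illustrated with a short Python sketch. This is an illustrative reconstruction rather than the patent's implementation: it assumes the spectrum comes from an FFT of a windowed frame, approximates each peak's bandwidth by its width at half prominence, and the 400 Hz bandwidth threshold is an arbitrary example value.

import numpy as np
from scipy.signal import find_peaks

def pick_formants(frame, sample_rate, max_bandwidth_hz=400.0, n_formants=3):
    # Illustrative sketch of prior-art peak picking; parameter values are
    # example assumptions, not taken from the patent.
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    hz_per_bin = sample_rate / len(frame)

    # Find spectral peaks and measure their width at half prominence,
    # used here as a rough bandwidth estimate.
    peaks, props = find_peaks(spectrum, width=1, rel_height=0.5)
    bandwidths_hz = props["widths"] * hz_per_bin

    # Discard peaks that are too broad to be formants, then keep the
    # lowest three qualifying peaks as F1-F3.
    candidates = [(peak * hz_per_bin, bw)
                  for peak, bw in zip(peaks, bandwidths_hz)
                  if bw <= max_bandwidth_hz]
    return candidates[:n_formants]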
Although such systems provide a fair representation of the formant track, they are prone to errors such as discarding true formants, selecting peaks that are not formants, and incorrectly estimating the bandwidths of the formants. These errors go undetected during formant selection because prior art systems select formants for one segment of the speech signal at a time, without reference to the formants selected for previous segments.
To overcome this problem, some systems use heuristic smoothing after all of the formants have been selected. Although such post-decision smoothing removes some discontinuities between the formants, it is less than optimal.
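As one generic illustration of the kind of post-decision smoothing referred to above, a median filter can be run over each formant track after every frame has been decided. This is not the specific heuristic used by any particular prior art system, and the five-frame window is an arbitrary choice.

from scipy.signal import medfilt

def smooth_formant_track(track_hz, window_frames=5):
    # Median-smooth one formant's frequency track (one value per frame).
    # The window length is an illustrative assumption.
    return medfilt(track_hz, kernel_size=window_frames)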
In speech synthesis, the quality of the formant track in the synthesized speech depends on the technique used to create the speech. Under a concatenative system, sub-word units are spliced together without regard for their respective formant values. Although this produces sub-word units that sound natural by themselves, the complete speech signal sounds unnatural because of discontinuities in the formant track at sub-word boundaries. Other systems use rules to control how a formant changes over time. Such rule-based synthesizers never exhibit the discontinuities found in concatenative synthesizers, but their simplified model of how the formant track should change over time produces an unnatural sound.
SUMMARY OF THE INVENTION
The present invention utilizes a formant-based model to improve formant tracking and to improve the creation of formant tracks in synthesized speech.
Under one aspect of the invention, a formant-based model is used to track formants in an input speech signal. Under this aspect, the input speech signal is divided into segments, and each segment is examined to identify candidate formants. The candidate formants are grouped together, and sequences of groups are identified for a sequence of speech segments. Using the formant model, the probability of each sequence of groups is calculated, and the most likely sequence is selected. This sequence of groups then defines the formant tracks for the sequence of segments.
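The search over sequences of candidate groups can be sketched as a Viterbi-style dynamic program. The sketch below is a simplified reconstruction under stated assumptions: each formant model state is summarized by per-formant Gaussian means and variances, and frame-to-frame continuity is scored by a single Gaussian term. The patent's actual model and scoring are not reproduced here.

import math

def log_gaussian(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def best_group_sequence(candidate_groups, means, variances, continuity_var=200.0 ** 2):
    # candidate_groups[t] is the list of candidate (F1, F2, F3) tuples for
    # segment t; means/variances hold one value per formant. The Gaussian
    # observation and continuity scores are illustrative assumptions.
    def observation(group):
        return sum(log_gaussian(f, m, v) for f, m, v in zip(group, means, variances))

    # Initialise with the first segment's candidates.
    score = [[observation(g) for g in candidate_groups[0]]]
    back = [[None] * len(candidate_groups[0])]

    # Dynamic programming over segments: keep, for each candidate group, the
    # score of the best sequence that ends in it.
    for t in range(1, len(candidate_groups)):
        score_t, back_t = [], []
        for g in candidate_groups[t]:
            best_i, best_val = None, -math.inf
            for i, g_prev in enumerate(candidate_groups[t - 1]):
                trans = sum(log_gaussian(f, fp, continuity_var)
                            for f, fp in zip(g, g_prev))
                if score[t - 1][i] + trans > best_val:
                    best_i, best_val = i, score[t - 1][i] + trans
            score_t.append(best_val + observation(g))
            back_t.append(best_i)
        score.append(score_t)
        back.append(back_t)

    # Trace back the most likely sequence of candidate groups.
    idx = max(range(len(score[-1])), key=score[-1].__getitem__)
    path = [idx]
    for t in range(len(candidate_groups) - 1, 0, -1):
        idx = back[t][idx]
        path.append(idx)
    path.reverse()
    return [candidate_groups[t][i] for t, i in enumerate(path)]

Because the continuity term is part of the search itself, a whole-sequence decision of this kind replaces the post-hoc smoothing used in the prior art systems discussed above.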
Under one embodiment of the invention, the formant tracking system is used to train the formant model. Under this embodiment, the formant track selected for the sequence of segments is analyzed to generate a mean frequency and mean bandwidth for each formant in each formant model state. These mean frequencies and bandwidths are then used in place of the existing values in the formant model.
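A minimal sketch of this re-estimation step might look like the following, assuming the selected formant tracks have already been aligned to model states; the data layout and field names are illustrative, not the patent's.

import numpy as np

def reestimate_formant_model(aligned_tracks):
    # aligned_tracks maps state -> list of (frequencies, bandwidths) pairs,
    # where each element is a length-3 array (F1..F3) for one segment.
    model = {}
    for state, observations in aligned_tracks.items():
        freqs = np.array([f for f, _ in observations])
        bws = np.array([b for _, b in observations])
        model[state] = {
            "mean_frequencies": freqs.mean(axis=0),   # per-formant mean frequency
            "mean_bandwidths": bws.mean(axis=0),      # per-formant mean bandwidth
        }
    return model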
Another aspect of the present invention is the compression of a speech signal based on a formant model. Under this aspect of the invention, the formant track is determined for the speech signal using the technique described above. The formant track is then used to control a set of filters, which remove the formants from the speech signal to produce a residual excitation signal. Under some embodiments, this residual excitation signal is further compressed by decomposing it into a voiced portion and an unvoiced portion. The magnitude spectra of both portions are then compressed into a smaller set of representative values.
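The formant-removal stage can be sketched as a cascade of second-order inverse (anti-resonance) filters, one per tracked formant, leaving the residual excitation. The zero placement below uses the standard digital-resonator form r = exp(-πB/fs), θ = 2πF/fs; treating this as the patent's exact filter structure is an assumption.

import numpy as np
from scipy.signal import lfilter

def remove_formants(frame, formants, sample_rate):
    # formants is a list of (frequency_hz, bandwidth_hz) pairs for this frame.
    residual = frame
    for freq, bw in formants:
        r = np.exp(-np.pi * bw / sample_rate)
        theta = 2.0 * np.pi * freq / sample_rate
        b = [1.0, -2.0 * r * np.cos(theta), r * r]   # zero pair cancels the resonance
        residual = lfilter(b, [1.0], residual)
    return residual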
A third aspect of the present invention uses the formant model to synthesize speech. Under this aspect, text is divided into a sequence of formant model states, which are used to retrieve a sequence of stored excitation segments. The states are also provided to a formant path generator, which determines a set of most likely formant paths given the sequence of model states and the formant models for each state. The formant paths are then used to control a series of resonators, which introduce the formants into the sequence of excitation segments. This produces a sequence of speech segments that are later combined to form the synthesized speech signal.
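Complementing the compression sketch above, the resonator stage can be illustrated as the all-pole counterpart of those inverse filters, applied to a stored excitation segment under the formant path generated for the corresponding model states. The unity-DC-gain normalization is an illustrative choice, not taken from the patent.

import numpy as np
from scipy.signal import lfilter

def apply_formants(excitation, formants, sample_rate):
    # formants is a list of (frequency_hz, bandwidth_hz) pairs for this segment.
    speech = excitation
    for freq, bw in formants:
        r = np.exp(-np.pi * bw / sample_rate)
        theta = 2.0 * np.pi * freq / sample_rate
        a = [1.0, -2.0 * r * np.cos(theta), r * r]   # pole pair at the formant
        gain = sum(a)                                # normalize to unity gain at DC
        speech = lfilter([gain], a, speech)
    return speech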


REFERENCES:
patent: 4343969 (1982-08-01), Kellett
patent: 4813075 (1989-03-01), Ney
patent: 4831551 (1989-05-01), Schalk et al.
patent: 5042069 (1991-08-01), Chhatwal et al.
patent: 5381512 (1995-01-01), Holton et al.
patent: 5649058 (1997-07-01), Lee
patent: 5701390 (1997-12-01), Griffin et al.
patent: 5729694 (1998-03-01), Holzrichter et al.
patent: 5754974 (1998-05-01), Griffin et al.
patent: 5911128 (1999-06-01), DeJaco
patent: 6006180 (1999-12-01), Bardaud et al.
patent: 0878790 (1998-11-01), None
patent: 64-064000 (1989-09-01), None
patent: WO 9316465 (1993-08-01), None
“Acoustic Parameters of Voice Individuality and Voice-Quality Control by Analysis-Synthesis Method,” by Kuwabara et al., Speech Communication, vol. 10, North-Holland, pp. 491-495 (Jun. 15, 1991).
“Tracking of Partials for Additive Sound Synthesis Using Hidden Markov Models,” by Depalle et al., 1993 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 225-228 (Apr. 27, 1993).
“A Formant Vocoder Based on Mixtures of Gaussians,” by Zolfaghari et al., IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1575-1578 (1997).
“Application of Markov Random Fields to Formant Extraction,” by Wilcox et al., International Conference on Acoustics, Speech and Signal Processing, pp. 349-352 (1990).
“Role of Formant Frequencies and Bandwidths in Speaker Perception,” by Kuwabara et al., Electronics and Communications in Japan, Part 1, vol. 70, No. 9, pp. 11-21 (1987).
“A Family of Formant Trackers Based on Hidden Markov Models,” by Gary E. Kopec, International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 1225-1228 (1986).
“A Mixed-Excitation Frequency Domain Model for Time-Scale Pitch-Scale Modification of Speech”, by Alex Acero, Proceedings of the International Conference on Spoken Language Processing, Sydney, Australia, pp. 1923-1926 (Dec. 1998).
“From Text to Speech: The MITalk System”, by Jonathan Allen et al., MIT Press, Table of Contents pages v-xi, Preface pp. 1-6 (1987).
“Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences”, by Steve B. Davis et al., IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-28, No. 4, pp. 357-366 (Aug. 1980).
“Whistler: A Trainable Text-to-Speech System”, by Xuedong Huang et al., Proceedings of the International Conference on Spoken Language Systems, Philadelphia, PA, pp. 2387-2390 (Oct. 1996).
“An Algorithm for Speech Parameter Generation from Continuous Mixture HMMs with Dynamic Features”, by Keiichi Tokuda et al., Proceedings of the Eurospeech Conference, Madrid, pp. 757-760 (Sep. 1995).
“Extraction of Vocal-Tract System Characteristics from Speech Signals”, by B. Yegnanarayana, IEEE Transactions on Speech and Audio Processing, vol. 6, No. 4, pp. 313-327 (Jul. 1998).
“A New Paradigm for Reliable Automatic Formant Tracking”, by Yves Laprie et al., ICASSP-94, vol. 2, pp. 201-204 (1994).
“System for Automatic Formant Analysis of Voiced Speech”, by Ronald W. Schafer et al., The Journal of the Acoustical Society of America, vol. 47, No. 2 (Part 2), pp. 634-648 (1970).
Vucetic (“A Hardware Im
