Data processing: speech signal processing, linguistics, language
Speech signal processing
Synthesis
Reexamination Certificate (active)
1999-03-30
2001-07-24
Dorvil, Richemond (Department: 2741)
C704S267000
06266638
BACKGROUND
This invention relates to speech synthesis and, more particularly, to the databases from which sound units are obtained to synthesize speech.
While good quality speech synthesis is attainable by concatenating a small set of controlled units (e.g., diphones), the availability of large speech databases permits a text-to-speech system to synthesize natural-sounding voices more easily. When employing an approach known as unit selection, the large variety of available basic units with different prosodic characteristics and spectral variations reduces, or entirely eliminates, the prosodic modifications that the text-to-speech system may need to carry out. By removing the need for extensive prosodic modification, greater naturalness of the synthetic speech is achieved.
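The unit-selection idea described above is, at its core, a dynamic-programming search: for each target position, choose one candidate unit so that the summed target cost (mismatch to the desired specification) and join cost (discontinuity between consecutive units) is minimal. A minimal sketch, where the cost functions and the scalar "feature" per unit are placeholders rather than anything specified by the patent:

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi-style search: pick one candidate unit per target position
    so that the summed target and concatenation (join) costs are minimal."""
    # best[j]: cheapest path cost ending at candidate j of the current
    # position; back[i][j] remembers the predecessor chosen.
    best = [target_cost(targets[0], c) for c in candidates[0]]
    back = []
    for i in range(1, len(targets)):
        row, brow = [], []
        for c in candidates[i]:
            costs = [best[k] + join_cost(p, c)
                     for k, p in enumerate(candidates[i - 1])]
            k = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k] + target_cost(targets[i], c))
            brow.append(k)
        back.append(brow)
        best = row
    # Backtrack the cheapest full path.
    j = min(range(len(best)), key=best.__getitem__)
    path = [j]
    for brow in reversed(back):
        j = brow[j]
        path.append(j)
    return path[::-1]
```

For instance, with units represented by a single pitch value, `target_cost` as the pitch mismatch and `join_cost` as the pitch jump at the boundary, `select_units([100, 110], [[95, 105], [100, 120]], lambda t, c: abs(t - c), lambda p, c: abs(p - c))` selects the candidate index for each position.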
While having many different instances of each basic unit is strongly desired, variable voice quality is not. If it exists, it not only makes the concatenation task more difficult but also results in synthetic speech whose voice quality changes even within the same sentence. Depending on the variability of the voice quality in the database, a synthetic sentence can be perceived as “rough,” even if a smoothing algorithm is used at each concatenation point, and perhaps even as if different speakers utter various parts of the sentence. In short, inconsistencies in voice quality within the same unit-selection speech database can degrade the overall quality of the synthesis. Of course, the unit-selection procedure can be made highly discriminative to disallow mismatches in voice quality, but then the synthesizer uses only part of the database, even though time (and money) was invested to make the complete database available (recording, phonetic labeling, prosodic labeling, etc.).
Recording large speech databases for speech synthesis is a very long process, ranging from many days to months. The duration of each recording session can be as long as 5 hours (including breaks, instructions, etc.) and the time between recording sessions can be more than a week. Thus, the probability of variations in voice quality from one recording session to another (inter-session variability) as well as during the same recording session (intra-session variability) is high.
Detecting voice-quality differences in the database is a difficult task because the database is large. A listener has to remember the quality of the voice from different recording sessions, not to mention the sheer time that checking a complete store of recordings would take.
The problem of assessing voice quality and correcting it has some similarity to speaker-adaptation problems in speech recognition. In the latter, “data oriented” compensation techniques have been proposed that attempt to filter noisy speech feature vectors to produce “clean” speech feature vectors. However, in the recognition problem it is the recognition score that is of interest, regardless of whether the adapted speech feature vector really matches that of “clean” speech.
The above discussion clearly shows the difficulty of our problem: not only is automatic detection of quality desired, but any modification or correction of the signal has to result in speech of very high quality. Otherwise the overall attempt to correct the database has no meaning for speech synthesis. While consistency of voice quality in a unit-selection speech database is, therefore, important for high-quality speech synthesis, no method for automatic voice quality assessment and correction has been proposed yet.
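One way the automatic detection step could be approached is to model a reference recording session statistically and score every segment against that model. The sketch below uses scikit-learn's `GaussianMixture` fitted on per-frame spectral features (MFCCs or log-spectra would be typical choices); the feature representation, the diagonal covariances, and the `z`-score threshold rule are all illustrative assumptions, not the patent's formulation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def flag_quality_outliers(reference_feats, segment_feats, n_components=8, z=2.0):
    """Fit a GMM on per-frame spectral features from a reference recording
    session, then flag segments whose average log-likelihood under that
    model falls more than `z` standard deviations below the reference mean
    (hypothetical threshold rule)."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          random_state=0).fit(reference_feats)
    ref_scores = gmm.score_samples(reference_feats)  # per-frame log-likelihoods
    mu, sigma = ref_scores.mean(), ref_scores.std()
    return [bool(gmm.score_samples(seg).mean() < mu - z * sigma)
            for seg in segment_feats]
```

Averaging the log-likelihood over all frames of a segment makes the flag a per-segment decision, which matches the granularity at which a correction filter would later be applied.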
SUMMARY
To increase the naturalness of concatenative speech synthesis, a database of recorded speech units spanning a number of recording sessions is processed, and appropriate segments of those sessions are modified by passing their signals through an AR (autoregressive) filter. The processing develops a Gaussian Mixture Model (GMM) for each recording session and, based on the variability of the speech quality within each session as captured by its model, selects one session as the preferred session. Thereafter, all segments of all recording sessions are evaluated against the model of the preferred session. The average power spectral density of each evaluated segment is compared to the power spectral density of the preferred session, and from this comparison AR filter coefficients are derived for each segment so that, when the speech segment is passed through the AR filter, its power spectral density approaches that of the preferred session.
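The correction step might be sketched as follows: estimate the segment's PSD, form the ratio of the target (preferred-session) PSD to it, and fit an all-pole filter to that ratio via the Levinson-Durbin recursion, so that filtering the segment moves its PSD toward the target. The Welch estimator, the filter order, and the normalization are illustrative choices, not the patent's specific method:

```python
import numpy as np
from scipy.signal import lfilter, welch

def ar_correction_filter(segment, target_psd, fs, order=12, nperseg=256):
    """Derive an all-pole (AR) correction filter whose squared magnitude
    response approximates target_psd / segment_psd, and apply it so the
    filtered segment's power spectral density moves toward the target."""
    _, seg_psd = welch(segment, fs=fs, nperseg=nperseg)
    # Desired squared magnitude response of the corrective filter.
    ratio = target_psd / np.maximum(seg_psd, 1e-12)
    # Mirror the one-sided spectrum and take its inverse FFT to obtain the
    # autocorrelation sequence of the desired response.
    full = np.concatenate([ratio, ratio[-2:0:-1]])
    r = np.fft.ifft(full).real[: order + 1]
    # Levinson-Durbin recursion: solve the Toeplitz normal equations for
    # the AR (denominator) coefficients and the prediction error power.
    a = np.array([1.0])
    e = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:], r[i - 1:0:-1])) / e
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]
        e *= 1.0 - k * k
    # All-pole filter sqrt(e)/A(z); its power response approximates `ratio`.
    return lfilter([np.sqrt(e)], a, segment)
```

`target_psd` must be sampled on the same one-sided frequency grid that `welch` produces for the segment (`nperseg // 2 + 1` points); in the flat-ratio limit the fitted filter degenerates to a pure gain, which is a quick sanity check on the recursion.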
REFERENCES:
patent: 4624012 (1986-11-01), Lin et al.
patent: 4718094 (1988-01-01), Bahl et al.
patent: 5271088 (1993-12-01), Bahler
patent: 5689616 (1997-11-01), Li
patent: 5860064 (1999-01-01), Henton
patent: 5913188 (1999-06-01), Tzirkel-Hancock
patent: 6144939 (2000-11-01), Parson et al.
patent: 6163768 (2000-12-01), Sherwood et al.
S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall, p. 198.
Dempster et al., “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, 1977, pp. 1-38.
AT&T Corp
Voice quality compensation system for speech synthesis based...