Reexamination Certificate
1999-03-09
2001-03-13
Hudspeth, David (Department: 2641)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Synthesis
C704S254000
active
06202049
ABSTRACT:
BACKGROUND AND SUMMARY OF THE INVENTION
The present invention relates to concatenative speech synthesis systems. In particular, the invention relates to a system and method for identifying appropriate edge boundary regions for concatenating speech units. The system employs a speech unit database populated using speech unit models.
Concatenative speech synthesis exists in a number of different forms today, which differ in how the concatenative speech units are stored and processed. These forms include time domain waveform representations, frequency domain representations (such as a formant representation or a linear predictive coding (LPC) representation), or some combination of these.
Regardless of the form of speech unit, concatenative synthesis is performed by identifying appropriate boundary regions at the edges of each unit, where units can be smoothly overlapped to synthesize new sound units, including words and phrases. Speech units in concatenative synthesis systems are typically diphones or demisyllables. As such, their boundary overlap regions are phoneme-medial. Thus, for example, the word “tool” could be assembled from the units ‘tu’ and ‘ul’ derived from the words “tooth” and “fool.” What must be determined is how much of the source words should be saved in the speech units, and how much they should overlap when put together.
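The overlap step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name crossfade_concatenate, the linear cross-fade weights, and the toy number sequences standing in for the units 'tu' and 'ul' are all hypothetical and are not taken from the patent, which leaves the mixing method and the unit representation open.

```python
def crossfade_concatenate(left, right, overlap):
    """Join two units, mixing the last `overlap` frames of `left`
    with the first `overlap` frames of `right` (linear cross-fade).

    Assumption: each unit is a plain list of numbers standing in for
    waveform samples or per-frame synthesis parameters.
    """
    if overlap < 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must fit inside both units")
    split = len(left) - overlap
    head = list(left[:split])          # part of `left` used unmixed
    tail = list(right[overlap:])       # part of `right` used unmixed
    mixed = []
    for i in range(overlap):
        # Weight shifts gradually from the left unit to the right unit.
        w = (i + 1) / (overlap + 1)
        mixed.append((1.0 - w) * left[split + i] + w * right[i])
    return head + mixed + tail


# Toy stand-ins for the units 'tu' (from "tooth") and 'ul' (from "fool").
# The shared vowel is represented by the value 1.0.
tu = [0.0, 0.0, 1.0, 1.0, 1.0]
ul = [1.0, 1.0, 1.0, 2.0, 2.0]
tool = crossfade_concatenate(tu, ul, overlap=3)
print(tool)  # [0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 2.0]
```

The result is shorter than the two units laid end to end by exactly the overlap length, which is why the choice of how much of each source word to keep and how much to overlap has to be made together.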
In prior work on concatenative text-to-speech (TTS) systems, a number of methods have been employed to determine overlap regions. In the design of such systems, three factors come into consideration:
Seamless Concatenation: Overlapping two speech units should provide a smooth enough transition between one unit and the next that no abrupt change can be heard. Listeners should have no idea that the speech they are hearing is being assembled from pieces.
Distortion-free Transition: Overlapping two speech units should not introduce any distortion of its own. Units should be mixed in such a way that the result is indistinguishable from non-overlapped speech.
Minimal System Load: The computational and/or storage requirements imposed on the synthesizer should be as small as possible.
In current systems there is a tradeoff between these three goals. No system is optimal with respect to all three. Current approaches can generally be grouped according to two choices they make in balancing these goals. The first is whether they employ short or long overlap regions. A short overlap can be as quick as a single glottal pulse, while a long overlap can comprise the bulk of an entire phoneme. The second choice involves whether the overlap regions are consistent or allowed to vary contextually. In the former case, like portions of each sound unit are overlapped with the preceding and following units, regardless of what those units are. In the latter case, the portions used are varied each time the unit is used, depending on adjacent units.
Long overlap has the advantage of making transitions between units more seamless, because there is more time to iron out subtle differences between them. However, long overlaps are prone to create distortion. Distortion results from mixing unlike signals.
Short overlap has the advantage of minimizing distortion. With short overlap it is easier to ensure that the overlapping portions are well matched. Short overlapping regions can be approximately characterized as instantaneous states (as opposed to dynamically varying states). However, short overlap sacrifices the seamless concatenation found in long overlap systems.
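The tradeoff between long and short overlap can be made concrete with a toy measurement. The sketch below assumes each unit is reduced to one number per frame (for example a formant frequency in Hz) and uses mean squared difference as a stand-in for "mixing unlike signals"; the function name overlap_mismatch, the metric, and the sample trajectories are illustrative assumptions, not part of the patent.

```python
def overlap_mismatch(left, right, overlap):
    """Mean squared difference between the frames that would be mixed:
    the last `overlap` frames of `left` and the first `overlap` of `right`.
    """
    if overlap < 1 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must be between 1 and the shorter unit length")
    a = left[len(left) - overlap:]
    b = right[:overlap]
    return sum((x - y) ** 2 for x, y in zip(a, b)) / overlap


# Hypothetical trajectories: the units agree at the joining frame but
# move in opposite directions on either side of it.
left_unit = [300, 320, 340, 360, 380, 400]    # rising into the join
right_unit = [400, 380, 360, 340, 320, 300]   # falling away from the join

short = overlap_mismatch(left_unit, right_unit, 1)   # 0.0
long_ = overlap_mismatch(left_unit, right_unit, 4)   # 2000.0
```

In this made-up case the one-frame overlap is perfectly matched, while the four-frame overlap mixes frames that differ by up to 60 Hz. A longer overlap gives more time to smooth the join only if the overlapped portions are actually alike.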
While it would be desirable to have the seamlessness of long overlap techniques and the low distortion of short overlap techniques, to date no systems have been able to achieve this. Some contemporary systems have experimented with using variable overlap regions in an effort to minimize distortion while retaining the benefits of long overlap. However, such systems rely heavily on computationally expensive processing, making them impractical for many applications.
The present invention employs a statistical modeling technique to identify the nuclear trajectory regions within sound units. These regions are then used to identify the optimal overlap boundaries. In the presently preferred embodiment, time-series data is statistically modeled using Hidden Markov Models that are constructed on the phoneme region of each sound unit and then optimally aligned through training or embedded re-estimation.
In the preferred embodiment, the initial and final phonemes of each sound unit are each considered to consist of three elements: the nuclear trajectory, a transition element preceding the nuclear region, and a transition element following the nuclear region. The modeling process optimally identifies these three elements, such that the nuclear trajectory region remains relatively consistent for all instances of the phoneme in question.
With the nuclear trajectory region identified, the beginning and ending boundaries of the nuclear region serve to delimit the overlap region that is thereafter used for concatenative synthesis.
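A simplified sketch of how a three-element model can yield the two boundaries of the nuclear region follows. It is not the patent's procedure: it assumes one scalar feature per frame, one Gaussian per element with fixed, hand-picked parameters (no training or embedded re-estimation), and it ignores transition probabilities. Under those assumptions, an exhaustive search over the two boundaries gives the same answer as Viterbi alignment of a three-state left-to-right model. All names and numbers are hypothetical.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log-density of x under a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def align_three_states(frames, states):
    """Split `frames` into leading transition, nuclear trajectory and
    trailing transition, each at least one frame long.

    states: three (mean, variance) pairs, in that order.
    Returns (start, end) such that frames[start:end] is the nuclear region.
    """
    n = len(frames)
    if n < 3:
        raise ValueError("need at least one frame per element")
    # prefix[s][t] = total log-likelihood of frames[:t] under state s
    prefix = []
    for mean, var in states:
        acc = [0.0]
        for x in frames:
            acc.append(acc[-1] + gaussian_loglik(x, mean, var))
        prefix.append(acc)
    best_score, best = -math.inf, None
    for start in range(1, n - 1):
        for end in range(start + 1, n):
            score = (prefix[0][start]
                     + prefix[1][end] - prefix[1][start]
                     + prefix[2][n] - prefix[2][end])
            if score > best_score:
                best_score, best = score, (start, end)
    return best


# Hypothetical feature track for one phoneme: it rises, holds steady
# near 3.0, then falls again.
track = [1.0, 1.5, 2.0, 3.0, 3.0, 3.1, 2.9, 3.0, 2.0, 1.0]
model = [(1.5, 0.25), (3.0, 0.05), (1.5, 0.25)]
nucleus = align_three_states(track, model)
print(nucleus)  # (3, 8)
```

The returned pair marks the beginning and ending boundaries of the steady region, which in the scheme described above would delimit the overlap region used at synthesis time.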
The presently preferred implementation employs a statistical model that has a data structure for separately modeling the nuclear trajectory region of a vowel, a first transition element preceding the nuclear trajectory region and a second transition element following the nuclear trajectory region. The data structure may be used to discard a portion of the sound unit data, corresponding to that portion of the sound unit that will not be used during the concatenation process.
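One way to picture such a data structure is shown below. The class and field names are hypothetical, frame ranges are half-open index pairs, and the trimming rule is an assumption: the source says only that the unused portion may be discarded, so this sketch guesses that the unused portion is whatever lies outside the nuclear regions at the two edges of the unit.

```python
from dataclasses import dataclass
from typing import List, Tuple

FrameRange = Tuple[int, int]  # half-open (start, end) frame indices


@dataclass
class PhonemeSegmentation:
    """The three modeled elements of one boundary phoneme of a unit."""
    leading_transition: FrameRange
    nuclear_trajectory: FrameRange
    trailing_transition: FrameRange


def trim_unit(frames: List[float],
              initial: PhonemeSegmentation,
              final: PhonemeSegmentation) -> List[float]:
    """Keep the unit from the start of the initial phoneme's nuclear
    region to the end of the final phoneme's nuclear region.

    Assumption: frames outside that span are the ones not needed
    during concatenation.
    """
    keep_from = initial.nuclear_trajectory[0]
    keep_to = final.nuclear_trajectory[1]
    if keep_from >= keep_to:
        raise ValueError("nuclear regions are out of order")
    return frames[keep_from:keep_to]


# Hypothetical 20-frame unit whose frames are labeled by index.
unit = [float(i) for i in range(20)]
first = PhonemeSegmentation((0, 3), (3, 7), (7, 9))
last = PhonemeSegmentation((12, 14), (14, 18), (18, 20))
stored = trim_unit(unit, first, last)  # frames 3 through 17
```

Storing only the trimmed frames, together with the two nuclear ranges, would be enough to recover the overlap regions later without keeping the full source recording.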
The invention has a number of advantages and uses. It may be used as a basis for automated construction of speech unit databases for concatenative speech synthesis systems. The automated techniques both improve the quality of derived synthesized speech and save a significant amount of labor in the database collection process.
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.
REFERENCES:
patent: 5349645 (1994-09-01), Zhao
patent: 5400434 (1995-03-01), Pearson
patent: 5617507 (1997-04-01), Lee et al.
patent: 5684925 (1997-11-01), Morin et al.
patent: 5751907 (1998-05-01), Moebius et al.
patent: 5913193 (1999-06-01), Huang et al.
patent: 0 805 433 (1997-05-01), None
Mercier, G., D. Bigorgne, L. Miclet, L. LeGuenne, and M. Querre, “Recognition of Speaker-dependent Continuous Speech with KEAL,” IEE Proceedings-Communications, Speech, and Vision, Part I, vol. 136, iss. 2, Apr. 1989, pp. 145-154.
Weigel, Walter, “Continuous Speech-Recognition with Vowel-Context-Independent Hidden Markov Models for Demisyllables,” Proc. ICSLP, Kobe Japan, Nov. 1990, pp. 701-704.
Matsui, K., S. D. Pearson, K. Hata, and T. Kamai, “Improving Naturalness in Text-to-Speech Synthesis Using Natural Glottal Source,” 1991 Int. Conf. Acoust., Speech, Sig. Proc., 1991, ICASSP-91, vol. 2, Apr. 14-17 1991, pp. 769-772.
Boeffard, O., L. Miclet, and S. White, “Automatic Generation of Optimized Unit Dictionaries for Text to Speech Synthesis,” Int. Conf. Spoken Language Proc., Banff, Alberta, Canada, vol. 2, Oct. 12-16, 1992, pp. 1211-1214.
Hon, H., Acero, A., Huang, X., Liu, J., and Plumpe, M.; “Automatic Generation Of Synthesis Units For Trainable Text-To-Speech Systems”; Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No. 98CH36181) Part vol. 1; pp. 293-296 vol. 1; May 1998.
Boeffard, O., Miclet, L., and White, S.; “Automatic Generation Of Optimized Unit Dictionaries For Text To Speech Synthesis”; In Proceedings ICSLP 92, Banff, Alberta, Canada; pp. 1211-1214; 1992.
Conkie, Alistair D., and Isard, Stephen; “Optimal Coupling of Diphones”; Text-To-Speech Synthesis: Progress In Speech Synthesis Workshop; 2nd; pp. 293-304; Spring 1996.
Kibre Nicholas
Pearson Steve
Harness & Dickey & Pierce P.L.C.
Hudspeth David
Matsushita Electric Industrial Co., Ltd.
Storm Donald L.