Reexamination Certificate
1998-11-24
2001-06-26
Tsang, Fan (Department: 2645)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Synthesis
C704S258000
active
06253182
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention relates to speech synthesis. In particular, the present invention relates to time and pitch scaling in speech synthesis.
Text-to-speech systems have been developed to allow computerized systems to communicate with users through synthesized speech. Concatenative speech synthesis systems convert input text into speech by generating small speech segments for small units of the text. These small speech segments are then concatenated to form the complete speech signal.
To create the small speech segments, a text-to-speech system accesses a database that contains samples of a human trainer's voice. The samples are generally grouped in the database according to the speech units they are taken from. In many systems, the speech units are phonemes, which are associated with the individual sounds of speech. However, other systems use diphones (two phonemes) or triphones (three phonemes) as the basis for their database.
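As a rough illustration of the lookup-and-concatenate flow described above, the following Python sketch maps a sequence of speech units to stored samples and joins them. The database contents, the unit labels, and the function name `concatenate_units` are hypothetical and chosen only for illustration; an actual system would store many examples per unit and far more samples per example.

```python
# Hypothetical database: each speech unit label maps to stored waveform samples.
# The values are made up and are not taken from the patent.
unit_database = {
    "s": [0.10, 0.20],
    "t": [0.30],
    "ao": [0.50, 0.40, 0.30],
    "r": [0.20, 0.10],
}


def concatenate_units(unit_labels, database):
    """Look up each unit's stored samples and join them into one signal."""
    signal = []
    for label in unit_labels:
        signal.extend(database[label])
    return signal


speech = concatenate_units(["s", "t", "ao", "r"], unit_database)
```

Because the stored examples are joined without modification, any mismatch in pitch or duration between neighboring units carries straight through to the output.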
The number of bits that can be used to describe each sample for each speech unit is limited by the memory of the system. Thus, text-to-speech systems generally cannot store values that exactly describe the training speech units. Instead, text-to-speech systems only store values that approximate the training speech units. This causes an approximation error in the stored samples, which is sometimes referred to as a compression error.
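The effect of a limited bit budget can be sketched with a uniform scalar quantizer. This is only an assumed, simplified stand-in for whatever coding scheme a given system uses; the function names, the value range of -1.0 to 1.0, and the sample values are all hypothetical.

```python
def step_size(bits, lo=-1.0, hi=1.0):
    """Spacing between adjacent levels when [lo, hi] is covered by 2**bits levels."""
    return (hi - lo) / (2 ** bits - 1)


def quantize(value, bits, lo=-1.0, hi=1.0):
    """Return the stored approximation of `value` using `bits` bits."""
    step = step_size(bits, lo, hi)
    index = round((value - lo) / step)          # nearest level index
    index = max(0, min(2 ** bits - 1, index))   # clamp to the valid range
    return lo + index * step


def max_error(samples, bits):
    """Largest approximation (compression) error over the given samples."""
    return max(abs(s - quantize(s, bits)) for s in samples)


samples = [0.0, 0.123, -0.456, 0.789, -0.999]   # made-up sample values
coarse_error = max_error(samples, 4)
fine_error = max_error(samples, 8)
```

For in-range values the error is bounded by half the step size, so each extra bit roughly halves the worst-case error at the cost of more memory per sample.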
The number of examples of each speech unit that can be stored for the speech system is also limited by the memory of the computer system. Different examples of each speech unit are needed because the speech units change slightly depending on their position within a sentence and their proximity to other speech units. In particular, the pitch and duration of the speech unit, also known as the prosody of the speech unit, will change significantly depending on the speech unit's location. For example, in the sentence “Joe went to the store” the speech units associated with the word “store” have a lower pitch than in the question “Joe went to the store?”
Since the number of examples that can be stored for each speech unit is limited, a stored example may not always match the prosody of its surrounding speech units when it is combined with other units. In addition, the transition between concatenated speech units is sometimes discontinuous because the speech units have been taken from different parts of the training session.
To correct these problems, the prior art has developed techniques for changing the pitch and duration of a stored speech unit so that the speech unit better fits the context in which it is being used. An example of one such prior art technique is the so-called Time-Domain Pitch-Synchronous Overlap-and-Add (TD-PSOLA) technique, which is described in “Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis using Diphones”, E. Moulines and F. Charpentier, Speech Communication, vol. 9, no. 5, pp. 453-467, 1990. Using this technique, the prior art increases the pitch of a speech unit by identifying a section of the speech unit responsible for the pitch. This section is a complex waveform that is a sum of sinusoids at multiples of a fundamental frequency F0. The pitch period is defined by the distance between two pitch peaks in the waveform. To increase the pitch, the prior art copies a segment of the complex waveform that is as long as the pitch period. This copied segment is then shifted by some portion of the pitch period and reinserted into the waveform. For example, to double the pitch, the copied segment would be shifted by one-half the pitch period, thereby inserting a new peak half-way between two existing peaks and cutting the pitch period in half.
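The pitch-doubling example above can be sketched in a few lines of Python. This is a deliberately simplified model of the description in this paragraph, not an implementation of TD-PSOLA as published (which windows and re-spaces overlapping segments). The pulse shape, the period of 8 samples, and all function names are assumptions made for illustration.

```python
def pulse_train(period, n_periods, pulse):
    """Toy voiced waveform: one copy of `pulse` starting at every pitch mark."""
    signal = [0.0] * (period * n_periods)
    for k in range(n_periods):
        for i, value in enumerate(pulse):
            signal[k * period + i] += value
    return signal


def insert_shifted_copies(signal, period, shift):
    """Copy each period-long segment, shift it by `shift` samples, and add it back."""
    out = list(signal)
    for start in range(0, len(signal), period):
        segment = signal[start:start + period]
        for i, value in enumerate(segment):
            pos = start + shift + i
            if pos < len(out):
                out[pos] += value
    return out


def peak_positions(signal):
    """Indices of the samples that reach the maximum value of the signal."""
    top = max(signal)
    return [i for i, v in enumerate(signal) if abs(v - top) < 1e-9]


period = 8
original = pulse_train(period, 4, [1.0, 0.5, 0.25])   # pulse must be shorter than period
doubled = insert_shifted_copies(original, period, period // 2)

original_peaks = peak_positions(original)
doubled_peaks = peak_positions(doubled)
```

In this toy case the original peaks are 8 samples apart and the modified signal has peaks every 4 samples, which corresponds to halving the pitch period.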
To lengthen a speech unit, the prior art copies a section of the speech unit and inserts the copy into the complex waveform. In other words, the entire portion of the speech unit after the copied segment is time-shifted by the length of the copied segment so that the duration of the speech unit increases.
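A minimal sketch of this lengthening step, assuming the speech unit is held as a plain list of samples; the function name `lengthen` and the integer sample values are hypothetical and serve only to make the shifted positions easy to see.

```python
def lengthen(signal, start, length):
    """Copy signal[start:start+length] and insert the copy right after it.

    Everything after the copied section is shifted later by `length` samples.
    """
    end = start + length
    return signal[:end] + signal[start:end] + signal[end:]


unit = list(range(12))            # stand-in samples 0..11
longer = lengthen(unit, 4, 4)     # repeat samples 4..7 once
```

Here `longer` is `[0, 1, 2, 3, 4, 5, 6, 7, 4, 5, 6, 7, 8, 9, 10, 11]`: the duration grows by the length of the copied section while the samples before it are untouched.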
Unfortunately, these techniques for modifying the prosody of a speech unit have not produced completely satisfactory results. As such, a new technique is needed for modifying the pitch and duration of speech units during speech synthesis.
SUMMARY OF THE INVENTION
The present invention provides a method for synthesizing speech by modifying the prosody of individual components of a training speech signal and then combining the modified speech segments. The method includes selecting an input speech segment and identifying an output prosody. The prosody of the input speech segment is then changed by independently changing the prosody of a voiced component and an unvoiced component of the input speech signal. These changes produce an output voiced component and an output unvoiced component that are combined to produce an output speech segment. The output speech segment is then combined with other speech segments to form synthesized speech.
In another embodiment of the invention, a time-domain training speech signal is converted into frequency-domain values that are quantized into codewords. The codewords are retrieved based on an input text and are filtered to produce a descriptor function. The filtering limits the rate of change of the descriptor function. Based on the descriptor function, an output set of frequency-domain values is identified, and these values are then converted into time-domain values representing portions of the synthesized speech.
By filtering the codewords to produce a descriptor function, the present invention is able to reduce the effects of compression error inherent in quantizing the frequency-domain values into codewords and is able to smooth out transitions between and within speech units.
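The smoothing effect of such filtering can be illustrated with a simple moving average applied to a sequence of quantized values, one per frame. The text here does not specify the filter, so the moving average, the window width, and the example values are assumptions; the sketch only shows that low-pass filtering reduces the largest frame-to-frame jump.

```python
def moving_average(values, width=3):
    """Average each value with its neighbors; the window shrinks at the edges."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half):min(len(values), i + half + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def max_step(values):
    """Largest change between consecutive values."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))


# Made-up sequence of decoded codeword values for one frequency-domain parameter.
quantized = [0.0, 0.0, 0.25, 0.25, 0.25, 0.75, 0.75, 1.0, 1.0]
descriptor = moving_average(quantized)
```

The stair-step jumps introduced by quantization, and any abrupt change at a boundary between units, are spread over neighboring frames, so `max_step(descriptor)` is smaller than `max_step(quantized)`.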
Other aspects of the invention include using the descriptor function to identify frequency-domain values at time marks associated with an output prosody that is different from the input prosody of the training speech signal.
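One way to picture reading a descriptor function at new time marks is linear interpolation between the values known at the input time marks. Linear interpolation is an assumption made for this sketch, as are the function name `sample_at` and the example times and values; the text does not say how the function is evaluated.

```python
def sample_at(times, values, t):
    """Linearly interpolate the (times, values) curve at time t."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        if t0 <= t <= t1:
            v0, v1 = values[i], values[i + 1]
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Made-up values of one parameter at the input time marks (in milliseconds).
input_times = [0.0, 10.0, 20.0]
input_values = [1.0, 3.0, 2.0]

# Output time marks spaced twice as densely, as a higher output pitch would require.
output_times = [0.0, 5.0, 10.0, 15.0, 20.0]
output_values = [sample_at(input_times, input_values, t) for t in output_times]
```

With these inputs, `output_values` is `[1.0, 2.0, 3.0, 2.5, 2.0]`: values at the original marks are preserved and the new marks receive intermediate values.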
REFERENCES:
patent: 5617507 (1997-04-01), Lee et al.
patent: 5905972 (1999-05-01), Huang et al.
Parsons, TW, Voice and Speech Processing, McGraw Hill, pp. 284-285, Dec. 1987.*
Flanagan et al, Synthetic Voices for Computers, IEEE Spectrum, Oct. 1970.*
A. Acero, “Source-Filter Models for Time-Scale Pitch-Scale Modification of Speech”, IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 2, Seattle, pp. 881-884, May 1998.
X. Huang et al., “Recent Improvements on Microsoft's Trainable Text-to-Speech System: Whistler”, IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany, pp. 959-962, Apr. 1997.
W.B. Kleijn et al., “Transformation and Decomposition of the Speech Signal for Coding”, IEEE Signal Processing Letters, vol. 1, No. 9, pp. 136-138, 1994.
E. Moulines et al., “Pitch-synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones”, Speech Communication, vol. 9, No. 5, pp. 453-467, 1990.
Y. Stylianou et al., “High-Quality Speech Modification based on a Harmonic + Noise Model”, Proc. of Eurospeech Conference, Madrid, Spain, pp. 451-554, 1995.
Magee Theodore M.
Microsoft Corporation
Sax Robert Louis
Tsang Fan
Westman Champlin & Kelly P.A.