Data processing: speech signal processing – linguistics – language – Speech signal processing – Synthesis
Reexamination Certificate
1998-09-14
2001-10-23
Tsang, Fan (Department: 2645)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Synthesis
C704S251000, C704S254000, C704S258000
Reexamination Certificate
active
06308156
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates to a digital speech-synthesis process.
2. The Prior Art
Three processes are essentially known in the synthetic generation of speech with computers.
In formant synthesis, the resonance properties of the human vocal tract and its variations in the course of speaking, which are caused by the movements of the speech organs, are simulated by a filtered excitation signal. Such resonances are characteristic of the structure of vowels and their perception. For limiting the computing expenditure, the first three to five formants of a speech are generated synthetically with the excitation source. Therefore, with this type of synthesis, the memory location requirements in a computer are low. Furthermore, a simple change can be realized in duration and in the fundamental of the rule set excitation waveforms. However, the drawback is that an extensive rule set is needed for speech synthesis output, which often requires the use of digital processors. Furthermore, it is a disadvantage that the speech output sounds unnatural and metallic, and that it has special weak points in connection with nasals and obstruents, i.e., with plosives /p, t, k, b, d, g/, affricates /pf, ts, tS/ and fricatives /f, v, s, z, S, Z, C, j, x, h/.
In the present text, the letters shown between slashes (//) represent sound symbols according to the SAMPA-notation; cf: Wells, J.; Barry, W. J.; Grice, M.; Fourcin, A.; Gibbon, D. [1992]; Standard Computer-Compatible Transcription, in: ESPRIT PROJECT 2589 (SAM); Multi-Lingual Speech Input/ Output Assessment, Methodology and Standardization; Final Report; Doc. SAM-UCL-037, pp 29 ff.
In articulatory synthesis, the acoustic conditions in the vocal tract are modeled, so that the articulatory gestures and movements during speaking are simulated mathematically. Thus an acoustic model of the vocal tract is computed, which leads to substantial computing expenditure and which requires a high computing capacity. However, the automatic speech generated this way still sounds unnatural and technical.
Furthermore, the concatenation synthesis is known, where parts of really spoken utterances are concatenated in such a way that new utterances are generated. The individual speech segments thus form units for the generation of speech. The size of the segments may reach from words and phrases up to parts of sounds depending on the field of application. Demi-syllables or smaller demi-units can be used for speech synthesis with an unlimited vocabulary. Larger units are useful only if a limited vocabulary is to be synthesized.
In systems which do not use resynthesis, the choice of the correct cutting point of the speech components is decisive for the synthesis quality, and melodic and spectral jumps have to be avoided. Concatenative synthesis processes then achieve—especially with larger units—a more natural sound than the other methods. Furthermore, the controlling expenditure for generating the sounds is quite low. The limitations of this process lie in the relatively high memory requirements for the speech components needed. Another limitation of this process is that components, once recorded, can be changed (e.g. in duration or frequency) in the known systems only by costly resynthesis methods, which, furthermore, have an adverse effect on the sound of the speech and its comprehensibility. For this reason, also a number of different realizations of a speech unit are recorded, which, however, increases memory requirements.
The concatenation synthesis processes essentially comprise four synthesis methods permitting the speech synthesis without limitation of the vocabulary.
A concatenation of sounds or phones is carried out in phone synthesis. For Western European languages with a sound inventory of about 30 to 50 sounds and an average sound duration of about 150 ins, the memory requirements are acceptably low. However, these speech signal units lack the perceptively important transitions between the individual sounds, which, furthermore, can be recreated only incompletely by fading over individual sounds or even more complicated resynthesis methods. The quality of synthesis is, therefore, not satisfactory. Even storing allophonic variants of sounds in separate speech signal units in the so called allophone synthesis does not significantly enhance the speech result due to disregard of the articulatory-acoustic dynamics.
The most widely applied form of concatenation synthesis is the diphone synthesis, which employs speech signal units reaching from the middle of an acoustically defined speech sound up to the middle of the next speech sound. The perceptually important transitions from one sound to the next are taken into account in this way, such transitions appearing in the acoustic signal as a result of the movements of the speech organs. Furthermore, the speech signal units are thus concatenated at spectrally relatively constant places, which reduces the potentially present interferences of the signal flow on the joints of the individual diphones. The sound inventory of Western European languages consists of 35 to 50 sounds. For a language with 40 sounds, this theoretically results in 1600 pairs of diphones, which are then really reduced to about 1000 by phonotactic constraints. In natural speech, unstressed and stressed sounds differ in sound quality and duration. Different diphones are recorded in some systems for stressed and unstressed sound pairs in order to adequately take said differences into account in the synthesis. Therefore, 1000 to 2000 diphones with an average duration of about 150 ms are required depending on the projected configuration, resulting in a memory requirement for the speech signal units of up to 23 MB depending on the requirements with respect to dynamics and signal bandwidth. A common value amounts to approximately 8 MB.
The triphone and the demi-syllable syntheses are based on a principle similar to the one of the diphone synthesis. In this case too, the cutting point is disposed in the middle of the sounds. However, larger units are covered, which permits taking into account larger phonetic contexts. However, the number of combinations increases proportionally. In demi-syllable synthesis, one cutting point for the units used is in the middle of the vowel of a syllable. The other cutting point is at the beginning or at the end of a syllable, so that depending on the syllable structure, speech signal units can consist of sequences of several consonants. In German, about 52 different sound sequences exist in starting syllables of morphemes, and about 120 sound sequences for medial or final syllables of morphemes, resulting in a theoretical number of 6240 demi-syllables for the German language, of which some are uncommon. As demi-syllables are mostly longer than diphones, the memory requirements for speech signal units exceed those with diphones considerably.
The substantial memory requirements therefore pose the greatest problem in connection with a high-quality speech synthesis system. For reducing these requirements, it has been proposed, for example to exploit the silence in the closure of plosives for all plosive closures. A speech synthesis system is known from EP 0 144 731 Bl, where segments of diphones are used for several sounds. Said document describes a speech synthesizer which stores speech signal units which are generated by dividing a pair of sounds and relates such units with defined expression symbols. A synthesizing device reads the standard speech signal units from the memory in accordance with the output symbols of the converted sequence of expression symbols. Based on the speech segment of the input symbols it is determined whether two read standard speech signal units are connected directly when the input speech segment of the input symbols is voiceless, or whether a preset first interpolation process is applied when the input speech segment of the input symbol is voiced, the same standard signal unit being used both for a voiced sound /g, d, b/ and for its co
Barry William
Benzmüller Ralf
Luning Andreas
Collard & Roe P.C.
G Data Software GmbH
Opsasnick Michael N.
Tsang Fan
LandOfFree
Microsegment-based speech-synthesis process does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Microsegment-based speech-synthesis process, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Microsegment-based speech-synthesis process will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2614058