Reexamination Certificate
1999-04-14
2001-05-22
Dorvil, Richemond (Department: 2641)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Synthesis
active
06236966
ABSTRACT:
FIELD OF THE INVENTION
This invention relates to the field of audio synthesis, and in particular to systems and methods for generating control parameters for audio synthesis.
BACKGROUND OF THE INVENTION
The field of sound synthesis, and in particular speech synthesis, has received less attention historically than fields such as speech recognition. This may be because early in the research process, the problem of generating intelligible speech was solved, while the problem of recognition is only now being solved. However, these traditional speech synthesis solutions still suffer from many disadvantages. For example, conventional speech synthesis systems are difficult and tiring to listen to, can garble the meaning of an utterance, are inflexible, unchanging, unnatural-sounding and generally ‘robotic’ sounding. These disadvantages stem from difficulties in reproducing or generating the subtle changes in pitch, cadence (segmental duration), and other vocal qualities (often referred to as prosodics) which characterize natural speech. The same is true of the transitions between speech segments themselves (formants, diphones, LPC parameters, etc.).
The traditional approaches in the art to generating these subtler qualities of speech tend to operate under the assumption that the small variations in quantities such as pitch and duration observed in natural human speech are just noise and can be discarded. As a result, these approaches have primarily used inflexible methods involving fixed formulas, rules and the concatenation of a relatively small set of prefigured geometric contour segments. These approaches thus eliminate or ignore what might be referred to as microprosody and other microvariations within small pieces of speech.
Recently, the art has seen some attempts to use learning machines to create more flexible systems which respond more reasonably to context and which generate somewhat more complex and evolving parameter (e.g., pitch) contours. For example, U.S. Pat. No. 5,668,926 issued to Karaali et al. describes such a system. However, these approaches are also flawed. First, they organize their learning architecture around fixed-width time slices, typically on the order of 10 ms per time slice. These fixed time segments, however, are not inherently or meaningfully related to speech or text. Second, they have difficulty making use of the context of any particular element of the speech: what context is present is represented at the same level as the fixed time slices, severely limiting the effective width of context that can be used at one time. Similarly, different levels of context are confused, making it difficult to exploit the strengths of each. Additionally, by marrying context to fixed-width time slices, the learning engine is not presented with a stable number of symbolic elements (e.g., phonemes or words) over different patterns.
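By way of illustration only, the last point can be made concrete with a short sketch. It counts how many phonemes fall inside a context window built from ten 10 ms slices, for a fast and a slow rendition of the same ten phonemes. The function name, the durations, and the window size are assumptions chosen for the example and do not come from any cited system.

```python
def symbols_in_slice_window(durations_ms, center_ms, n_slices=10, slice_ms=10):
    """Count the phonemes overlapped by a window of fixed-width time slices.

    durations_ms: consecutive phoneme durations (hypothetical values).
    center_ms:    time at which the slice window is centered.
    """
    half = n_slices * slice_ms / 2
    start, end = center_ms - half, center_ms + half
    count, t = 0, 0.0
    for d in durations_ms:
        # A phoneme occupying [t, t + d) counts if it overlaps [start, end).
        if t < end and t + d > start:
            count += 1
        t += d
    return count


# The same ten phonemes, spoken quickly and slowly (assumed durations).
fast = [40] * 10    # 400 ms in total
slow = [120] * 10   # 1200 ms in total

fast_count = symbols_in_slice_window(fast, center_ms=200)
slow_count = symbols_in_slice_window(slow, center_ms=600)
print(fast_count, slow_count)
```

Under these assumptions the same 100 ms of context covers four phonemes in the fast rendition and two in the slow one, so the number of symbolic elements seen by the learning engine changes with speaking rate even though the text does not.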
Finally, none of these prior art models attempts to apply learning models to non-verbal sound modulation and generation, such as musical phrasing and non-lexical vocalizations. Nor do they address the modulation and generation of emotional speech, voice quality variation (whisper, shout, gravelly, accent), etc.
SUMMARY OF THE INVENTION
In view of the above, it is an object of the present invention to provide a system and method for the production of prosodics and other audio control parameters from meaningful symbolic representations of desired sounds. Another object of the invention is to provide such a technique that avoids problems associated with using fixed-time-length segments to represent information at the input of the learning machine. It is yet another object of the invention to provide such a system that takes into account contextual information and multiple levels of abstraction.
Another object of the invention is to provide a system for the production of audio control parameters which has the ability to produce a wide variety of outputs. Thus, an object is to provide such a system that is capable of producing all necessary parameters for sound generation, or can specialize in producing a subset of these parameters, augmenting or being augmented by other systems which produce the remaining parameters. In other words, it is an object of the invention to provide an audio control parameter generation system that maintains a flexibility of application as well as of operation. It is a further object of the invention to provide a system and method for the production of audio control parameters for not only speech synthesis, but for many different types of sounds, such as music, backchannel and non-lexical vocalizations.
In one aspect of the invention, a method implemented on a computational learning machine is provided for producing audio control parameters from symbolic representations of desired sounds. The method comprises presenting symbols to multiple input windows of the learning machine. The multiple input windows comprise at least a lowest window and a higher window. The symbols presented to the lowest window represent audio information having a low level of abstraction, such as phonemes, and the symbols presented to the higher window represent audio information having a higher level of abstraction, such as words. The method further includes generating parameter contours and temporal scaling parameters from the symbols presented to the multiple input windows, and then temporally scaling the parameter contours in accordance with the temporal scaling parameters to produce the audio control parameters. In a preferred embodiment, the symbols presented to the multiple input windows represent sounds having various durations. In addition, the step of presenting the symbols to the multiple input windows comprises coordinating presentation of symbols to the lowest level window with presentation of symbols to the higher level window. The coordinating is performed such that a symbol in focus within the lowest level window is contained within a symbol in focus within the higher level window. The audio control parameters produced represent prosodic information pertaining to the desired sounds.
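By way of illustration only, the coordinated presentation of symbols and the temporal scaling step described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the `Word` type, the window widths, the padding symbol, the linear interpolation, and the 10 ms output frame are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    phonemes: list[str]


def coordinated_windows(words, low_width=3, high_width=3, pad="_"):
    """Yield (phoneme_window, word_window) pairs, one per phoneme in focus.

    The center of each window is the symbol in focus. The word in focus
    always contains the phoneme in focus. Widths are assumed to be odd.
    """
    phonemes = [p for w in words for p in w.phonemes]
    low_half, high_half = low_width // 2, high_width // 2
    padded_low = [pad] * low_half + phonemes + [pad] * low_half
    padded_high = [pad] * high_half + [w.text for w in words] + [pad] * high_half
    i = 0
    for wi, word in enumerate(words):
        for _ in word.phonemes:
            # The higher window advances only when the word in focus changes.
            yield padded_low[i:i + low_width], padded_high[wi:wi + high_width]
            i += 1


def scale_contour(contour, duration_ms, frame_ms=10.0):
    """Linearly resample a fixed-length contour to span duration_ms."""
    n_out = max(1, round(duration_ms / frame_ms))
    if n_out == 1 or len(contour) == 1:
        return [contour[0]] * n_out
    out = []
    for k in range(n_out):
        pos = k * (len(contour) - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out


words = [Word("the", ["DH", "AH"]), Word("cat", ["K", "AE", "T"])]
windows = list(coordinated_windows(words))

# A three-point pitch contour (Hz) stretched over an assumed 50 ms duration.
scaled = scale_contour([100.0, 120.0, 100.0], duration_ms=50)
```

In this sketch one window pair is produced per phoneme, so the learning machine sees the same number of symbols regardless of how long each phoneme lasts; duration enters only afterwards, when a contour of fixed length is stretched or compressed by the temporal scaling parameter.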
Depending on the application, the method may involve symbols representing lexical utterances, symbols representing non-lexical vocalizations, or symbols representing musical sounds. Some examples of symbols are symbols representing diphones, demisyllables, phonemes, syllables, words, clauses, phrases, sentences, paragraphs, emotional content, tempos, time-signatures, accents, durations, timbres, phrasings, or pitches. The audio control parameters may contain amplitude information, pitch information, phoneme durations, or phoneme pitch contours. Those skilled in the art will appreciate that these examples are illustrative only, and that many other symbols can be used with the techniques of the present invention.
In another aspect of the invention, a method is provided for training a learning machine to produce audio control parameters from symbolic representations of desired sounds. The method includes presenting symbols to multiple input windows of the learning machine, where the multiple input windows comprise a lowest window and a higher window, where symbols presented to the lowest window represent audio information having a low level of abstraction, and where the symbols presented to the higher window represent audio information having a higher level of abstraction. The method also includes generating audio control parameters from outputs of the learning machine, and adjusting the learning machine to reduce a difference between the generated audio control parameters and corresponding parameters of the desired sounds.
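By way of illustration only, the adjusting step described above might be sketched with a very small learning machine. The sketch assumes a linear model over one-hot encoded window symbols, a squared-error difference measure, and per-example gradient updates; none of these choices is specified by the description, and the vocabularies and target durations are hypothetical.

```python
def one_hot(symbol, vocab):
    return [1.0 if symbol == v else 0.0 for v in vocab]


def encode(low_window, high_window, low_vocab, high_vocab):
    """Concatenate one-hot codes for every symbol in both input windows."""
    features = []
    for s in low_window:
        features += one_hot(s, low_vocab)
    for s in high_window:
        features += one_hot(s, high_vocab)
    return features


def predict(weights, bias, x):
    return bias + sum(w * xi for w, xi in zip(weights, x))


def mean_squared_error(weights, bias, examples):
    errors = [(predict(weights, bias, x) - y) ** 2 for x, y in examples]
    return sum(errors) / len(errors)


def train(examples, lr=0.1, epochs=200):
    """Adjust the model to reduce the difference between output and target."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in examples:
            err = predict(weights, bias, x) - target
            # Gradient step on the squared error for this example.
            bias -= lr * err
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias


# Hypothetical training data: a phoneme in focus, its word, and a duration (s).
low_vocab, high_vocab = ["AH", "AE"], ["the", "cat"]
examples = [
    (encode(["AH"], ["the"], low_vocab, high_vocab), 0.08),
    (encode(["AE"], ["cat"], low_vocab, high_vocab), 0.12),
]

loss_before = mean_squared_error([0.0] * 4, 0.0, examples)
weights, bias = train(examples)
loss_after = mean_squared_error(weights, bias, examples)
```

Here the generated parameter is a single phoneme duration; the same loop would apply to any of the audio control parameters named above, with the difference measured against the corresponding parameters of the desired sounds.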
These and other advantageous aspects of the present invention will become apparent from the following description and associated drawings.
Dorvil Richemond
Lumen Intellectual Property Services Inc.
Wieland Susan