Reexamination Certificate
2001-03-15
2003-01-28
Dorvil, Richemond (Department: 2641)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Synthesis
C704S268000
active
06513008
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates generally to speech synthesis. More particularly, the present invention relates to a speech synthesizer customization system that is able to override speech synthesis data at all hierarchical levels of a dynamic data structure.
2. Discussion
As the quality of the output of speech synthesizers continues to increase, more and more applications are beginning to incorporate synthesis technologies. For example, car navigation systems, as well as devices for the vision impaired, are beginning to incorporate speech synthesizers. As the popularity of speech synthesis increases, however, a number of limitations with regard to conventional approaches have become apparent.
A particular difficulty relates to the fact that size and development cost considerations limit the vocabulary with which conventional synthesizers are able to deal. Briefly, FIGS. 1 and 2 illustrate that the typical synthesizer will have a dynamic data structure with hierarchical levels, wherein the dynamic data structure includes a linguistic tree 20 and an acoustic tree 22. The linguistic tree 20 typically contains syntactic and linguistic objects for the sentence being synthesized, while the acoustic tree 22 holds prosodic and acoustic objects for that sentence. Thus, during synthesis of a sentence, the two hierarchical tree-like structures are “built up” (or populated) based on the input text. It will be appreciated that usually, a tree has nodes such that a “parent” node has “branches” to each of its “child” nodes. The linguistic tree 20 and the acoustic tree 22 are referred to as tree-like structures because, here, a parent node only has access to the first child and last child, while the rest of the children are contained in a list. Furthermore, each child has access to the corresponding parent. Nevertheless, the levels of the tree structures constitute a hierarchy.
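The tree-like arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Node`, its field names, and the use of a sibling link to hold "the rest of the children" are hypothetical choices, not details taken from the patent.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(eq=False)
class Node:
    """One node of a tree-like structure (hypothetical layout)."""

    level: str  # e.g. "sentence", "word", "phoneme" (illustrative names)
    label: str
    parent: Optional[Node] = field(default=None, repr=False)
    first_child: Optional[Node] = field(default=None, repr=False)
    last_child: Optional[Node] = field(default=None, repr=False)
    next_sibling: Optional[Node] = field(default=None, repr=False)

    def append(self, child: Node) -> Node:
        """Attach child at the end of this node's child list."""
        child.parent = self  # every child can reach its parent
        if self.last_child is None:
            self.first_child = child
        else:
            self.last_child.next_sibling = child
        self.last_child = child
        return child

    def children(self) -> Iterator[Node]:
        """Walk the sibling list, starting from the first child."""
        node = self.first_child
        while node is not None:
            yield node
            node = node.next_sibling


# Usage: a sentence node populated with three word nodes.
sentence = Node("sentence", "turn left now")
for text in ("turn", "left", "now"):
    sentence.append(Node("word", text))

labels = [child.label for child in sentence.children()]
```

The parent holds direct references only to its first and last child; the middle child ("left") is reachable solely by walking the list, which is the property that distinguishes this layout from an ordinary tree.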
The above tree structures and node information for a particular sentence are built up in real time by various synthesis modules, with the assistance of a fixed (or standard) database. For example, a parsing module typically generates clauses and phrases from the sentence being synthesized, while a phoneticizer uses the standard database to build up morphs and phonemes from the words in the sentence. Syllabification and allophone rules contained in the standard database generate syllables and allophones from words, morphs, and phonemes. Prosody algorithms generate prosodic phrases, prosodic words, etc. from all previous information.
As shown in FIG. 3, the standard database 24 typically therefore contains tables with information to be placed in the nodes of the trees 20, 22. This is especially true for contemporary “concatenation synthesis”. It should be noted that the standard database 24 is also naturally hierarchical, since the data stored in the standard database 24 is intended to supply information for various level nodes in the dynamic trees 20, 22. Furthermore, data at higher levels of the database 24 may refer to lower level data (or vice versa). For example, information about a certain kind of phrase may refer to sequences of words and their corresponding dictionary information below. In this manner, data is shared (and memory conserved) by possible multiple references to the same data item. Roughly speaking, the standard database 24 is a relational database.
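A small sketch may help show how higher-level entries can refer to lower-level ones so that data is shared. The table names, identifiers, and phoneme spellings below are invented for illustration and do not come from the patent.

```python
# Hypothetical word-level table: one entry per dictionary word.
WORDS = {
    "w1": {"text": "turn", "phonemes": ["t", "er", "n"]},
    "w2": {"text": "left", "phonemes": ["l", "eh", "f", "t"]},
    "w3": {"text": "right", "phonemes": ["r", "ay", "t"]},
}

# Hypothetical phrase-level table: entries refer to words by id
# instead of copying the word data.
PHRASES = {
    "p1": {"kind": "command", "word_ids": ["w1", "w2"]},
    "p2": {"kind": "command", "word_ids": ["w1", "w3"]},
}


def resolve_phrase(phrase_id):
    """Follow the references from a phrase entry down to its word entries."""
    return [WORDS[word_id] for word_id in PHRASES[phrase_id]["word_ids"]]


turn_left = resolve_phrase("p1")
turn_right = resolve_phrase("p2")
```

Both phrases resolve to the very same entry for "turn", so the word's dictionary information is stored once and referenced twice.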
It is important to note that the above-described database 24 is designed for general unlimited synthesis, and has significant space and development cost problems. Because of these normal limitations, the size and complexity of the database 24 is typically limited. As a result, in order to tailor a given synthesizer to a particular application, it has been found that a user database is often necessary. In fact, synthesizers routinely provide “user dictionaries” which are loaded into the synthesizer and are application specific. Often, markup languages allow commands to be embedded in the input text in order to alter the synthesized speech from the standard result. For example, one approach involves inserting high and low tone marks (including numeric values), into the text to indicate where, and how much to raise an intonation peak.
While the above-described conventional approaches to user databases are useful in some circumstances, a number of difficulties remain. For example, the subsequently generated speech synthesis data cannot be uniformly overridden at all hierarchical levels of the dynamic data structure. Rather, the conventional synthesizer deals with a maximum of one or two hierarchical levels, and each with different mechanisms. Furthermore, some of the hierarchical levels (such as diphone) are essentially inaccessible to text markup due to the inability to achieve the required level of granularity in linear text.
It is also important to note that conventional user database approaches are not able to override speech synthesis data within the normal synthesis sequence of computation. Imagine, for example, that we want to specify a new user-supplied diphone A-B, but only if the requested stress level on A is 2 and certain kinds of allophones are found in the surrounding context of what is to be synthesized. It will be appreciated that these conditions are only known after a complex set of allophone rules is applied (thus determining the allophone stream) and after a prosody module has selected words to de-emphasize, which in turn affects the stress level on a given phoneme. Under conventional approaches, this conditional information cannot practically be known in advance of synthesis. It is therefore virtually impossible to automatically “mark up” the input text at every place where the customized diphone should be used. Simply put, user-defined conditions cannot currently be based on internal states of the synthesis process, and are therefore severely limited under the traditional text markup process.
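To make the diphone example concrete, the condition can be written as a predicate over internal synthesis state. The sketch below is hypothetical: the `DiphoneContext` fields, the allophone labels, and the choice to require at least one matching allophone are assumptions made for illustration. The point is that every input to the predicate exists only after the allophone rules and the prosody module have run, which is why it cannot be expressed as markup on the input text.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class DiphoneContext:
    """Internal state available at one diphone position (hypothetical)."""

    left: str                          # first phoneme of the diphone
    right: str                         # second phoneme of the diphone
    left_stress: int                   # stress level chosen by prosody
    surrounding_allophones: List[str]  # output of the allophone rules


def use_custom_diphone(ctx: DiphoneContext, trigger_allophones: Set[str]) -> bool:
    """Return True when the user-supplied A-B diphone should be used."""
    return (
        ctx.left == "A"
        and ctx.right == "B"
        and ctx.left_stress == 2
        and any(a in trigger_allophones for a in ctx.surrounding_allophones)
    )


# Usage with made-up allophone labels.
triggers = {"t_flap"}
stressed = DiphoneContext("A", "B", 2, ["t_flap", "ah"])
deemphasized = DiphoneContext("A", "B", 1, ["t_flap", "ah"])

stressed_matches = use_custom_diphone(stressed, triggers)
deemphasized_matches = use_custom_diphone(deemphasized, triggers)
```

The second context differs only in the stress level, a value that prosody may change when it de-emphasizes a word, and the custom diphone no longer applies.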
Another concern is that conventional user databases are typically not organized around the same hierarchical levels as the dynamic data structures and therefore provide inflexible control over where and what is modified during the synthesis.
The above and other objectives are provided by a speech synthesizer customization system in accordance with the present invention. The customization system has a template management tool for generating templates based on customization data from a user and replicated dynamic synthesis data from a text-to-speech (TTS) synthesizer. The replicated dynamic synthesis data is arranged in a dynamic data structure having hierarchical levels. The customization system further includes a user database that supplements a standard database of the synthesizer. The tool populates the user database with the templates such that the templates enable the user database to uniformly override subsequently generated speech synthesis data at all hierarchical levels of the dynamic data structure. The use of a tool therefore provides a mechanism for organizing, tuning, and maintaining hierarchical and multi-dimensionally sparse sets of user templates. Furthermore, providing a mechanism for uniformly overriding speech synthesis data reduces processing overhead and provides a more “natural” user database.
Further in accordance with the present invention, a user database is provided. The user database has a plurality of templates for overriding speech synthesis data of a TTS synthesizer. The speech synthesis data is arranged in a dynamic data structure having hierarchical levels. The user database further includes a hierarchical data structure organizing the templates such that the templates enable the user database to uniformly override subsequently generated speech synthesis data at all hierarchical levels of the dynamic data structure.
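One way to picture a user database that overrides the standard database uniformly at every level is a layered lookup that uses the same mechanism regardless of level. The following is a minimal sketch, not the patented implementation: the class `LayeredDatabase`, the level names, and the string payloads are all assumptions.

```python
# Illustrative level names only; the actual hierarchy may differ.
LEVELS = ("phrase", "word", "syllable", "phoneme", "diphone")


class LayeredDatabase:
    """User templates layered over a standard database (hypothetical)."""

    def __init__(self, standard):
        self.standard = standard
        # The user database mirrors the hierarchy of the standard one.
        self.user = {level: {} for level in LEVELS}

    def add_template(self, level, key, data):
        """Store a user template at the given hierarchical level."""
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        self.user[level][key] = data

    def lookup(self, level, key):
        """Prefer a user template; otherwise fall back to standard data."""
        if key in self.user.get(level, {}):
            return self.user[level][key]
        return self.standard[level][key]


# Usage: override one diphone while leaving word data untouched.
standard = {
    "word": {"tomato": "standard pronunciation"},
    "diphone": {"A-B": "standard unit"},
}
db = LayeredDatabase(standard)
db.add_template("diphone", "A-B", "user unit")

diphone_result = db.lookup("diphone", "A-B")
word_result = db.lookup("word", "tomato")
```

Because `lookup` treats every level identically, an override at the diphone level needs no different machinery than one at the word level, which is the uniformity the passage describes.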
In another aspect of the invention, a method for customizing a synthesizer is provided. The method includes the step of generating templates based on customization data from a user and associated replicated dynamic synthesis
Junqua Jean-Claude
Pearson Steve
Veprek Peter
Dorvil Richemond
Harness Dickey & Pierce PLC
Matsushita Electric - Industrial Co., Ltd.