Text selection and recording by feedback and adaptation for...

Data processing: speech signal processing – linguistics – language – Speech signal processing – Synthesis

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

Reexamination Certificate

active

06792407

ABSTRACT:

BACKGROUND AND SUMMARY OF THE INVENTION
The present invention relates generally to text-to-speech synthesis. More particularly, the invention relates to a method for personalizing a synthesizer and for developing a database of speech units for use by a text-to-speech synthesizer.
Text-to-speech synthesis systems convert an input string of text into synthesized speech using speech modeling parameters or digitally sampled concatenative sound units to generate data strings that are played back through an audio system to mimic the sound of human speech. The model parameters or concatenative units are usually developed or trained in advance using recordings of actual human speech as the starting point. The model parameters or concatenative units, however, allow a very limited mimic of the sound of human speech based on the training which typically utilizes recordings from one individual.
Developing a sufficiently rich body of spoken text can be very time-consuming and expensive. Examples of actual human speech need to be recorded and labeled; and the resulting set of recordings needs to include at least one instance of every speech unit type needed for synthesis of all attested phoneme strings in the target language. This means, for example, that in a diphone synthesizer, the database must contain recorded examples of every allowed sequence of two allophones. Because data collection and analysis involves significant labor, it is desirable to minimize the size of the database. Ideally this means that one wants to collect the smallest set of utterances containing the desired material. However, in planning the recording sessions it is also necessary to consider other factors. Many unit types may contain different pronunciations, based on phonemes adjacent to the ones they contain. If the resulting synthesizer is to reproduce these effects, then all such variants must be attested.
For example, in the English language the diphone sequence /kae/ is pronounced differently in “cat” than in “can”, due to the nasalizing effects of the following
/ in the latter word. A high quality synthesizer must contain examples of both types of /kae/.
In addition to variations due to adjacent phonemes, other variations may be attributed to syllable boundaries and word boundaries. Moreover, some contexts may simply produce better sound units than others. For example, sound units taken from secondary stressed syllables can be used to synthesize both secondary and primary stressed syllables. The converse is not necessarily true. Thus sound units taken from context which have primary stress in the original utterance may only be useable for synthesizing syllables which also have primary stress. Finally, synthesis developers may find that certain types of utterances produce better sound units than others. For example, when human speakers read simple words in isolation, the recordings often do not produce good sound units for synthesis. Similarly, very long sentences may also be problematic. Therefore complex words and short phrases are preferred.
The task of assembling a collection of suitable text words and phrases for use in a synthesis database recording session has heretofore been daunting, to say the least. Most developers will compile a collection of sentences and words for the preselected speakers to read and this collection is usually quite a bit larger than would actually be needed if one analyzed the text requirements in a systematic way. The result of collecting suitable text words and phrases based on preselected speakers is a limited ability to produce the synthesized speech. Although the synthesized speech mimics the sound of human speech, the range of qualities of the sound is limited to a great extent depending on the speakers. Most synthesis system designers have approached the problem more as an art than as a science and that yields a limited ability to produce mimicked speech personalized to sound similar to a particular human.
The present invention seeks to formalize the development of recorded content for text-to-speech synthesis through a set of procedures which, if followed, produce a minimal recording text list which contains all necessary unit types for a given language, with all desired variants of each, from optimal contexts in optimal types of utterances. The invention further seeks to personalize the synthesized speech to more closely mimic a particular speaker based on the minimal recording text list.
The personalizer represents one important aspect of the invention in which an original set of recorded sound units, stored as allophones, diphones and/or triphones (generally referred to here as snippets) in a database, are compared with the sound units of a new speaker or target speaker. In a preferred embodiment, allophones from different contexts are compared with allophones from the original set of recorded sound units. This is done by acoustic alignment of the respective allophones, followed by a closeness comparison. The closeness comparison may be performed using the same components as are used for automatic speech recognition.
When the comparison is performed, some allophones from the recorded set and from the new speaker will be sufficiently close, acoustically, so that no modification of those allophones is required. However, other allophones may differ substantially between the originally recorded set and the new target speaker. The personalizer employs a threshold comparison system to separate the allophones that are acoustically close from those that are not. The personalizer then focuses on the allophones that are not acoustically close. These “far” allophones will be altered to make the synthesizer sound more like the target speaker.
The set of “far” allophones can be compared against a source of text using an exhaustive search algorithm, to identify all passages of text that contain representative examples of the “far” allophones. However, the presently preferred embodiment uses a greedy selection algorithm to identify passages of text that best represent the “far” allophones. The greedy selection algorithm thus generates a customized training text which the target speaker then reads while the system captures examples of that speaker's “far” allophones. Once examples of the “far” allophones have been collected, they are substituted for those of the original set, or are otherwise used to transform the sound units used by the synthesizer, so that the synthesizer will now sound like the target speaker.
The target speaker utters each allophone in a given context, such as a neutral context (e.g. the vowel surrounded by letters ‘t’ or ‘s’). Using knowledge of the target speaker's allophones in this given context, the system determines which allophones are “far” from those of the synthesizer. While it is possible to simply substitute these known “far” allophones for those of the synthesizer, there typically will remain many other contexts of that allophone for which the system has no uttered data from the target speaker. Therefore, to develop a richer representation of the target speaker's allophones, the system determines what additional contexts or environments are needed to develop a complete assessment of the allophone in question and generates additional text for the target speaker to read. The generated text is specifically designed using the greedy algorithm to optimally obtain examples of the allophones in question from other contexts. In this way the “far” allophones may be pulled closer to those of the target speaker across all contexts.
The additional contexts are selected by rules designed to group or cluster contexts into related classes. In designing the system, related classes of contexts are determined by analyzing the data from the original synthesizer and then making the assumption that all speakers (including the target speaker) would have the same classes. For example, the data may show that the letter ‘a’ in the context of adjacent fricatives will all behave in acoustically the same way and would thus be clustered together. To do this a

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Text selection and recording by feedback and adaptation for... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Text selection and recording by feedback and adaptation for..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Text selection and recording by feedback and adaptation for... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3204494

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.