Client/server architecture for text-to-speech synthesis

Data processing: speech signal processing – linguistics – language – Speech signal processing – Synthesis

Reexamination Certificate

Details

C704S258000, C704S261000

active

06810379

ABSTRACT:

FIELD OF THE INVENTION
This invention relates generally to text-to-speech synthesis. More particularly, it relates to a client/server architecture for very high quality and efficient text-to-speech synthesis.
BACKGROUND ART
Text-to-speech (TTS) synthesis systems are useful in a wide variety of applications such as automated information services, auto-attendants, avatars, computer-based instruction, and computer systems for the vision impaired. An ideal system converts a piece of text into high-quality, natural-sounding speech in near real time. Producing high-quality speech requires a large number of potential acoustic units and complex rules and exceptions for combining the units, i.e., large storage capability and high computational power. A prior art text-to-speech system 10 is shown schematically in FIG. 1. An original piece of text is converted to speech by a number of processing modules. The input text specification usually contains punctuation, abbreviations, acronyms, and non-word symbols. A text normalization unit 12 converts the input text to a normalized text containing a sequence of non-abbreviated words only. Most punctuation is useful in suggesting appropriate prosody, and so the text normalization unit 12 filters out punctuation to be used as input to a prosody generation unit 16. Other punctuation is extraneous and filtered out completely. Abbreviations and acronyms are converted to their equivalent word sequences, which may or may not depend on context. The most complex task of the text normalization unit 12 is to convert symbols to word sequences. For example, numbers, currency amounts, dates, times, and email addresses are detected, classified, and then converted to text that depends on the symbol's position in the sentence. The normalized text is sent to a pronunciation unit 14 that first analyzes each word to determine its simplest morphological representation. This is trivial in English, but in a language in which words are strung together (e.g., German), words must be divided into base words and prefixes and suffixes. The resulting words are then converted to a phoneme sequence or its pronunciation. The pronunciation may depend on a word's position in a sentence or its context (i.e., the surrounding words). Three resources are used by the pronunciation unit 14 to perform conversion: letter-to-sound rules; statistical representations that convert letter sequences to most probable phoneme sequences based on language statistics; and dictionaries, which are simple word/pronunciation pairs. Conversion can be performed without statistical representations, but all three resources are preferably used. Rules can distinguish between different pronunciations of the same word depending on its context. Other rules are used to predict pronunciations of unseen letter combinations based on human knowledge. Dictionaries contain exceptions that cannot be generated from rules or statistical methods. The collection of rules, statistical models, and dictionary forms the database needed for the pronunciation unit 14. This database is usually quite large in size, particularly for high-quality text-to-speech conversion.
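The normalization and pronunciation stages described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the tables (ABBREVIATIONS, DIGITS, DICTIONARY, LETTER_TO_SOUND), the function names, and the ARPAbet-style phoneme symbols are all hypothetical toy stand-ins for the large databases the text describes, the statistical model is omitted, and numbers are simply read digit by digit regardless of context.

```python
import re

# Hypothetical toy resources; a real system would use far larger databases.
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
PROSODIC_MARKS = ",.;:!?"  # punctuation kept as a hint for prosody generation
DICTIONARY = {"doctor": ["D", "AA", "K", "T", "ER"], "two": ["T", "UW"]}
# Naive one-phoneme-per-letter fallback standing in for letter-to-sound rules.
LETTER_TO_SOUND = dict(zip(
    "abcdefghijklmnopqrstuvwxyz",
    "AE B K D EH F G HH IH JH K L M N AA P K R S T AH V W K Y Z".split()))


def normalize(text):
    """Return (words, punctuation): non-abbreviated words plus the
    prosody-relevant marks, each tagged with the word count before it."""
    words, punctuation = [], []
    for token in text.lower().split():
        if token in ABBREVIATIONS:          # expand before touching the "."
            words.append(ABBREVIATIONS[token])
            continue
        mark = ""
        if token[-1] in PROSODIC_MARKS:     # peel off trailing punctuation
            token, mark = token[:-1], token[-1]
        if re.fullmatch(r"[0-9]+", token):  # symbol -> word sequence
            words.extend(DIGITS[int(d)] for d in token)
        else:
            cleaned = re.sub(r"[^a-z']", "", token)  # drop extraneous symbols
            if cleaned:
                words.append(cleaned)
        if mark:
            punctuation.append((len(words), mark))
    return words, punctuation


def pronounce(word):
    """Dictionary lookup first; fall back to the letter-to-sound table."""
    if word in DICTIONARY:
        return DICTIONARY[word]
    return [LETTER_TO_SOUND[ch] for ch in word if ch in LETTER_TO_SOUND]


words, marks = normalize("Dr. Smith has 42 cats, really!")
phonemes = [pronounce(w) for w in words]
```

Here the dictionary takes precedence because, as the text notes, it holds the exceptions that rules cannot generate; the fallback table only covers words the dictionary has never seen.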
The resulting phonemes are sent to the prosody generation unit 16, along with punctuation extracted from the text normalization unit 12. The prosody generation unit 16 produces the timing and pitch information needed for speech synthesis from sentence structure, punctuation, specific words, and surrounding sentences of the text. In the simplest case, pitch begins at one level and decreases toward the end of a sentence. The pitch contour can also be varied around this mean trajectory. Dates, times, and currencies are examples of parts of a sentence that are identified as special pieces; the pitch of each is determined from a rule set or statistical model that is crafted for that type of information. For example, the final number in a number sequence is almost always at a lower pitch than the preceding numbers. The rhythms, or phoneme durations, of a date and a phone number are typically different from each other. Usually a rule set or statistical model determines the phoneme durations based on the actual word, its part of the sentence, and the surrounding sentences. These rule sets or statistical models form the database needed for this module; for the more natural sounding synthesizers, this database is also quite large.
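The "simplest case" of prosody generation mentioned above, a pitch that starts at one level and falls toward the end of the sentence, might be sketched as follows. The function names, the start and end pitch values, and the per-segment durations in DURATION_MS are placeholder assumptions chosen for illustration; a real rule set or statistical model would derive them from the words and their context.

```python
def declining_pitch(n_phonemes, start_hz=120.0, end_hz=90.0):
    """Linearly interpolate pitch from start_hz down to end_hz
    across n_phonemes (assumed default values, not from the source)."""
    if n_phonemes <= 0:
        return []
    if n_phonemes == 1:
        return [start_hz]
    step = (end_hz - start_hz) / (n_phonemes - 1)
    return [start_hz + i * step for i in range(n_phonemes)]


# Hypothetical per-phoneme durations in milliseconds, keyed by the type of
# sentence segment, reflecting that dates and phone numbers differ in rhythm.
DURATION_MS = {"default": 80, "date": 90, "phone_number": 110}


def prosody(phonemes, segment_type="default"):
    """Attach a pitch (Hz) and a duration (ms) to every phoneme."""
    pitches = declining_pitch(len(phonemes))
    duration = DURATION_MS.get(segment_type, DURATION_MS["default"])
    return [(p, f0, duration) for p, f0 in zip(phonemes, pitches)]


contour = declining_pitch(4)
tagged = prosody(["T", "UW"], "phone_number")
```

Variation around the mean trajectory, and rules such as lowering the final number in a sequence, would be layered on top of this baseline.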
The final unit, an acoustic signal synthesis unit 18, combines the pitch, duration and phoneme information from the pronunciation unit 14 and the prosody generation unit 16 to produce the actual acoustic signal. There are two dominant methods in state-of-the-art speech synthesizers. The first is formant synthesis, in which a human vocal tract is modeled and phonemes are synthesized by producing the necessary formants. Formant synthesizers are very small, but the acoustic quality is insufficient for most applications. The more widely used high-quality synthesis technique is concatenative synthesis, in which a voice artist is recorded to produce a database of sub-phonetic, phonetic, and larger multi-phonetic units. Concatenative synthesis is a two-step process: deciding which sequence of units to use, and concatenating them in such a way that duration and pitch are modified to obtain the desired prosody. The quality of such a system is usually proportional to the size of the phonetic unit database.
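The two-step structure of concatenative synthesis can be shown with a small sketch. It makes several simplifying assumptions that are not in the source: the unit database UNIT_DB is a hypothetical toy table of sample lists, selection is greedy longest-match rather than a search over candidate sequences, duration is changed by nearest-neighbour resampling, and pitch modification is left out entirely.

```python
# Hypothetical unit database: phoneme tuples mapped to recorded samples.
UNIT_DB = {
    ("k", "ae"): [0.1, 0.3, 0.2, 0.0],  # multi-phonetic unit
    ("k",): [0.1, 0.2],
    ("ae",): [0.3, 0.1],
    ("t",): [0.0, -0.1],
}


def select_units(phonemes, db, max_len=2):
    """Step 1: choose units, preferring the longest recorded unit
    that matches at each position (greedy, for illustration only)."""
    units, i = [], 0
    while i < len(phonemes):
        for n in range(min(max_len, len(phonemes) - i), 0, -1):
            key = tuple(phonemes[i:i + n])
            if key in db:
                units.append(key)
                i += n
                break
        else:
            raise KeyError(f"no unit covers phoneme {phonemes[i]!r}")
    return units


def stretch(samples, target_len):
    """Nearest-neighbour resampling to target_len samples."""
    return [samples[min(len(samples) - 1, j * len(samples) // target_len)]
            for j in range(target_len)]


def concatenate(units, db, target_lens):
    """Step 2: join the units, adjusting each to its target duration."""
    signal = []
    for key, n in zip(units, target_lens):
        signal.extend(stretch(db[key], n))
    return signal


units = select_units(["k", "ae", "t"], UNIT_DB)
signal = concatenate(units, UNIT_DB, [4, 4])
```

Preferring the longer ("k", "ae") unit over two single-phoneme units reduces the number of joins, which is one reason a larger unit database tends to give better quality.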
A high-quality text-to-speech synthesis system thus requires large pronunciation, prosody, and phonetic unit databases. While it is certainly possible to create and efficiently search such large databases, it is much less feasible for a single user to own and maintain them. One solution is to provide a text-to-speech system at a server machine that is available to a number of client machines over a computer network. For example, a client provides the system with a piece of text, and the server transmits the converted speech signal back to that client. Standard speech coders can be used to decrease the amount of data transmitted to the client.
One problem with such a system is that the quality of speech eventually produced at the client depends on the amount of data transmitted from the server. Unless an unusually high-bandwidth connection is available between the server and the client, an unacceptably long delay is required to receive the data needed to produce high-quality sound at the client. For typical client applications, the amount of data transmitted must be reduced so that the communication traffic is at an acceptable level. This data reduction is necessarily accompanied by approximations and loss of speech quality. The client/server connection is therefore the limiting factor in determining speech quality, and the high-quality speech synthesis at the server is not fully exploited.
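The tradeoff between transmitted data and delay can be made concrete with some rough arithmetic. The figures below are illustrative assumptions, not values from the source: uncompressed 16-bit audio sampled at 16 kHz (16,000 x 16 = 256,000 bits per second), a hypothetical 8,000 bit/s speech coder, and a 56,000 bit/s client link.

```python
def transfer_seconds(speech_seconds, audio_bps, link_bps):
    """Time needed to send speech_seconds of audio encoded at audio_bps
    over a link that carries link_bps (protocol overhead ignored)."""
    return speech_seconds * audio_bps / link_bps


PCM_BPS = 16_000 * 16   # assumed uncompressed format: 256,000 bit/s
CODED_BPS = 8_000       # assumed low-rate speech coder
LINK_BPS = 56_000       # assumed client connection

raw_delay = transfer_seconds(10, PCM_BPS, LINK_BPS)      # about 45.7 s
coded_delay = transfer_seconds(10, CODED_BPS, LINK_BPS)  # about 1.4 s
```

Under these assumptions, ten seconds of uncompressed speech takes far longer to send than to play, while the coded version arrives quickly but at the reduced quality the passage describes.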
U.S. Pat. No. 5,940,796, issued to Matsumoto, provides a speech synthesis client/server system. A voice synthesizing server generates a voice waveform based on data sent from the client, encodes the waveform, and sends it to the client. The client then receives the encoded waveform, decodes it, and outputs it as voice. There are a number of problems with the Matsumoto system. First, it uses signal synthesis methods such as formant synthesis, in which a human vocal tract is modeled according to particular parameters. The acoustic quality of formant synthesizers is insufficient for most applications. Second, the Matsumoto system uses standard speech compression algorithms for compressing the generated waveforms. While these algorithms do reduce the data rate, they still suffer the quality/speed tradeoff mentioned above for standard speech coders. Generic speech coders are designed for the transmission of unknown speech, resulting in adequate acoustic quality and graceful degradation in the presence of transmission noise. The design criteria are somewhat different for a text-to-speech system in which a pleasant sounding voice (i.e., higher than adequate acoustic quality) is desired, the speech is known beforehand,
