Reexamination Certificate
2000-06-29
2003-01-21
McFadden, Susan (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Synthesis
C704S260000, C704S270000
active
06510413
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to the generation of synthetic speech, and more specifically, to the generation of synthetic speech at remote client devices.
2. Description of Related Art
Speech synthesis, which refers to the artificial generation of speech from written text, is becoming an increasingly important technology for accessing information. Two areas in which speech synthesis looks particularly promising are increasing the availability of information to sight-impaired individuals and enriching the information content of web-based devices that have minimal or no viewing screens.
FIG. 1 is a diagram illustrating a conventional web-based speech synthesis system. Synthesizing text 101 into a digital waveform file 110 is performed by the three sequential steps of text analysis 102, prosodic analysis 103, and speech waveform generation 104.
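The three sequential steps above can be sketched as a chain of functions, each consuming the output of the previous one. This is a minimal illustration only: the function names, the word-level units, the 80 ms-per-letter duration, and the 8 kHz sample rate are all assumptions made for the sketch and are not taken from the patent.

```python
from typing import Dict, List

# Hypothetical stand-ins for stages 102, 103, and 104. Real stages are far
# more elaborate; these only show how data flows from one stage to the next.

def text_analysis(text: str) -> List[str]:
    """Stage 102: reduce text to a simple linguistic representation (lowercase words)."""
    return text.lower().split()

def prosodic_analysis(units: List[str]) -> List[Dict]:
    """Stage 103: attach a placeholder duration (assumed 80 ms per letter) to each unit."""
    return [{"unit": u, "duration_ms": 80 * len(u)} for u in units]

def waveform_generation(spec: List[Dict], sample_rate: int = 8000) -> bytes:
    """Stage 104: emit zero-valued placeholder bytes, one byte per sample."""
    total_ms = sum(item["duration_ms"] for item in spec)
    return bytes(sample_rate * total_ms // 1000)

def synthesize(text: str) -> bytes:
    """Run the three stages in sequence: text -> units -> prosody -> waveform."""
    return waveform_generation(prosodic_analysis(text_analysis(text)))

waveform = synthesize("Hi there")
print(len(waveform))  # number of placeholder samples produced
```

The point of the sketch is the strict ordering: each stage depends only on the output of the stage before it, which is what later makes it possible to place the stages on different machines.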
In textual analysis, text 101 is analyzed into some form of linguistic representation. The analyzed text is next decomposed into sounds, more generally described as acoustic units. Most of the acoustic units for languages like English are obtained from a pronunciation dictionary. Other acoustic units corresponding to words not in the dictionary are generated by letter-to-sound rules for each language. The symbols representing acoustic units produced by the dictionary and letter-to-sound rules typically correspond to phonemes or syllables in a particular language.
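The dictionary lookup with a letter-to-sound fallback described above can be sketched as follows. The two dictionary entries, the single-letter rules, and the phoneme symbols are illustrative assumptions; a real system would use a full pronunciation dictionary and context-sensitive rules.

```python
from typing import Dict, List

# Toy pronunciation dictionary (assumed entries, ARPAbet-like symbols).
PRONUNCIATION_DICT: Dict[str, List[str]] = {
    "speech": ["S", "P", "IY", "CH"],
    "text": ["T", "EH", "K", "S", "T"],
}

# Toy letter-to-sound rules: one symbol per letter, no context (an assumption).
LETTER_TO_SOUND: Dict[str, str] = {
    "a": "AE", "b": "B", "d": "D", "e": "EH", "g": "G",
    "i": "IH", "k": "K", "n": "N", "o": "AA", "s": "S", "t": "T",
}

def to_acoustic_units(word: str) -> List[str]:
    """Return acoustic units for a word: dictionary first, rules as fallback."""
    key = word.lower()
    if key in PRONUNCIATION_DICT:
        return list(PRONUNCIATION_DICT[key])
    # Word is not in the dictionary: apply the rules letter by letter,
    # skipping letters the toy rule set does not cover.
    return [LETTER_TO_SOUND[ch] for ch in key if ch in LETTER_TO_SOUND]

print(to_acoustic_units("Speech"))  # found in the dictionary
print(to_acoustic_units("bat"))     # generated by the fallback rules
```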
Prosodic analysis 103 includes the identification of points within sentences that require changes in the intonation or pitch contour (up, down, flattening) and the defining of durations for certain syllables. The pitch contour may be further refined by segmenting the current sentence into intonational phrases. Intonational phrases are sections of speech characterized by a distinctive pitch contour, which usually declines at the end of each phrase.
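As a rough illustration of the prosodic step, the sketch below splits a sentence into intonational phrases at punctuation and assigns each phrase a pitch contour that declines toward its end. Splitting at punctuation, the 140 Hz to 100 Hz range, the linear decline, and the longer duration on the phrase-final word are all assumptions chosen for the example, not details from the patent.

```python
import re
from typing import Dict, List

def intonational_phrases(sentence: str) -> List[List[str]]:
    """Split a sentence into phrases at punctuation (a simplifying assumption)."""
    parts = re.split(r"[,;:.!?]+", sentence)
    return [p.split() for p in parts if p.strip()]

def declining_contour(n_words: int, start_hz: float = 140.0,
                      end_hz: float = 100.0) -> List[float]:
    """Return a pitch value per word that falls linearly across the phrase."""
    if n_words == 1:
        return [start_hz]
    step = (start_hz - end_hz) / (n_words - 1)
    return [start_hz - i * step for i in range(n_words)]

def annotate(sentence: str) -> List[Dict]:
    """Attach a pitch and a duration to every word of the sentence."""
    annotated = []
    for phrase in intonational_phrases(sentence):
        pitches = declining_contour(len(phrase))
        for i, (word, hz) in enumerate(zip(phrase, pitches)):
            is_last = i == len(phrase) - 1
            # Assumed durations: the phrase-final word is held longer.
            annotated.append({"word": word, "pitch_hz": hz,
                              "duration_ms": 300 if is_last else 200})
    return annotated

for item in annotate("When it rains, we stay."):
    print(item)
```

The pitch resets at the start of the second phrase, which is the effect of segmenting the sentence before computing the contour.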
The speech waveform generation section 104 receives the acoustic sequence specification of the original sentence from the prosodic analysis section 103, and generates a human-sounding digital audio waveform (waveform file 110). The speech waveform generation section 104 may generate an audible signal by employing a model of the vocal tract to produce a base waveform that is modulated according to the acoustic sequence specification to produce a digital audio waveform file. Another known method of generating an audible signal is through the concatenation of small portions of pre-recorded digital audio. These digital audio units are typically obtained by recording utterances from a human speaker. The series of concatenated units is then modulated according to the parameters of the acoustic sequence specification to produce an output digital audio waveform file. In most cases, the concatenated digital audio units will have a one-to-one correspondence to the acoustic units in the acoustic sequence specification. The resulting digital audio waveform file 110 may be rendered into audio by converting it into an analog signal, and then transmitting the analog signal to a speaker.
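The concatenation method described above can be sketched as a lookup of one stored audio unit per acoustic unit, followed by a simple modulation. In this sketch the "recorded" units are synthetic sine tones standing in for human recordings, and the modulation is reduced to a per-unit gain; the unit names, frequencies, sample rate, and the `gain` field are all assumptions made for illustration.

```python
import math
import struct
from typing import Dict, List

SAMPLE_RATE = 8000  # assumed sample rate in Hz

def tone(freq_hz: float, ms: int) -> List[float]:
    """Generate a sine tone; a stand-in for a pre-recorded digital audio unit."""
    n = SAMPLE_RATE * ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

# Hypothetical unit inventory: one stored waveform per acoustic unit.
UNIT_INVENTORY: Dict[str, List[float]] = {
    "S": tone(3000, 50),
    "IY": tone(300, 100),
    "CH": tone(2500, 60),
}

def concatenate(spec: List[Dict]) -> List[float]:
    """Join one stored unit per entry in the specification, scaled by its gain."""
    samples: List[float] = []
    for item in spec:
        unit = UNIT_INVENTORY[item["unit"]]  # one-to-one correspondence
        samples.extend(item["gain"] * s for s in unit)
    return samples

def to_pcm16(samples: List[float]) -> bytes:
    """Convert samples in [-1, 1] to little-endian 16-bit PCM bytes."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return b"".join(struct.pack("<h", int(s * 32767)) for s in clipped)

spec = [{"unit": "S", "gain": 1.0}, {"unit": "IY", "gain": 0.5}]
pcm = to_pcm16(concatenate(spec))
print(len(pcm))  # two bytes per sample
```

A fuller modulation step would also adjust pitch and duration according to the acoustic sequence specification; gain alone is used here to keep the example short.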
In the context of a web-based application, text 101 may be specifically designated by a web-page designer as text that viewers of the web site can hear as well as read. There are several methods that may be utilized to prepare a portion of web text for rendering into speech in the form of a digital audio waveform. A human speaker may read text aloud into a collection of digital audio recordings. A remote client can then download and listen to the digital audio files corresponding to selected portions of the text. In another approach, a web-page author may elect to perform the steps of text analysis 102, prosodic analysis 103, and speech waveform generation 104 for each portion of text, producing a collection of digital audio files that could be stored on the web server and then transferred on request to the remote client.
An advantage of the above techniques is that rendering the binary speech waveform file 110 into audio at the client is a simple process that requires very few client resources. The digital audio files can be rendered into audio on web-access devices possessing minimal amounts of computer memory and little if any computational power. A disadvantage, however, is that digital audio files corresponding to speech waveforms 110 tend to be large files that require a lot of network bandwidth. This can be particularly problematic for clients connected to network 115 using a relatively slow connection such as a dial-up modem or a wireless cell-modem connection.
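The bandwidth cost can be made concrete with a back-of-the-envelope calculation. The figures below are assumptions chosen for illustration (uncompressed 16-bit mono audio at 8 kHz, a 56 kbit/s dial-up link, no protocol overhead, one byte per character of text); actual sizes depend on the audio encoding and the network.

```python
def pcm_bytes(seconds: float, sample_rate: int = 8000,
              bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size of uncompressed PCM audio for the given duration."""
    return int(seconds * sample_rate * bytes_per_sample * channels)

def transfer_seconds(n_bytes: int, link_bits_per_second: int) -> float:
    """Ideal transfer time over a link, ignoring overhead and latency."""
    return n_bytes * 8 / link_bits_per_second

MODEM_BPS = 56_000  # assumed dial-up link speed

audio_size = pcm_bytes(10)  # ten seconds of speech
text_size = 150             # assumed: about 150 characters for the same sentence

print(audio_size, round(transfer_seconds(audio_size, MODEM_BPS), 2))
print(text_size, round(transfer_seconds(text_size, MODEM_BPS), 4))
```

Under these assumptions, ten seconds of audio takes longer to download than it takes to play, while the corresponding text transfers in a small fraction of a second.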
Another conventional speech synthesis technique for generating synthesized speech at a client computer is implemented using a process similar to that shown in FIG. 1, with the exception that text analysis section 102, prosodic analysis section 103, and speech waveform generation section 104 are all located locally at the client. In operation, text 101 is transmitted over the network to the client, and all the speech synthesis steps are then performed locally. A problem associated with this method of speech synthesis is that it can be computationally burdensome to the client. Additionally, programs for performing textual analysis, prosodic analysis, and speech waveform generation may be large programs containing extensive look-up dictionaries. Such programs are not suitable for web-terminals or for small portable browsers such as those incorporated into cellular phones or personal digital assistant (PDA) devices.
Accordingly, there is a need in the art to efficiently deliver and synthesize speech at client devices, especially when the client devices have limited processing ability and low-bandwidth connections.
Intel Corporation
Pillsbury & Winthrop LLP