Reexamination Certificate
1998-09-11
2001-07-24
Zele, Krista (Department: 2645)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Synthesis
active
06266637
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to speech splicing and, more particularly, to a system and method for phrase splicing and variable substitution of speech using a synthesizing device.
2. Description of the Related Art
Speech recognition systems are used in many areas today to transcribe speech into text. The success of this technology in simplifying man-machine interaction is stimulating its use in a plurality of applications, such as transcribing dictation, voicemail, home banking, directory assistance, etc. In particularly useful applications, it is often advantageous to provide synthetic speech generation as well.
Synthetic speech generation is typically performed by utterance playback or full text-to-speech (TTS) synthesis. Recorded utterances provide high speech quality and are typically best suited for applications where the number of sentences to be produced is very small and never changes. However, there are limits to the number of utterances which can be recorded. Expanding the range of recorded utterance systems by playing phrase and word recordings to construct sentences is possible, but does not produce fluent speech and can suffer from serious prosodic problems.
Text-to-speech systems may be used to generate arbitrary speech. They are desirable for some applications, for example where the text to be spoken cannot be known in advance, or where there is simply too much text to prerecord everything. However, speech generated by TTS systems tends to be both less intelligible and less natural than human speech.
Therefore, a need exists for a speech generation system which provides the advantages of both recorded utterances and text-to-speech synthesis. A further need exists for a system and method capable of blending pre-recorded speech with synthetic speech.
SUMMARY OF THE INVENTION
In accordance with the present invention, a method for providing generation of speech includes the steps of providing input to be acoustically produced; comparing the input to training data to identify one of words and word sequences corresponding to the input for constructing a phone sequence; comparing the input to a pronunciation dictionary when the input is not found in the training data; identifying a segment sequence using a first search algorithm to construct output speech according to the phone sequence; and concatenating segments of the segment sequence and modifying characteristics of the segments to be substantially equal to requested characteristics.
In other methods, the characteristics may include at least one of duration, energy and pitch. The step of comparing may include the step of searching the training data using a second search algorithm. The second search algorithm may include a greedy algorithm. The first search algorithm preferably includes a dynamic programming algorithm. The method may further include the step of outputting synthetic speech. The method may further include the step of performing, by using the first search algorithm, a search over the segments in decision tree leaves.
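The comparing steps can be illustrated with a minimal sketch. It assumes that the greedy second search algorithm means a left-to-right longest match over known word sequences, with a fallback to a pronunciation dictionary; the summary does not specify this. The tables TRAINING_PHRASES and PRONUNCIATIONS, the phone symbols, and the function name build_phone_sequence are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy training data: recorded word sequences and their phones.
TRAINING_PHRASES: Dict[Tuple[str, ...], List[str]] = {
    ("your", "balance", "is"): ["Y", "AO", "R", "B", "AE", "L", "AH", "N", "S", "IH", "Z"],
    ("your",): ["Y", "AO", "R"],
    ("dollars",): ["D", "AA", "L", "ER", "Z"],
}

# Hypothetical pronunciation dictionary, used only when the training data has no match.
PRONUNCIATIONS: Dict[str, List[str]] = {
    "fifty": ["F", "IH", "F", "T", "IY"],
}


def build_phone_sequence(words: List[str]) -> List[Tuple[str, str, List[str]]]:
    """Greedily cover `words` left to right with the longest known word sequence.

    Returns (matched text, source, phones) triples, where source is
    "training" or "dictionary".
    """
    max_len = max(len(key) for key in TRAINING_PHRASES)
    result: List[Tuple[str, str, List[str]]] = []
    i = 0
    while i < len(words):
        # Try the longest candidate sequence first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in TRAINING_PHRASES:
                result.append((" ".join(key), "training", TRAINING_PHRASES[key]))
                i += n
                break
        else:
            # No training match at this position: fall back to the dictionary.
            word = words[i]
            if word not in PRONUNCIATIONS:
                raise KeyError(f"no pronunciation available for {word!r}")
            result.append((word, "dictionary", PRONUNCIATIONS[word]))
            i += 1
    return result
```

For the input "your balance is fifty dollars", this sketch matches "your balance is" and "dollars" in the training data and takes "fifty" from the dictionary. A greedy match is not guaranteed to find the cover with the fewest pieces; it simply prefers the longest match at each position.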
Another method for providing generation of speech includes the steps of providing input to be acoustically produced; comparing the input to application specific splice files to identify one of words and word sequences corresponding to the input for constructing a phone sequence; augmenting a generic segment inventory by adding segments corresponding to the identified words and word sequences; identifying a segment sequence, using a first search algorithm and the augmented generic segment inventory, to construct output speech according to the phone sequence; and concatenating the segments of the segment sequence and modifying characteristics of the segments of the segment sequence to be substantially equal to requested characteristics.
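The augmenting step can be sketched as a merge of two inventories. The following assumes an inventory keyed by phone label, with each segment tagged by its origin; the Segment fields, the key choice, and the function name augment_inventory are hypothetical, since the summary does not describe the inventory layout.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Segment:
    phone: str          # phone label the segment realizes
    duration_ms: float  # segment duration
    pitch_hz: float     # average pitch of the segment
    source: str         # "generic" or "splice" (hypothetical tag)


def augment_inventory(
    generic: Dict[str, List[Segment]],
    splice_segments: Iterable[Segment],
) -> Dict[str, List[Segment]]:
    """Return a new inventory holding the generic segments plus the splice segments.

    The generic inventory is copied, not modified, so it can be reused for
    other inputs.
    """
    augmented = {phone: list(segments) for phone, segments in generic.items()}
    for segment in splice_segments:
        augmented.setdefault(segment.phone, []).append(segment)
    return augmented
```

Copying rather than mutating the generic inventory is a design choice of this sketch: segments added for one input do not leak into the search for the next one.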
In particularly useful methods, the characteristics may include at least one of duration, energy and pitch. The step of comparing may include the step of searching the application specific inventory using a second search algorithm and a splice file dictionary. The second search algorithm may include a greedy algorithm. The first search algorithm preferably includes a dynamic programming algorithm. The method may further include the step of outputting synthetic speech.
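A dynamic programming search for the segment sequence can be sketched as follows. This is a generic Viterbi-style search, not the patented procedure: it assumes one list of candidate segments per phone position, a target cost equal to the absolute pitch error against the requested pitch, and a join cost equal to the absolute pitch difference between consecutive segments. The cost definitions, the Candidate type, and the name select_segments are all assumptions.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[str, float]  # (hypothetical segment id, pitch in Hz)


def select_segments(
    candidates: Sequence[Sequence[Candidate]],
    requested_pitch: Sequence[float],
    join_weight: float = 1.0,
) -> Tuple[List[str], float]:
    """Pick one candidate per position, minimizing target cost plus join cost."""
    # best[i][j]: lowest total cost of any path ending at candidate j of position i.
    best = [[abs(pitch - requested_pitch[0]) for _, pitch in candidates[0]]]
    back: List[List[int]] = []  # back[i - 1][j]: best predecessor of candidate j at i
    for i in range(1, len(candidates)):
        row, pointers = [], []
        for _, pitch in candidates[i]:
            target_cost = abs(pitch - requested_pitch[i])
            path_costs = [
                best[i - 1][k] + join_weight * abs(pitch - prev_pitch)
                for k, (_, prev_pitch) in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(path_costs)), key=path_costs.__getitem__)
            row.append(path_costs[k_best] + target_cost)
            pointers.append(k_best)
        best.append(row)
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j][0] for i, j in enumerate(path)], total


if __name__ == "__main__":
    lattice = [
        [("a1", 100.0), ("a2", 140.0)],
        [("b1", 150.0), ("b2", 105.0)],
        [("c1", 100.0)],
    ]
    print(select_segments(lattice, [100.0, 100.0, 100.0]))
    # prints (['a1', 'b2', 'c1'], 15.0)
```

Unlike a greedy choice made position by position, the dynamic programming search accounts for the join cost along the whole path, so a segment that matches its own target slightly worse can still be selected if it joins its neighbors more smoothly.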
The step of comparing may include the step of comparing the input to a pronunciation dictionary when the input is not found in the splice files. The method may further include the step of performing, by using the first search algorithm, a search over the segments in decision tree leaves. The step of identifying may include the step of bypassing costing of the characteristics of the segments from a splicing inventory against the requested characteristics. The step of identifying may include the step of applying pitch discontinuity costing across the segment sequence. The method may further include the step of selecting segments from a splicing inventory to provide the requested characteristics. The requested characteristics may include pitch and the method may further include the step of selecting segments from the generic segment inventory to provide the requested pitch characteristics. The method may further include the step of applying pitch discontinuity smoothing to the requested pitch characteristics provided by the selected segments from the generic segment inventory.
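The costing and smoothing options above can be sketched under stated assumptions. The summary gives no formulas, so the following assumes that bypassing means a zero target cost for splice segments, that pitch discontinuity cost is the sum of absolute pitch jumps at the joins, and that smoothing is a three-point moving average over the requested pitch values. The Seg class and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Seg:
    start_pitch: float  # pitch in Hz at the start of the segment
    end_pitch: float    # pitch in Hz at the end of the segment
    from_splice: bool   # True if the segment comes from the splicing inventory


def target_cost(seg: Seg, requested_pitch: float) -> float:
    """Cost of a segment against the requested pitch; bypassed for splice segments."""
    if seg.from_splice:
        return 0.0  # assumption: bypassing is modeled as a zero cost
    mean_pitch = (seg.start_pitch + seg.end_pitch) / 2
    return abs(mean_pitch - requested_pitch)


def pitch_discontinuity_cost(sequence: Sequence[Seg]) -> float:
    """Sum of absolute pitch jumps at each join of the segment sequence."""
    return sum(
        abs(nxt.start_pitch - prev.end_pitch)
        for prev, nxt in zip(sequence, sequence[1:])
    )


def smooth_pitch(targets: Sequence[float]) -> List[float]:
    """Three-point moving average over the pitch targets, endpoints kept fixed."""
    if len(targets) < 3:
        return list(targets)
    inner = [
        (targets[i - 1] + targets[i] + targets[i + 1]) / 3
        for i in range(1, len(targets) - 1)
    ]
    return [targets[0], *inner, targets[-1]]
```

With these definitions, a splice segment is never penalized for deviating from the requested pitch, yet it still contributes to the discontinuity cost at its boundaries, which keeps the joins between spliced and generic segments from being ignored.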
A system for generating synthetic speech, in accordance with the invention, includes means for providing input to be acoustically produced and means for comparing the input to application specific splice files to identify one of words and word sequences corresponding to the input for constructing a phone sequence. Also included are means for augmenting a generic segment inventory by adding segments corresponding to sentences including the identified words and word sequences, and a synthesizer for utilizing a first search algorithm and the augmented generic inventory to identify a segment sequence to construct output speech according to the phone sequence. Means for concatenating segments of the segment sequence and modifying characteristics of the segments of the segment sequence to be substantially equal to requested characteristics are further included.
In alternative embodiments, the generic segment inventory includes pre-recorded speaker data to train a set of decision-tree state-clustered hidden Markov models. The means for comparing may include a second search algorithm, which may include a greedy algorithm and a splice file dictionary. The means for comparing may compare the input to a pronunciation dictionary when the input is not found in the splice files. The first search algorithm may include a dynamic programming algorithm and may perform a search over the segments in decision tree leaves. The means for providing input may include an application specific host system, which may include an information delivery system.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
REFERENCES:
patent: 4692941 (1987-09-01), Jacks et al.
patent: 4882759 (1989-11-01), Bahl et al.
patent: 5202952 (1993-04-01), Gillick et al.
patent: 5333313 (1994-07-01), Heising
patent: 5384893 (1995-01-01), Hutchins
patent: 5502791 (1996-03-01), Nishimura et al.
patent: 5513298 (1996-04-01), Stanford et al.
patent: 5526463 (1996-06-01), Gillick et al.
patent: 5706397 (1998-01-01), Chow
patent: 5839105 (1998-11-01), Ostendorf et al.
patent: 5884261 (1999-03-01), DeSouza et al.
patent: 5937385 (1999-08-01), Zarozny et al.
patent: 5983180 (1999-11-01), Robinson
patent: 6032111 (2000-02-01), Mohri
patent: 6038533 (2000-03-01), Buchsbaum et al.
E-Speech we
Donovan Robert E.
Franz Martin
Roukos Salim E.
Sorensen Jeffrey
F. Chau & Associates LLP
International Business Machines Corporation
Opsasnick Michael N.
Zele Krista