Text-based speech synthesis method containing synthetic...

Classification: Data processing – speech signal processing – linguistics – language – Application


Details

U.S. Classes: C704S260000, C704S270000

Type: Reexamination Certificate

Status: active

Patent number: 06546369


FIELD OF THE INVENTION
The invention relates to the improvement of voice-controlled systems with text-based speech synthesis, and in particular to the improvement of the synthetic reproduction of a stored train of characters whose pronunciation is subject to certain peculiarities.
BACKGROUND OF THE INVENTION
The use of speech to operate technical devices is becoming increasingly important. This applies to data and command input as well as to message output. Systems that use acoustic signals in the form of speech to facilitate communication between users and machines in both directions are called voice response systems. The utterances output by such systems can be prerecorded natural speech or synthetically created speech, the latter being the subject of the invention described in this document. Devices are also known in which such utterances are combinations of synthetic and prerecorded natural speech.
To provide a better understanding of the invention, a few general explanations and definitions relating to speech synthesis are given in the following.
The object of speech synthesis is the machine transformation of the symbolic representation of an utterance into an acoustic signal that is sufficiently similar to human speech that it will be recognized as such by a human.
Systems used in the field of speech synthesis are divided into two categories:
1) A speech synthesis system produces spoken language based on a given text.
2) A speech synthesizer produces speech based on certain control parameters.
The speech synthesizer therefore represents the last stage of a speech synthesis system.
A speech synthesis technique is a technique by which a speech synthesizer can be built. Examples of speech synthesis techniques are direct synthesis, synthesis using a model, and simulation of the vocal tract.
In direct synthesis, parts of the speech signal are combined to produce the corresponding words, either on the basis of stored signals (e.g. one signal stored per phoneme) or by simulating, through the energy of a signal in certain frequency ranges, the transfer function of the vocal tract used by humans to create speech. In this manner, voiced sounds are represented by quasi-periodic excitation at a certain frequency.
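By way of illustration, the following minimal Python sketch shows the per-phoneme concatenation idea; the phoneme inventory, the stand-in waveforms, and the sample rate are hypothetical and not taken from the patent:

    import numpy as np

    SAMPLE_RATE = 16000

    def tone(freq_hz, dur_s=0.1):
        # Stand-in "stored signal" for a phoneme: a short sine burst.
        t = np.arange(int(SAMPLE_RATE * dur_s)) / SAMPLE_RATE
        return 0.3 * np.sin(2 * np.pi * freq_hz * t)

    # Hypothetical phoneme-to-waveform table; a real system stores recordings.
    PHONEME_TABLE = {"f": tone(220), "i": tone(330), "sh": tone(440)}

    def synthesize(phonemes):
        # Direct synthesis: concatenate the stored signal of each phoneme.
        return np.concatenate([PHONEME_TABLE[p] for p in phonemes])

    signal = synthesize(["f", "i", "sh"])  # crude rendering of "fish"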
The term ‘phoneme’ mentioned above denotes the smallest unit of language that can be used to differentiate meanings but that does not have any meaning itself. Two words with different meanings that differ in only a single phoneme (e.g. fish/wish, woods/wads) form a minimal pair. The number of phonemes in a language is relatively small (between 20 and 60); the German language uses about 45 phonemes.
To take the characteristic transitions between phonemes into account, diphones are usually used in direct speech synthesis. Simply stated, a diphone can be defined as the span from the invariable part of the first phoneme to the invariable part of the second phoneme.
Phonemes and sequences of phonemes are written using the International Phonetic Alphabet (IPA). The conversion of a piece of text to a series of characters belonging to the phonetic alphabet is called phonetic transcription.
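A minimal sketch of such a transcription step and the construction of diphones from it, using a hypothetical mini-lexicon with SAMPA-like symbols in place of a full pronunciation dictionary:

    # Hypothetical mini-lexicon mapping words to phoneme sequences.
    LEXICON = {"fish": ["f", "I", "S"], "wish": ["w", "I", "S"]}

    def transcribe(word):
        # Phonetic transcription: written word -> phoneme sequence.
        return LEXICON[word]

    def diphones(phonemes):
        # Each pair of adjacent phonemes spans the transition between them.
        return list(zip(phonemes, phonemes[1:]))

    print(diphones(transcribe("fish")))  # [('f', 'I'), ('I', 'S')]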
In synthesis using a model, a production model is created that is usually based on minimizing the difference between a digitized human speech signal (the original signal) and a predicted signal.
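As an illustration of this minimization idea, the following sketch fits a linear predictor whose output approximates the original signal in the least-squares sense (the principle underlying linear predictive coding); the toy signal and the model order are assumptions:

    import numpy as np

    def lpc(signal, order=8):
        # Least-squares linear prediction coefficients via the
        # autocorrelation method (normal equations solved directly).
        n = len(signal)
        r = np.array([signal[:n - k] @ signal[k:] for k in range(order + 1)])
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        return np.linalg.solve(R, r[1:])

    def predict(signal, coeffs):
        # Predict each sample from the preceding `order` samples.
        order = len(coeffs)
        pred = np.zeros_like(signal)
        for t in range(order, len(signal)):
            pred[t] = coeffs @ signal[t - order:t][::-1]
        return pred

    rng = np.random.default_rng(0)
    x = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 400)) + 0.01 * rng.normal(size=400)
    residual = x - predict(x, lpc(x))  # small difference to the original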
The simulation of the vocal tract is another method. Here, the form and position of each organ used to articulate speech (tongue, jaw, lips) are modeled. To do this, a mathematical model of the airflow characteristics in the vocal tract defined in this manner is created, and the speech signal is calculated using this model.
Short explanations of other terms and methods used in conjunction with speech synthesis will be given in the following.
The phonemes or diphones used in direct synthesis must first be obtained by segmenting the natural language. There are two approaches used to accomplish this:
In implicit segmentation only the information contained in the speech signal itself is used for segmentation purposes.
Explicit segmentation, on the other hand, uses additional information such as the number of phonemes in the utterance.
To segment an utterance, features must first be extracted from the speech signal. These features serve as the basis for differentiating between segments and are then classified.
Possible methods for extracting features include spectral analysis, filter bank analysis, and the linear prediction method.
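A minimal sketch of filter bank analysis, assuming an 8 kHz sample rate, 20 ms frames, and three hypothetical frequency bands:

    import numpy as np

    SAMPLE_RATE = 8000
    FRAME = 160                                      # 20 ms at 8 kHz
    BANDS = [(0, 500), (500, 1500), (1500, 4000)]    # Hz, illustrative edges

    def filterbank_features(signal):
        # One feature vector per frame: spectral energy in each band.
        feats = []
        for start in range(0, len(signal) - FRAME + 1, FRAME):
            frame = signal[start:start + FRAME] * np.hanning(FRAME)
            spec = np.abs(np.fft.rfft(frame)) ** 2
            freqs = np.fft.rfftfreq(FRAME, d=1.0 / SAMPLE_RATE)
            feats.append([spec[(freqs >= lo) & (freqs < hi)].sum()
                          for lo, hi in BANDS])
        return np.array(feats)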
The features can then be classified using, for example, hidden Markov models, artificial neural networks, or dynamic time warping (a method for non-linearly aligning sequences in time).
The hidden Markov model (HMM) is a two-stage stochastic process. It consists of a Markov chain, usually with a small number of states, to which probabilities or probability densities are assigned. Only the speech signals and/or their parameters, described by these probability densities, can be observed; the underlying states themselves remain hidden. Owing to their high performance and robustness, and because they are easy to train, HMMs have become the most widely used models in speech recognition.
The Viterbi algorithm can be used to determine how well an observed utterance matches each of several HMMs.
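A minimal sketch of the Viterbi recursion for a small discrete HMM; the score of the best state path can then be compared across several models. All probabilities here are illustrative:

    import numpy as np

    def viterbi_score(obs, start_p, trans_p, emit_p):
        # delta[i]: probability of the best state path ending in state i.
        delta = start_p * emit_p[:, obs[0]]
        for o in obs[1:]:
            delta = (delta[:, None] * trans_p).max(axis=0) * emit_p[:, o]
        return delta.max()  # probability of the single best path

    start = np.array([0.6, 0.4])                         # initial state probs
    trans = np.array([[0.7, 0.3], [0.4, 0.6]])           # state transitions
    emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])  # symbol emissions
    score = viterbi_score([0, 1, 2], start, trans, emit)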
More recent approaches use multiple self-organizing feature maps (Kohonen maps). This special type of artificial neural network is able to simulate processes carried out in the human brain.
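A minimal sketch of the self-organizing map idea with a hypothetical one-dimensional grid of weight vectors; grid size, learning rate, and neighborhood radius are illustrative:

    import numpy as np

    def train_som(weights, samples, epochs=20, lr=0.3, radius=2):
        # Pull the winning node and its neighbors toward each input, so
        # that similar feature vectors map to nearby grid positions.
        for _ in range(epochs):
            for x in samples:
                winner = int(np.argmin(((weights - x) ** 2).sum(axis=1)))
                for i in range(len(weights)):
                    if abs(i - winner) <= radius:
                        weights[i] += lr * (x - weights[i])
        return weights

    rng = np.random.default_rng(0)
    som = train_som(rng.random((8, 3)), rng.random((50, 3)))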
A widely used approach is classification into voiced/unvoiced/silence, in accordance with the different forms of excitation that arise during the creation of speech in the vocal tract.
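A minimal sketch of such a frame-wise voiced/unvoiced/silence decision, using short-time energy and zero-crossing rate as features; the thresholds are illustrative and would have to be tuned on real data:

    import numpy as np

    FRAME = 160

    def classify_frames(signal, energy_thresh=0.01, zcr_thresh=0.25):
        labels = []
        for start in range(0, len(signal) - FRAME + 1, FRAME):
            frame = signal[start:start + FRAME]
            energy = (frame ** 2).mean()
            zcr = (np.diff(np.sign(frame)) != 0).mean()
            if energy < energy_thresh:
                labels.append("silence")    # negligible excitation
            elif zcr > zcr_thresh:
                labels.append("unvoiced")   # noise-like excitation
            else:
                labels.append("voiced")     # quasi-periodic excitation
        return labels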
Regardless of which of these synthesis techniques is used, a problem remains with text-based synthesis devices: although there is generally a fairly close correspondence between the spelling and the pronunciation of a stored train of characters, every language contains words whose pronunciation cannot be determined from their spelling if no context is given. In particular, it is often impossible to specify general phonetic pronunciation rules for proper names. For example, the names of the cities “Itzehoe” and “Laboe” have the same ending, even though the ending of Itzehoe is pronounced “oe” and the ending of Laboe is pronounced “ö”. If these words are provided as trains of characters for synthetic reproduction, the application of a general rule would cause the endings of both city names to be pronounced either “ö” or “oe”, which would result in an incorrect pronunciation whenever the “ö” version is used for Itzehoe or the “oe” version is used for Laboe. If these special cases are to be taken into account, the corresponding words of the language must be given special treatment for reproduction. However, this also means that purely text-based input can no longer be used for all words intended to be reproduced later on.
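One common way to realize such special treatment is an exception lexicon that is consulted before any general letter-to-sound rules are applied; the following sketch is grossly simplified, and both tables are hypothetical:

    # Exception lexicon for names whose pronunciation breaks the rules.
    EXCEPTIONS = {"laboe": "l a b ö"}

    def letter_to_sound(word):
        # Stand-in for general pronunciation rules (grossly simplified).
        return " ".join(word)

    def pronounce(word):
        # Exceptions take precedence; general rules are the fallback.
        return EXCEPTIONS.get(word.lower(), letter_to_sound(word.lower()))

    print(pronounce("Laboe"))    # exception entry: ending rendered as "ö"
    print(pronounce("Itzehoe"))  # no entry, general rule applies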
Because giving certain words of a language special treatment is extremely complex, announcements output by voice-controlled devices are currently made up of a combination of recorded and synthesized speech. For example, in a route finder the desired destination, which is specified by the user and whose pronunciation often exhibits peculiarities compared to other words of the language, is recorded and copied into the corresponding destination announcement. For the destination announcement “Itzehoe is three kilometers away”, the portion “is three kilometers away” would be synthesized and the rest, the word “Itzehoe”, would be taken from the user's spoken destination input. The same situation arises when setting up mailboxes, where the user is required to speak his or her name. In this case, to avoid these complexities, the announcement played back when a caller is connected to the mailbox is created from the synthesized portion “You have reached the mailbox of” and the original recording, e.g. “John Smith”, made when the mailbox was set up.
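A minimal sketch of assembling such a mixed announcement, with a placeholder text-to-speech function and recording store standing in for the system's actual components:

    import numpy as np

    # Hypothetical store holding the utterance recorded at setup time.
    RECORDINGS = {"mailbox_owner": np.zeros(8000)}

    def tts(text):
        # Placeholder synthesizer: returns a silent buffer per word.
        return np.zeros(4000 * len(text.split()))

    def mailbox_greeting():
        # Synthesized carrier phrase + recorded (not synthesized) name.
        return np.concatenate([tts("You have reached the mailbox of"),
                               RECORDINGS["mailbox_owner"]])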
Apart from th
