Reexamination Certificate
1998-12-16
2001-03-27
Hudspeth, David R. (Department: 2641)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Synthesis
C704S261000
active
06208968
ABSTRACT:
BACKGROUND OF THE INVENTION
Generally speaking, a “speech synthesizer” is a computer device or system for generating audible speech from written text. That is, a written form of a string or sequence of characters (e.g., a sentence) is provided as input, and the speech synthesizer generates the spoken equivalent or audible characterization of the input. The generated speech output is not merely a literal reading of each input character, but a language-dependent, in-context verbalization of the input. For example, if the input is the phone number (508) 691-1234 given in response to a prior question of “What is your phone number?”, the speech synthesizer does not produce the reading “parenthesis, five hundred eight, close parenthesis, six hundred ninety-one . . . ”. Instead, the speech synthesizer recognizes the context and supporting punctuation and produces the spoken equivalent “five (pause) zero (pause) eight (pause) six . . . ” just as an English-speaking person normally pronounces a phone number.
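The phone number example above can be illustrated with a minimal sketch. It assumes the input has already been identified as a phone number (the context detection itself is not shown), and the function name speak_phone_number and the table DIGIT_NAMES are hypothetical names chosen for illustration, not part of any described system.

```python
import re

# Spoken names for the digits 0-9 (illustrative English readings).
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def speak_phone_number(text):
    """Read a phone number digit by digit, skipping punctuation.

    Assumes the caller already knows `text` is a phone number;
    a real synthesizer would infer this from context.
    """
    digits = re.findall(r"[0-9]", text)
    return " (pause) ".join(DIGIT_NAMES[int(d)] for d in digits)

spoken = speak_phone_number("(508) 691-1234")
print(spoken)
```

The parentheses, space, and hyphen are dropped rather than read aloud, and each digit is verbalized on its own instead of being grouped into numbers such as "five hundred eight".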
Historically the first speech synthesizers were formed of a dictionary, engine and digital vocalizer. The dictionary served as a look-up table. That is, the dictionary cross referenced the text or visual form of a character string (e.g., word or other unit) and the phonetic pronunciation of the character string/word. In linguistic terms the visual form of a character string unit (e.g., word) is called a “grapheme” and the corresponding phonetic pronunciation is termed a “phoneme”. The phonetic pronunciation or phoneme of character string units is indicated by symbols from a predetermined set of phonetic symbols.
The engine is the working or processing member that searches the dictionary for a character string unit (or combination thereof) matching the input text. In basic terms, the engine performs pattern matching between the sequence of characters in the input text and the sequence of characters in “words” (character string units) listed in the dictionary. Upon finding a match, the engine obtains the corresponding phoneme or combination of phonemes from the dictionary entry (or combination of entries) of the matching word (or combination of words). To that end, the engine can be thought of as translating a grapheme (input text) to a corresponding phoneme (the corresponding symbols indicating pronunciation of the input text).
Typically the engine employs a binary search through the dictionary for the input text. The dictionary is loaded into the computer processor physical memory space (RAM) along with the speech synthesizer program. The memory footprint, i.e., the physical memory space in RAM needed while running the speech synthesizer program, thus must be large enough to hold the dictionary. Because the dictionary portion of today's speech synthesizers continues to grow in size, the memory footprint is problematic due to the limited available memory (RAM and ROM) in some or most applications.
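As a rough sketch of the dictionary look-up described above, the following performs a binary search over an in-memory list of entries sorted by grapheme. The entries, the phoneme symbols, and the names DICTIONARY, GRAPHEMES and lookup are all hypothetical and chosen only for illustration; they are not taken from any actual synthesizer.

```python
from bisect import bisect_left

# Toy dictionary: (grapheme, phoneme string) pairs, sorted by grapheme.
# The phoneme symbols are made up for this example.
DICTIONARY = sorted([
    ("cat", "k ae t"),
    ("dog", "d ao g"),
    ("speech", "s p iy ch"),
    ("text", "t eh k s t"),
])

# Parallel list of keys so bisect can compare plain strings.
GRAPHEMES = [grapheme for grapheme, _ in DICTIONARY]

def lookup(word):
    """Binary-search the sorted dictionary; return the phoneme string or None."""
    i = bisect_left(GRAPHEMES, word)
    if i < len(GRAPHEMES) and GRAPHEMES[i] == word:
        return DICTIONARY[i][1]
    return None

print(lookup("speech"))
print(lookup("zebra"))
```

The whole table has to be resident in memory for the search to run, which is the memory footprint concern raised above: every additional entry enlarges both lists.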
The digital vocalizer receives the phoneme data generated by the engine. Based on the phoneme data together with timing and stress data, the digital vocalizer generates sound signals for “reading” or “speaking” the input text. Typically, the digital vocalizer employs a sound and speaker system for producing the audible characterization of the input text.
To improve on memory requirements of speech synthesizers, another design was developed. In that design, the dictionary is replaced by a rule set. Alternatively, the rule set is used in combination with the dictionary instead of completely substituting therefor. At any rate, the rule set is a group of statements in the form
IF (condition)-then-(phonemic result)
Each such statement determines the phoneme for a grapheme that matches the IF condition. Examples of rule-based speech synthesizers are DECTALK by Digital Equipment Corporation of Maynard, Mass. and TrueVoice by Centigram Communications of San Jose, Calif. Though the use of rule sets reduces the number of entries required in a dictionary for a speech synthesizer system, the dictionaries remain relatively large in size (i.e., number of entries) compared to other parts of the system requiring memory. This is problematic because dictionaries must be completely stored in memory during the speech synthesis process to ensure fast and efficient look-up of entries if needed.
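A rule set of this IF (condition)-then-(phonemic result) form can be sketched as an ordered list of pattern and phoneme pairs. This is only an illustration under simplifying assumptions: the conditions here are plain letter sequences with no surrounding context, and the rules, phoneme symbols, and the name rules_to_phonemes are hypothetical, not those of DECTALK or TrueVoice.

```python
# Each rule reads: IF the text at the current position starts with
# `pattern`, THEN emit `phoneme`. Multi-letter patterns are listed
# first so they are tried before single letters.
RULES = [
    ("ch", "ch"),
    ("ee", "iy"),
    ("sh", "sh"),
    ("a", "ae"),
    ("c", "k"),
    ("p", "p"),
    ("s", "s"),
    ("t", "t"),
]

def rules_to_phonemes(word):
    """Translate a grapheme string using RULES; return None if no rule fits."""
    phonemes = []
    pos = 0
    while pos < len(word):
        for pattern, phoneme in RULES:
            if word.startswith(pattern, pos):
                phonemes.append(phoneme)
                pos += len(pattern)
                break
        else:
            # No rule's IF condition matched at this position.
            return None
    return " ".join(phonemes)

print(rules_to_phonemes("speech"))
print(rules_to_phonemes("cat"))
```

Words that the rules cannot handle (here, anything containing an uncovered letter) are the ones that would still need a dictionary entry.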
These and other problems exist in speech synthesizer technology. New solutions have been attempted but with little success. As a result, highly accurate and/or memory space efficient speech synthesizers are yet to come.
SUMMARY OF THE INVENTION
Dictionaries used by text-to-speech synthesis systems may grow to become quite large. Dictionary size depends on how many words or word portions in a particular language are determined to be too complex, too difficult or too time consuming to translate into phonemes by rule set processing alone. Such words or word portions are candidates to be included as entries in the dictionary. However, certain problems are encountered when large dictionaries are used in text-to-speech synthesis systems as mentioned above.
The invention recognizes the problems with prior art text-to-speech synthesis systems that use dictionaries and provides a method and apparatus to reduce the overall size of the dictionaries used in such systems. Specifically, the invention uses a two phase dictionary reduction process to eliminate entries that are not required in the dictionary. In phase one, any entries in the dictionary with respective phonemes that can be fully generated by rules in a rule set are marked or indicated to be deleted from the dictionary. In phase two, any entries in the dictionary, called root word entries, that can provide phonemes for the text-to-speech translation process of larger (longer) entries are marked or indicated to be saved in the dictionary, and the entries of longer character strings that can be translated using the shorter root word entries in conjunction with rules are indicated to be deleted from the dictionary. After phase one and/or phase two are complete, the invention aggregates the entries marked to be saved or removes the entries marked to be deleted and the resulting set of entries is stored as the reduced dictionary.
Phase one or phase two of the invention each may be performed independently, followed by the aggregation step. Alternatively, phase one may be followed by phase two and then by the aggregation process.
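The phase two idea and the aggregation step described above can be sketched as follows. This is a minimal illustration, assuming that affix rules are simple suffix pairs and that a longer entry is redundant when its phoneme string equals the root's phoneme string plus the suffix's phonemes. The names AFFIX_RULES, phase_two and aggregate, the toy entries, and the phoneme symbols are hypothetical.

```python
# Hypothetical affix rules: (grapheme suffix, phoneme suffix).
AFFIX_RULES = [("s", "z"), ("ing", "ih ng")]

def phase_two(dictionary):
    """Mark longer entries that a root word entry plus an affix rule reproduces."""
    to_delete = set()
    for root_grapheme, root_phoneme in dictionary.items():
        for suffix_grapheme, suffix_phoneme in AFFIX_RULES:
            derived_grapheme = root_grapheme + suffix_grapheme
            derived_phoneme = root_phoneme + " " + suffix_phoneme
            # Delete the longer entry only if root + affix gives its exact phonemes.
            if dictionary.get(derived_grapheme) == derived_phoneme:
                to_delete.add(derived_grapheme)
    return to_delete

def aggregate(dictionary, to_delete):
    """Build the reduced dictionary from the entries not marked for deletion."""
    return {g: p for g, p in dictionary.items() if g not in to_delete}

dictionary = {
    "dog": "d ao g",
    "dogs": "d ao g z",
    "sing": "s ih ng",
    "singing": "s ih ng ih ng",
    "cats": "k ae t s",  # no root entry "cat", and the suffix rule gives "z"
}

marked = phase_two(dictionary)
reduced = aggregate(dictionary, marked)
print(sorted(reduced))
```

In this toy run the root word entries "dog" and "sing" are kept, while "dogs" and "singing" are dropped because the roots and affix rules can regenerate them. The entry "cats" stays because nothing else in the dictionary can reproduce it.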
In order for embodiments of phase one to determine if the phoneme of an entry in the dictionary can be fully generated (and hence the dictionary entry can be fully matched) by using the rule set, the invention method and apparatus generate a rule-based phoneme string for the grapheme string of the subject entry and then determine if the rule-based phoneme string matches the corresponding phoneme string of the entry. If there is a match, the subject entry is indicated to be deleted from the dictionary, thus reducing overall dictionary size. Since rules alone can produce the required phoneme string for the subject entry, the invention recognizes that there is no need for the entry to remain in the dictionary.
Embodiments of phase one may also check if the grapheme string of a dictionary entry is a homograph. If so, the preferred embodiment skips to the next entry in the dictionary for processing. A homograph is a word that has one spelling but can be pronounced in more than one way, such as “abstract”, “wind”, and “record”. Due to multiple pronunciations, homograph dictionary entries are skipped since they may have more than one associated phoneme string. During text-to-speech processing, the correct phoneme string is selected from a homograph dictionary entry based on the context of surrounding language in the text being translated.
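Phase one, including the homograph check, can be sketched as below. This is an illustration only: the letter-by-letter rule function toy_rules stands in for a real rule set, the homograph set is supplied by hand, and every name, entry, and phoneme symbol is a hypothetical choice. For simplicity each toy entry holds a single phoneme string, whereas a real homograph entry may hold several.

```python
# Stand-in rule set: one phoneme per letter (purely illustrative).
LETTER_RULES = {
    "c": "k", "a": "ae", "t": "t", "w": "w",
    "i": "ih", "n": "n", "d": "d", "o": "aa",
}

def toy_rules(grapheme):
    """Generate a rule-based phoneme string, or None if a letter has no rule."""
    if any(letter not in LETTER_RULES for letter in grapheme):
        return None
    return " ".join(LETTER_RULES[letter] for letter in grapheme)

def phase_one(dictionary, rule_fn, homographs):
    """Mark entries whose phonemes the rules alone can fully generate."""
    to_delete = set()
    for grapheme, phoneme in dictionary.items():
        if grapheme in homographs:
            continue  # homographs are skipped and stay in the dictionary
        if rule_fn(grapheme) == phoneme:
            to_delete.add(grapheme)
    return to_delete

dictionary = {
    "cat": "k ae t",     # rules reproduce this exactly
    "wind": "w ih n d",  # homograph: skipped even though rules match
    "two": "t uw",       # rules would give "t w aa", so it must stay
}

marked = phase_one(dictionary, toy_rules, homographs={"wind"})
print(marked)
```

Only "cat" is marked for deletion here. "wind" survives because it is a homograph, and "two" survives because the rule-based phoneme string does not match the stored one.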
Embodiments of phase two determine if dictionary entries, referred to as root word entries, are required in the dictionary. This is accomplished by the invention combining grapheme and phoneme strings of the root word entry from the dictionary with respective grapheme and phoneme portions of an affix rule of an affix
Kopec Thomas
Lin Ginger Chun-Che
Vitale Anthony J.
Azad Abul K.
Compaq Computer Corporation
Hamilton Brook Smith & Reynolds P.C.
Hudspeth David R.