Rule-based learning of word pronunciations from training...

Data processing: speech signal processing – linguistics – language – Speech signal processing – Synthesis


Details

Classification: C704S254000
Type: Reexamination Certificate
Status: active
Patent number: 06411932

ABSTRACT:

FIELD OF THE INVENTION
This invention relates to text-to-pronunciation systems and, more particularly, to rule-based learning of word pronunciations from training corpora or sets of pronunciations.
BACKGROUND OF THE INVENTION
In this document we will frequently refer to phonemes and graphemes (letters). Graphemes are enclosed in single quotation marks (e.g. ‘abc’). In fact, any symbol(s) within single quotation marks refer to graphemes, or grapheme sequences.
Phonemes or phoneme sequences are enclosed in parentheses; (′m uw). We will use the ASCII representation for the English phoneme set. See Charles T. Hemphill, EPHOD, Electronic PHOnetic Dictionary, Texas Instruments, Dallas, Tex., USA, Edition 1.1, May 12, 1995. Stress levels are not marked in most examples, as they are not important to the discussion. In fact, we will assume that the stress information belongs directly to the vowels, so the above phoneme sequence will be denoted as (m ′uw) or simply (m uw). Schwas are represented either as unstressed vowels (.ah) or using their special symbol (ax) or (.ax).
Grapheme-phoneme correspondences (partial or whole pronunciations) are represented by connecting the graphemes to the phoneme sequence (e.g. ‘word’→(w er d)). The grapheme-phoneme correspondences usually do not contain stress marks.
Grapheme or phoneme contexts are represented by listing the left and right contexts, and representing the symbol of interest with an underscore (e.g. (b_1)). Word boundaries in contexts are denoted with a dollar sign (e.g. ‘$x_’).
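For illustration only (this sketch is not part of the patent text; the Python representation is an assumption), the notation above maps directly onto simple data structures:

# Illustrative sketch of the notation described above (an assumption,
# not code from the patent). Graphemes are plain strings; phonemes are
# space-separated ASCII symbols, with stress attached to the vowel.

# A grapheme-phoneme correspondence such as ‘word’→(w er d):
correspondence = ("word", "w er d")

# A context: left and right parts around the symbol of interest (‘_’),
# with ‘$’ marking a word boundary, e.g. the grapheme context ‘$x_’:
context = {"left": "$", "symbol": "x", "right": ""}

print(correspondence)   # ('word', 'w er d')
print(context)          # {'left': '$', 'symbol': 'x', 'right': ''}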
In this decade, speech as a medium is becoming a more prevalent component in consumer computing. Games, office productivity and entertainment products use speech as a natural extension to visual interfaces. Some programs use prerecorded digital audio files to produce speech, while other programs use speech synthesis systems. The advantage of the latter is that such systems can generate a broad range of sentences and can therefore be used to present dynamic information. Nevertheless, their speech quality is usually lower than that of prerecorded audio segments.
Speech recognition systems are also becoming more and more accessible to average consumers. A drawback of these systems is that speech recognition is a computationally expensive process and requires a large amount of memory; nonetheless, powerful computers are becoming available for everyday people.
Both speech synthesis and speech recognition rely on the availability of pronunciations for words or phrases. Earlier systems used pronunciation dictionaries to store word pronunciations. However, it is possible to generate word pronunciations from language-specific pronunciation rules. In fact, even early systems used algorithms to generate pronunciations for words not in their pronunciation dictionary. Also, since pronunciation dictionaries tend to be large, it would be reasonable to store pronunciations only for words that are difficult to pronounce, namely, for words that the pronunciation generator cannot correctly pronounce.
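As a rough sketch of this idea (an assumption about how such a hybrid might be organized, not the invention's implementation; the words, pronunciations, and function names are hypothetical), an exception dictionary can be consulted first, with a rule-based generator as the fallback:

# Hypothetical sketch of an exception dictionary backed by a rule-based
# generator: only words the generator would mispronounce are stored.

exceptions = {
    "colonel": "k 'er n ax l",   # hard to derive from the spelling
    "yacht":   "y 'aa t",
}

def rule_based_pronounce(word):
    # Stand-in for a language-specific letter-to-sound rule engine;
    # here it merely spells the word out letter by letter.
    return " ".join(word)

def pronounce(word):
    # Exceptions take priority; everything else goes through the rules.
    return exceptions.get(word.lower(), rule_based_pronounce(word))

print(pronounce("yacht"))   # y 'aa t  (from the exception dictionary)
print(pronounce("cat"))     # c a t    (from the placeholder rules)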
Speech recognizers are becoming an important element of communication systems. These recognizers often have to recognize arbitrary phrases, especially when the information to be recognized comes from an on-line, dynamic source. To make this possible, the recognizer has to be able to produce pronunciations for arbitrary words. Because of space requirements, speech systems need a compact yet robust method of generating word pronunciations.
A myriad of approaches have been proposed for text-to-pronunciation (TTP) systems. In addition to using a simple pronunciation dictionary, most systems use rewrite rules, which have proven to be quite well-adapted to the task at hand. Unfortunately, these rules are handcrafted; thus, the effort put into producing them needs to be repeated when a new language comes into focus. To solve this problem, more recent methods use machine-learning techniques, such as neural networks, decision trees, instance-based learning, Markov models, analogy-based techniques, or data-driven solutions to automatically extract pronunciation information for a specific language. See François Yvon, Grapheme-to-Phoneme Conversion Using Multiple Unbounded Overlapping Chunks, PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON NEW METHODS IN LANGUAGE PROCESSING, No. 2, Ankara, Turkey, 1996. Also available at internet address xxx.lanl.gov/list/cmp-lg/9608#cmp-lg/9608006.
A review of some of these approaches follows. It is difficult to objectively compare the performance of these methods, as each is trained and tested using different corpora and different scoring functions. Nevertheless, an overall assessment of each approach is presented.
The simplest way to generate word pronunciations is to store them in a pronunciation dictionary. The advantage of this solution is that the lookup is very fast; in fact, the lookup time is constant if a hash table is used. It is also capable of capturing multiple pronunciations for words with no additional complexity. The major drawback of dictionaries is that they cannot seamlessly handle words that are not in them. They also take up a lot of space (O(N), where N is the number of words; O(f(x)) is the mathematical notation for the order of magnitude).
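A minimal sketch of the dictionary approach follows (illustrative only; the entries shown are assumptions). A hash table keyed by spelling gives constant-time lookup and holds multiple pronunciations without extra machinery, but it returns nothing for out-of-vocabulary words:

# Sketch of a pronunciation dictionary as a hash table (illustrative only).
# Each word maps to a list of pronunciations, so multiple variants add no
# structural complexity; lookup is expected O(1), storage is O(N) words.

pronunciations = {
    "read": ["r 'iy d", "r 'eh d"],   # multiple pronunciations
    "word": ["w 'er d"],
}

def lookup(word):
    # Returns None for out-of-vocabulary words -- the dictionary cannot
    # handle words that are not stored in it.
    return pronunciations.get(word.lower())

print(lookup("read"))     # ["r 'iy d", "r 'eh d"]
print(lookup("unknown"))  # None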
A somewhat more flexible solution is to generate pronunciations for words based on their spelling. In a pronunciation system developed by the Advanced Research Projects Agency (ARPA), each letter (grapheme) is pronounced based on its grapheme context. An example for English would be to
pronounce ‘e’ in the context ‘_r$’ as (er).  (1.1)
The system consists of a set of rules, each containing a letter context and a phoneme sequence (pronunciation) corresponding to the letter of interest, which is underlined. The representation of the above rule (1.1) would be:
‘er$’→(er).  (1.2)
These pronunciation rules are generated by a human expert for the given language. The advantage of this system is that it can produce pronunciations for unknown words; in fact, every word is treated as unknown. Also, this method can encapsulate pronunciation dictionaries, as entire words can be used as contexts. Furthermore, this method can produce multiple pronunciations for words, since the phoneme sequences in the rules can be arbitrary. The disadvantage of the system is that it cannot take advantage of phonetic features; thus, it requires an extensive rule set. Also, a human expert is needed to produce the rules; therefore, it is difficult to switch to a different language. Moreover, it pronounces each letter as a unit, which seems counter-intuitive.
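As a sketch of how such letter-context rules might be applied (the rule table, matching scheme, and example word below are assumptions, not the ARPA system itself), each letter is matched against an ordered list of context rules, with rule (1.1)/(1.2) included:

# Illustrative sketch of applying letter-context rules like (1.1)/(1.2).
# Each rule is (left context, letter, right context, phoneme sequence),
# where '$' marks a word boundary.

rules = [
    ("", "e", "r$", "er"),   # rule (1.2): 'e' in the context '_r$' is (er)
    ("", "r", "$",  ""),     # word-final 'r' is already covered above
    ("", "w", "",   "w"),
    ("", "a", "",   "ao"),
    ("", "t", "",   "t"),
]

def pronounce(word):
    text = "$" + word + "$"            # add word-boundary markers
    phones = []
    for i, letter in enumerate(word):
        pos = i + 1                    # position of this letter in text
        for left, mid, right, phoneme in rules:
            if (mid == letter
                    and text[:pos].endswith(left)
                    and text[pos + 1:].startswith(right)):
                if phoneme:
                    phones.append(phoneme)
                break
    return " ".join(phones)

print(pronounce("water"))   # w ao t er  (with this toy rule set)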
The rule-based transliterator (RBT) uses transformation rules to produce pronunciations. See Caroline B. Huang et al., Generation of Pronunciations from Orthographies Using Transformation-Based Error-Driven Learning, INTERNATIONAL CONFERENCE ON SPEECH AND LANGUAGE PROCESSING, pp. 411-414, Yokohama, Japan, 1994. It was written in the framework of the theory of phonology by Chomsky and Halle, and it uses phonetic features and phonemes. See Noam Chomsky and M. Halle, The Sound Pattern of English, HARPER & ROW, New York, N.Y., USA, 1968. Rewrite rules are formulated as
α→β/γ_δ  (1.3)
which stands for
α is rewritten as β in the context of γ (left) and δ (right).
Here, α, β, γ, and δ can each be either graphemes or phonemes. Each phoneme is portrayed as a feature bundle; thus, rules can refer to the phonetic features of each phoneme. Rewrite rules are generated by human experts, and are applied in a specific order.
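A bare-bones sketch of applying ordered rewrite rules of the form α→β/γ_δ follows (an assumption about the mechanics only; the actual RBT also reasons over phonetic feature bundles, which this toy example omits):

import re

# Illustrative sketch of ordered rewrite rules alpha -> beta / gamma _ delta.
# Later rules see the output of earlier ones, so rewritten material becomes
# context for the rules that follow. The rule set itself is a toy example.

rules = [
    # (alpha, beta, gamma, delta), applied in this fixed order
    ("ph", "f",  "",  ""),     # 'ph' is rewritten as 'f' anywhere
    ("o",  "ow", "",  "ne"),   # 'o' is rewritten as 'ow' before 'ne'
    ("e",  "",   "n", ""),     # 'e' after 'n' is deleted (toy rule)
]

def apply_rules(s):
    for alpha, beta, gamma, delta in rules:
        pattern = re.escape(alpha)
        if gamma:                                  # left context, if any
            pattern = "(?<=" + re.escape(gamma) + ")" + pattern
        if delta:                                  # right context, if any
            pattern += "(?=" + re.escape(delta) + ")"
        s = re.sub(pattern, beta, s)
    return s

print(apply_rules("phone"))   # fown  (ph->f, o->ow before 'ne', e dropped)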
This method is similar to the simple context-based ARPA method described above. One improvement is that this system can make use of phonetic features to generalize pronunciation rules. Also, it can capture more complex pronunciation rules, because applied rules change the pronunciations, which become the context for future rules. The major disadvantage of this solution is that a human expert is still needed to produce the rules.
