Electrical audio signal processing systems and devices – Monitoring/measuring of audio devices – Loudspeaker operation
Patent
1989-06-19
1992-10-06
Shaw, Dale M.
381 52, G10L 501
active
051539136
DESCRIPTION:
BRIEF SUMMARY
Background of the Invention
1. Field of the Invention
This invention relates to a method and apparatus for generating speech from a library of prerecorded, digitally stored, spoken, coarticulated speech segments and includes generating such speech by expanding and connecting in real time, digital time domain compressed coarticulated speech segment data.
2. Background Information
A great deal of effort has been expended in attempts to artificially generate speech. For the purposes of this discussion, artificially generating speech means selecting from a library of sounds a desired sequence of utterances to produce a desired message. The sounds can be recorded human sounds or synthesized sounds. In the latter case, the characteristic sounds of a particular language are analyzed and waveforms of the dominant frequencies, known as formants, are generated to synthesize the sound.
The sounds, whether recorded human sounds or synthesized sounds, from which speech is artificially generated can, of course, be complete words in the given language. Such an approach, however, produces speech with a limited vocabulary capability or requires a tremendous amount of data storage space.
In order to generate speech more efficiently, systems have been devised which store phonemes, which are the smallest units of speech that serve to distinguish one utterance from another in a given language. These systems operate on the principle that any word may be generated through proper selection of a phoneme or a sequence of phonemes. For instance, in the English language there are approximately 40 phonemes, so that any word in the English language can be produced by a suitable combination of these 40 phonemes. However, the sound of each phoneme is affected by the phonemes which precede and succeed it in a given word. As a result, systems to date which concatenate phonemes have been only moderately successful in generating understandable, let alone natural-sounding, speech.
It has long been recognized that diphones offer the possibility of generating realistic sounding speech. Diphones span two phonemes and thus take into account the effect on each phoneme of the surrounding phonemes. The basic number of diphones in a given language is then equal to the square of the number of phonemes, less any phoneme pairs which are never used in that language. In the English language this accounts for somewhat fewer than 1600 diphones. However, in some instances a phoneme is affected by other phonemes in addition to those adjacent, or there is a blending of adjacent phonemes. Thus, a library of diphones for the English language may include up to about 1700 entries to accommodate all the special cases.
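The diphone count described above is simple arithmetic, and a minimal sketch may make it concrete. The function names are hypothetical, and the counts of unused pairs and special cases passed in the usage note are made-up placeholders; the source gives only the approximate totals (about 40 phonemes, somewhat fewer than 1600 basic diphones, up to about 1700 library entries).

```python
def diphone_upper_bound(num_phonemes: int) -> int:
    """Ordered phoneme pairs before any unused pairs are removed."""
    return num_phonemes ** 2


def diphone_library_size(num_phonemes: int,
                         unused_pairs: int = 0,
                         special_cases: int = 0) -> int:
    """Estimated library entries.

    unused_pairs: phoneme pairs never used in the language (assumed input).
    special_cases: extra entries for non-adjacent influence or blending
                   (assumed input).
    """
    if unused_pairs > diphone_upper_bound(num_phonemes):
        raise ValueError("unused_pairs exceeds the number of possible pairs")
    return diphone_upper_bound(num_phonemes) - unused_pairs + special_cases
```

For 40 phonemes, `diphone_upper_bound(40)` gives the 1600 ceiling mentioned in the text. Calling `diphone_library_size(40, unused_pairs=100, special_cases=200)` with those purely illustrative inputs lands at 1700, the order of magnitude the text cites for an English library.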
The diphone is referred to as a coarticulated speech segment since it is composed of smaller speech segments, phonemes, which are uttered together to produce a unique sound. Coarticulated speech segments larger than the diphone include syllables, disyllables (two syllables), words and phrases. As used throughout, the term coarticulated speech segment is meant to encompass all such speech.
While it may be possible to construct a speech generator which produces a desired message from whole words or phrases stored in analog form, the access times required for generating real-time speech from phonemes, diphones or syllables can only be achieved using digital storage techniques. However, the complex waveforms of speech require a great deal of data storage to produce quality speech. Digital storage of words and phrases also provides better access times, but requires even greater storage capacity.
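To give a rough sense of the storage demand mentioned above, the following sketch estimates the size of uncompressed, uniformly sampled speech. The function name, the default sampling rate of 8000 samples per second, the 8 bits per sample, and the segment duration in the usage note are all illustrative assumptions, not figures stated for any particular system.

```python
def pcm_bytes(duration_s: float,
              sample_rate_hz: int = 8000,   # assumed sampling rate
              bits_per_sample: int = 8) -> int:  # assumed resolution
    """Bytes needed to store uncompressed sampled audio of a given length."""
    num_samples = round(duration_s * sample_rate_hz)
    total_bits = num_samples * bits_per_sample
    return (total_bits + 7) // 8  # round up to whole bytes


def library_bytes(num_segments: int, avg_duration_s: float,
                  sample_rate_hz: int = 8000,
                  bits_per_sample: int = 8) -> int:
    """Total bytes for a library of equally long segments (a simplification)."""
    return num_segments * pcm_bytes(avg_duration_s, sample_rate_hz,
                                    bits_per_sample)
```

Under these assumptions one second of speech takes 8000 bytes. A hypothetical library of 1700 segments averaging 0.15 seconds each would need about 2 MB, and the figure grows in proportion to the bits per sample, which suggests why compressing the stored data is attractive.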
In digitally storing sounds, the desired waveform is pulse code modulated by periodically sampling waveform amplitude. As is well known, the bandwidth of the digital signal is only one half the sampling rate. Thus, for a bandwidth of 4 kHz, a sampling rate of 8 kHz is required. Furthermore, because of the wide dynamic range of speech signals, quality reproduction requires that each sample have a sufficient number of bits to provide adequate resolution of w
REFERENCES:
patent: 4319084 (1982-03-01), Lucchini et al.
patent: 4437087 (1984-03-01), Petr
patent: 4672670 (1987-06-01), Wang et al.
patent: 4685135 (1987-08-01), Lin et al.
patent: 4691359 (1987-09-01), Morito
patent: 4799261 (1989-01-01), Lin et al.
patent: 4833718 (1989-05-01), Sprague
Electronique Industrielle No. 70/1-05-1984 Synthese de la parole: presque de la HiFi!, pp. 37-42.
298 N.E.C. Research & Development, (1984), Apr., No. 73, Tokyo, Japan, SR-2000 Voice Processor and Its Applications, pp. 98-105.
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-22, No. 5, Oct. 1974, A Multiline Computer Voice Response System Utilizing ADPCM Coded Speech, Rosenthal et al., pp. 339-352.
Kandefer Edward M.
Mosenfelder James R.
Doerrler Michelle
Shaw Dale M.
Sound Entertainment, Inc.
Westerhoff Richard V.
Profile ID: LFUS-PAI-O-1196391