System and method for automatic classification of speech...
Classification: Data processing: speech signal processing, linguistics, language – Speech signal processing – Recognition
Subclass: C704S231000
Type: Reexamination Certificate
Filed: 1998-03-31
Issued: 2001-01-09
Examiner: Hudspeth, David R. (Department: 2741)
Status: active
Patent number: 06173260
ABSTRACT:
FIELD OF THE INVENTION
The present invention is generally directed to the field of affective computing, and more particularly concerned with the automatic classification of speech signals on the basis of prosodic information contained therein.
BACKGROUND OF THE INVENTION
Much of the work that has been done to date in connection with the analysis of speech signals has concentrated on the recognition of the linguistic content of spoken words, i.e., what was said by the speaker. In addition, some efforts have been directed to automatic speaker identification, to determine who said the words that are being analyzed. However, the automatic analysis of prosodic information conveyed by speech has largely been ignored. In essence, prosody represents all of the information in a speech signal other than the linguistic information conveyed by the words, including such factors as its duration, loudness, pitch and the like. These types of features provide an indication of how the words were spoken, and thus contain information about the emotional state of the speaker.
Since the affective content of the message is conveyed by the prosody, it is independent of language. In the field of affective computing, therefore, automatic recognition of prosody can be used to provide a universal interactive interface with a speaker. For example, detection of the prosody in speech provides an indication of the “mood” of the speaker, and can be used to adjust colors and images in a graphical user interface. In another application, it can be used to provide interactive feedback during the play of a video game, or the like. As other examples, task-based applications such as teaching programs can employ information about a user to adjust the pace of the task. Thus, if a student expresses frustration, the lesson can be switched to less-demanding concepts, whereas if the student is bored, a humorous element can be inserted. For further information regarding the field of affective computing, and the possible applications of the prosodic information provided by the present invention, reference is made to Affective Computing by R. W. Picard, MIT Press, 1997.
Accordingly, it is desirable to provide a system which is capable of automatically classifying the prosodic information in speech signals, to detect the emotional state of the speaker. In the past, systems have been developed to classify the spoken affect in speech, based primarily upon analysis of the pitch content of speech signals. See, for example, Roy et al., “Automatic Spoken Affect Classification and Analysis”, IEEE Face and Gesture Conference, Killington, Vt., pages 363-367, 1996.
SUMMARY OF THE INVENTION
The present invention is directed to a method and system for classifying speech according to emotional content, which employs acoustic measures in addition to pitch as classification input, in an effort to increase the accuracy of classification. In a preferred embodiment of the invention, two different kinds of features in a speech signal are analyzed for classification purposes. One set of features is based on pitch information that is obtained from a speech signal, and the other set of features is based on changes in the spectral shape of the speech signal. Generally speaking, the overall spectral shape of the speech signal can be used to distinguish long, smoothly varying sounds from quickly changing sounds, which may indicate the emotional state of the speaker. Different variations of pitch and spectral shape features can be measured and analyzed, to assist in the classification of portions of speech, such as individual utterances.
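By way of illustration only, and not as the patent's actual implementation, the following sketch shows one way the two kinds of features described above might be computed from a mono waveform: a crude autocorrelation-based pitch track, and the frame-to-frame change in spectral centroid as a simple proxy for changes in spectral shape. All function names, frame sizes, and thresholds here are editorial assumptions.

# Illustrative sketch only -- not the patented implementation.
# Assumes a mono waveform `signal` (numpy array) sampled at `sr` Hz.
import numpy as np

def frame_signal(signal, sr, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping fixed-length frames."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len:i * hop_len + frame_len]
                     for i in range(n_frames)])

def pitch_track(frames, sr, fmin=75.0, fmax=400.0):
    """Crude per-frame pitch estimate via autocorrelation (0.0 = unvoiced)."""
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    pitches = []
    for frame in frames:
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        if ac[0] <= 0:
            pitches.append(0.0)
            continue
        lag = lag_min + np.argmax(ac[lag_min:lag_max])
        # Treat weak autocorrelation peaks as unvoiced frames.
        pitches.append(sr / lag if ac[lag] / ac[0] > 0.3 else 0.0)
    return np.array(pitches)

def spectral_shape_change(frames):
    """Frame-to-frame change in spectral centroid, a simple proxy for how
    quickly the overall spectral shape of the signal is varying."""
    spectra = np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]), axis=1))
    bins = np.arange(spectra.shape[1])
    centroid = (spectra * bins).sum(axis=1) / (spectra.sum(axis=1) + 1e-9)
    return np.abs(np.diff(centroid))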
In a further preferred embodiment, each selected portion of the speech is divided into three segments for analysis, namely the first, middle and last third of a sound. Each of these three segments is analyzed with respect to the various feature parameters of interest. In addition, the total duration of an utterance can be analyzed with respect to each parameter of interest, to provide various global measures. A subset of these measures is then employed for classification purposes.
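As a further hedged sketch under the same assumptions (again not the patented method), the per-frame feature tracks produced above can be summarized over the first, middle and last third of an utterance, together with global measures over the whole utterance; a subset of the resulting statistics would then be supplied to a classifier.

# Illustrative continuation of the sketch above: summarize a per-frame
# feature track over the first, middle and last third of an utterance,
# plus the utterance as a whole.
import numpy as np

def summarize(track):
    """Simple statistics for one segment of a feature track."""
    track = np.asarray(track, dtype=float)
    if track.size == 0:
        return {"mean": 0.0, "std": 0.0, "min": 0.0, "max": 0.0, "range": 0.0}
    return {"mean": track.mean(), "std": track.std(),
            "min": track.min(), "max": track.max(),
            "range": track.max() - track.min()}

def thirds_features(track):
    """Per-third and global summaries, flattened into one feature dict."""
    track = np.asarray(track, dtype=float)
    thirds = np.array_split(track, 3)              # first, middle, last third
    features = {}
    for name, segment in zip(("first", "middle", "last"), thirds):
        for stat, value in summarize(segment).items():
            features[f"{name}_{stat}"] = value
    for stat, value in summarize(track).items():   # global measures
        features[f"global_{stat}"] = value
    return features

# Hypothetical usage with the earlier sketch (16 kHz utterance assumed):
# frames = frame_signal(signal, sr=16000)
# feats = {**thirds_features(pitch_track(frames, 16000)),
#          **thirds_features(spectral_shape_change(frames))}
# A subset of these measures could then be fed to any standard classifier.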
Further aspects of the invention are described hereinafter in greater detail, with reference to various embodiments illustrated in the accompanying drawings.
REFERENCES:
patent: 3855417 (1974-12-01), Fuller
patent: 4490840 (1984-12-01), Jones
patent: 5537647 (1996-07-01), Hermansky et al.
patent: 5598505 (1997-01-01), Austin et al.
patent: 5774850 (1998-06-01), Hattori et al.
R. Cowie, M. Sawey, and E. Douglas-Cowie, “A New Speech Analysis System: ASSESS (Automatic Statistical Summary of Elementary Speech Structures),” Proc. 13th Int. Cong. of Phonetic Sciences, ICPhS 95, Stockholm, Sweden, Aug. 13-19, 1995, pp. 278-281.
Chen, Lawrence S. et al, “Multimodal Human Emotion/Expression Recognition”, Third IEEE International Conference on Automatic Face and Gesture Recognition, Apr. 14-16, 1998, pp. 366-371.
Rabiner, Lawrence R., “Distortion Measures—Perceptual Considerations”, Fundamentals of Speech Recognition, 1993, pp. 150-200.
Rabiner, Lawrence R., “Linear Predictive Coding of Speech”, Digital Processing of Speech Signals, 1978, pp. 396-461.
Roy, Deb et al, “Automatic Spoken Affect Classification and Analysis”, Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Oct. 14-16, 1996, pp. 363-367.
Slaney, Malcolm et al, “Baby Ears: A Recognition System for Affective Vocalizations”, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, May 12-15, 1998, pp. 985-988.
“Affective Computing”, Affective Computing Research Group at the MIT Media Lab; Oct. 25, 1996, pp. 1-38.
“Cross-Validation and Other Estimates of Prediction Error”; Chapter 17, pp. 237-257.
Agranovski, A.V. et al, “The Research of Correlation Between Pitch and Skin Galvanic Reaction at Change of Human Emotional State”; Eurospeech '97; Spetsvuzavtomatika Design Bureau, 51 Gazetny St., Rostov-on-Don, Russia, pp. 1-4.
Bachorowski, Jo-Anne et al, “Vocal Expression of Emotion: Acoustic Properties of Speech Are Associated With Emotional Intensity and Context”; Psychological Science (American Psychological Society), vol. 6, No. 4, Jul. 1995, pp. 219-224.
Cahn, Janet E., “An Investigation into the Correlation of Cue Phrases, Unfilled Pauses and the Structuring of Spoken Discourse”, Media Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, pp. 1-12.
Cahn, Janet E., “The Generation of Affect in Synthesized Speech”, Journal of the American Voice I/O Society, Jul. 1990, vol. 8, pp. 1-19.
Cummings, Kathleen E. et al, “Analysis of Glottal Waveforms Across Stress Styles”; School of Electrical Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332-0250; pp. 369-372, Apr. 3-6, 1990.
Efron, Bradley et al, “A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation”, The American Statistician, Feb. 1983, vol. 37, No. 1, pp. 36-48.
Efron, Bradley et al, “Improvements on Cross-Validation: The .632+ Bootstrap Method”, Journal of the American Statistical Association, Jun. 1997, vol. 92, No. 438, pp. 548-560.
Engberg, Inger S. et al., “Design, Recording and Verification of a Danish Emotional Speech Database”; Eurospeech '97; Center for PersonKommunikation, Aalborg University, Fredrik Bajers Vej 7 A2, 9220 Aalborg Øst, Denmark, pp. 1-4.
Fernald, Anne, “Human Maternal Vocalizations to Infants as Biologically Relevant Signals: An Evolutionary Perspective”, Parental Care and Children, pp. 391-428.
Fernald, Anne, “Intonation and Communicative Intent in Mothers' Speech to Infants: Is the Melody the Message?”, Stanford University, pp. 1497-1510.
Huron, David, “Sound, Music and Emotion: A Tutorial Survey of Research”.
Jansens, Susan et al; “Perception and Acoustics of Emotions in Singing”; Computer and Humanities Department, Utrecht Institute of Linguistics-OTS, University of Utrecht, Trans 10, 3512 JK Utrecht, the Netherlands, pp. 1-4.
Lipscomb, Scott D. et al, “Perceptual Judgement of the Relationship Between Musical and Visual Components in Film”; Psychomusicology, Spring/Fall 1994, pp. 60
Burns Doane Swecker & Mathis L.L.P.
Hudspeth David R.
Interval Research Corporation
Storm Donald L.