Reexamination Certificate
1999-11-04
2002-04-23
Tsang, Fan (Department: 2645)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S270000
active
06377919
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to systems and methods for automatically describing human speech, and more particularly to systems and methods for characterizing voiced excitations of speech and acoustic signals, removing acoustic noise from speech, and synthesizing human/animate speech.
2. Discussion of Background Art
Sound characterization, simulation, and noise removal relating to human speech are very important ongoing fields of research and commercial practice. Use of EM sensors and acoustic microphones for purposes of human speech characterization has been described in the referenced application, Ser. No. 08/597,596, filed with the U.S. Patent Office, which is incorporated herein by reference. Said patent application describes methods by which EM sensors can measure positions versus time of human speech articulators, along with substantially simultaneously measured acoustic speech signals, for purposes of more accurately characterizing each segment of human speech. Furthermore, said patent application describes valuable applications of said EM sensor and acoustic methods for purposes of improved speech recognition, coding, speaker verification, and other applications.
A second related U.S. patent, issued on Mar. 17, 1998 as U.S. Pat. No. 5,729,694, titled “Speech Coding, Reconstruction and Recognition Using Acoustics and Electromagnetic Waves,” by J. F. Holzrichter and L. C. Ng, is also incorporated herein by reference. Patent '694 describes methods by which speech excitation functions of humans (or similar animate objects) are characterized using EM sensors, and the substantially simultaneously measured acoustic speech signal is then characterized using generalized signal processing techniques. The excitation characterizations described in '694, as well as in application Ser. No. 08/597,596, rely on associating experimental measurements of glottal tissue interface motions with models to determine an air pressure or airflow excitation function. The measured glottal tissue interfaces include vocal folds, related muscles, tendons, and cartilage, as well as sections of a windpipe (e.g., glottal region) directly below and above the vocal folds.
The procedures described in application Ser. No. 08/597,596 enable new and valuable methods for characterizing the substantially simultaneously measured acoustic speech signal by using the non-acoustic EM signals from the articulators and acoustic structures as additional information. Those procedures use the excitation information, other articulator information, mathematical transforms, and other numerical methods, and describe the formation of feature vectors of information that numerically describe each speech unit over each defined time frame using the combined information. This characterizing speech information is then related to methods and systems, described in said patents and applications, for improving speech application technologies such as speech recognition, speech coding, speech compression, synthesis, and many others.
Another important patent application that is herein incorporated by reference is U.S. patent application Ser. No. 09/205,159, entitled “System and Method for Characterizing, Synthesizing, and/or Canceling Out Acoustic Signals From Inanimate Sound Sources,” filed on Dec. 2, 1998 by G. C. Burnett, J. F. Holzrichter, and L. C. Ng. That application relates generally to systems and methods for characterizing, synthesizing, and/or canceling out acoustic signals from inanimate sound sources, and more particularly to using electromagnetic and acoustic sensors to perform such tasks.
Existing acoustic speech recognition systems suffer from inadequate information for recognizing words and sentences with high probability. The performance of such systems also drops rapidly when noise from machines, other speakers, echoes, airflow, and other sources is present.
In response to the concerns discussed above, what is needed is a system and method for automated human speech that overcomes the problems of the prior art. The inventions herein describe systems and methods to improve speech recognition and other related speech technologies.
SUMMARY OF THE INVENTION
The present invention is a system and method for characterizing voiced speech excitation functions (human or animate) and acoustic signals, for removing unwanted acoustic noise from a speech signal which often occurs when a speaker uses a microphone in common environments, and for synthesizing personalized or modified human (or other animate) speech upon command from a controller.
The system and method of the present invention are particularly advantageous because a low power EM sensor detects tissue motion in a glottal region of a human speech system before, during, and after voiced speech. Such tissue motion is easier to detect than motion of the glottis itself. From these measurements, a human voiced excitation function can be derived. The EM sensor can be optimized to measure sub-millimeter motions of wall tissues in either a sub-glottal or supra-glottal region (i.e., below or above vocal folds), as vocal folds oscillate (i.e., during a glottal open-close cycle). Motions of the sub-glottal wall or supra-glottal wall provide information on glottal cycle timing and on air pressure determination, and can be used for constructing a voiced excitation function. Herein, the terms glottal EM sensor, glottal radar, and GEMS (i.e., glottal electromagnetic sensor) are used interchangeably.
Air pressure increases and decreases in the sub-glottal region as vocal folds close (obstructing airflow) and then open again (enabling airflow), causing the sub-glottal walls to expand and then contract by dimensions ranging from <0.1 mm up to 1 mm. In particular, a rear wall (posterior) section of a trachea is observed to respond directly to increases in sub-glottal pressure as vocal folds close. Timing of air pressure increase is directly related to vocal fold closure (i.e., glottal closure). Herein “trachea” and “sub-glottal windpipe” refer to the same set of tissues. Similarly, supra-glottal walls in a pharynx region expand and contract, but in opposite phase to sub-glottal wall motion. For this document “pharynx” and the “supra-glottal region” are synonyms; also, “time segment” and “time frame” are synonyms.
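The timing relationship described above can be illustrated with a minimal sketch. It is not the patented method: it assumes a synthetic, sinusoidal wall-displacement trace standing in for a real GEMS signal, and it marks one event per glottal cycle at each upward zero crossing of the mean-removed trace. The function names, the 120 Hz pitch, the 8 kHz sampling rate, and the 0.5 mm amplitude are all hypothetical choices made for illustration.

```python
import math


def synth_wall_motion(f0_hz, fs_hz, duration_s, amplitude_mm=0.5):
    """Hypothetical stand-in for a GEMS trace: sinusoidal wall displacement in mm."""
    n = int(round(fs_hz * duration_s))
    return [amplitude_mm * math.sin(2.0 * math.pi * f0_hz * i / fs_hz)
            for i in range(n)]


def rising_crossings(signal):
    """Sample positions (linearly interpolated) where the mean-removed
    signal crosses zero going upward; one such event per cycle."""
    mean = sum(signal) / len(signal)
    x = [s - mean for s in signal]
    positions = []
    for i in range(1, len(x)):
        if x[i - 1] < 0.0 <= x[i]:
            frac = -x[i - 1] / (x[i] - x[i - 1])
            positions.append(i - 1 + frac)
    return positions


def estimate_f0(signal, fs_hz):
    """Estimate the glottal cycle rate in Hz from the mean spacing of
    upward crossings; returns None if fewer than two cycles are found."""
    crossings = rising_crossings(signal)
    if len(crossings) < 2:
        return None
    mean_period = (crossings[-1] - crossings[0]) / (len(crossings) - 1)
    return fs_hz / mean_period


if __name__ == "__main__":
    trace = synth_wall_motion(f0_hz=120.0, fs_hz=8000.0, duration_s=0.1)
    print("estimated cycle rate (Hz):", estimate_f0(trace, 8000.0))
```

Under the opposite-phase behavior described for the supra-glottal walls, the same routine applied to a sign-inverted trace would mark events half a cycle later while yielding the same cycle rate.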
Methods of the present invention describe how to obtain an excitation function by using particular tissue motions associated with glottis opening and closing. These wall tissue motions are measured by EM sensors and then associated with air pressure versus time. This air pressure signal is then converted to an excitation function of voiced speech, which can be parameterized and approximated as needed for various applications. Wall motions are closely associated with glottal opening and closing and with glottal tissue motions.
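The chain from wall motion to pressure to a parameterized excitation function can be sketched as follows. The text here does not specify the model that relates wall motion to air pressure, so this sketch assumes a purely linear wall response with a hypothetical constant `pa_per_mm`, takes the mean-removed pressure over one cycle as the excitation, and parameterizes it by its first few Fourier harmonics. All function names and numeric values are illustrative assumptions, not part of the disclosure.

```python
import math


def displacement_to_pressure(displacement_mm, pa_per_mm):
    """Assumed linear wall response: pressure change proportional to
    wall displacement (pa_per_mm is a hypothetical calibration constant)."""
    return [pa_per_mm * d for d in displacement_mm]


def remove_mean(values):
    """Keep only the time-varying part of the pressure signal."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def harmonic_fit(cycle, n_harmonics):
    """Project one cycle onto its first n_harmonics Fourier harmonics.
    Returns cosine coefficients a and sine coefficients b.
    Requires len(cycle) > 2 * n_harmonics."""
    n = len(cycle)
    a, b = [], []
    for k in range(1, n_harmonics + 1):
        w = 2.0 * math.pi * k / n
        a.append(2.0 / n * sum(v * math.cos(w * i) for i, v in enumerate(cycle)))
        b.append(2.0 / n * sum(v * math.sin(w * i) for i, v in enumerate(cycle)))
    return a, b


def harmonic_synth(a, b, n):
    """Rebuild an n-sample cycle from the harmonic coefficients."""
    out = []
    for i in range(n):
        total = 0.0
        for k in range(1, len(a) + 1):
            w = 2.0 * math.pi * k * i / n
            total += a[k - 1] * math.cos(w) + b[k - 1] * math.sin(w)
        out.append(total)
    return out


if __name__ == "__main__":
    n = 64  # samples in one glottal cycle (illustrative)
    wall_mm = [0.3 * math.sin(2 * math.pi * i / n) + 0.05 for i in range(n)]
    excitation = remove_mean(displacement_to_pressure(wall_mm, pa_per_mm=100.0))
    a, b = harmonic_fit(excitation, n_harmonics=3)
    print("cosine coefficients:", a)
    print("sine coefficients:", b)
```

A small coefficient set of this kind is one way an excitation function could be approximated for a given application; a real system would replace the linear constant with whatever pressure model the measurements support.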
The windpipe tissue signals from the EM sensor also describe periods of no speech or of unvoiced speech. Using the statistics of the user's language, the user of these methods can estimate, to a high degree of certainty, time periods wherein no vocal-fold motion means time periods of no speech, and time periods where unvoiced speech is likely. In addition, unvoiced speech presence and qualities can be determined using information from the EM sensor measuring glottal region wall motion, from a spectrum of a corresponding acoustic signal, and (if used) signals from other EM sensors describing processes of vocal fold retraction, or pharynx diameter enlargement, jaw motions, or similar activities.
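A simplified sketch of this frame-by-frame decision follows. It is an illustration only: it uses RMS energy of the EM-sensor trace and of the acoustic trace in each time frame, whereas the text above also mentions language statistics, the acoustic spectrum, and additional EM sensors, none of which are modeled here. The thresholds, label strings, and function names are hypothetical.

```python
import math


def rms(frame):
    """Root-mean-square level of one time frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(v * v for v in frame) / len(frame))


def label_frames(em_signal, acoustic_signal, frame_len,
                 em_threshold, acoustic_threshold):
    """Label each non-overlapping time frame.

    "voiced":             EM sensor shows glottal-region wall motion.
    "unvoiced-candidate": no wall motion, but acoustic energy is present.
    "no-speech":          neither wall motion nor acoustic energy.
    Both thresholds are hypothetical tuning values.
    """
    n = min(len(em_signal), len(acoustic_signal))
    labels = []
    for start in range(0, n - frame_len + 1, frame_len):
        em_level = rms(em_signal[start:start + frame_len])
        ac_level = rms(acoustic_signal[start:start + frame_len])
        if em_level >= em_threshold:
            labels.append("voiced")
        elif ac_level >= acoustic_threshold:
            labels.append("unvoiced-candidate")
        else:
            labels.append("no-speech")
    return labels


if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * 120.0 * i / 8000.0) for i in range(100)]
    hiss = [0.2 if i % 2 == 0 else -0.2 for i in range(100)]
    silence = [0.0] * 100
    em = [0.5 * v for v in tone] + silence + silence
    acoustic = tone + hiss + silence
    print(label_frames(em, acoustic, 100, 0.05, 0.05))
```

Frames labeled as unvoiced candidates would still need the further evidence described above before being treated as unvoiced speech rather than background noise.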
The EM sensor signals that describe vocal tract tissue motions can also be used to determine acoustic signals being spoken. Vocal tract tissue walls (e.g., pharynx or soft palate), and/or tissue surfaces (e.g., tongue or lips), and/or other tissue surfaces connected to vocal tract wall-tissues (e.g., neck-skin or outer lip surfaces), vibrate in response to acoustic speech signals that propagate in the vocal tract. The EM sensors described in the '596 patent and elsewhere herein, and also methods of tissue response-function removal, enabl
Burnett Greg C.
Holzrichter John F.
Ng Lawrence C.
Opsasnick Michael N.
Scott Eddie E.
The Regents of the University of California
Thompson Alan H.
Tsang Fan