Reexamination Certificate
1999-01-08
2001-08-14
Tsang, Fan (Department: 2748)
Data processing: speech signal processing, linguistics, language
Speech signal processing
For storage or transmission
C704S211000
Reexamination Certificate
active
06275795
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a method and apparatus for extracting information from speech. The invention has particular, although not exclusive, relevance to the extraction of articulatory feature information from a speaker as he/she speaks.
2. Description of the Prior Art
There are several known techniques for diagnosing speech disorders in individual speakers, most of which rely on a comparison of various articulatory features, i.e. the positions of the lips, tongue, mouth, etc, of a “normal” speaker with those of the individual being diagnosed. One technique relies on a clinician extracting from the individual's speech the phonetic content, i.e. the string of phones that make up the speech. Each phone is produced by a unique combination of simultaneously occurring distinct articulatory features, and therefore the articulatory features can be determined and compared with those of a “normal” speaker. However, there are several disadvantages of this technique.
The first disadvantage with this technique is that it is not practical to have a phone for every possible combination of articulatory feature values. Consequently, only the most frequent combinations of articulatory feature values are represented by the set of phones, and so many possible articulations are not represented.
A second disadvantage of a phonetic technique is that the speech is considered as being a continuous stream of phones. However, such a concept of speech is not accurate since it assumes that all the articulatory features change together at the phone boundaries. This is not true, since the articulatory features change asynchronously in continuous speech, which results in the acoustic realisation of a phone being dependent upon its neighbouring phones. This phenomenon is called co-articulation. For example, for the phrase “did you”, the individual phonemes making up this phrase are:
“/d ih d y uw/”
However, the phonetic realisation of the phrase given above during continuous speech, would be:
“/d ih jh uw/”
The final d in “did” is modified, and the word “you” is converted into a word that sounds like “juh”.
A third disadvantage with this technique is that a clinician has to make a phonetic transcription of the individual's speech which is (i) time consuming; (ii) costly, due to the requirement of a skilled clinician; and (iii) unreliable due to possible human error.
Another type of technique uses instruments to determine the positions of the articulatory structures during continuous speech. For example, cinefluorography which involves the photographing of x-ray images of the speaker is one such technique. In order to analyse movement of the articulatory structures, sequences of individual cinefluorographic frames are traced, and measurements are made from the tracings using radiopaque beads, skeletal structures, and/or articulators covered with radiopaque substances.
However, there are a number of disadvantages associated with the use of cinefluorographic techniques—
i) there is a danger of radiation exposure, therefore, the size of the speech sample must be restricted;
ii) the acquisition of data must be under supervision of a skilled radiologist which results in high cost;
iii) the body must be stabilised which might result in an unnatural body posture which may affect the articulation; and
iv) variations in the x-ray data obtained from individual to individual result in reduced reliability of the data measurements.
Ultrasonic imaging is another instrumental technique that allows observation of the dynamic activity of the articulatory structures, but neither interferes with that activity nor exposes the subject to radiation. Ultrasonic imaging uses the reflection of ultrasonic waves from the interface between two media. Since the time between the initiation of an ultrasonic pulse and its return is proportional to the distance from the transmitter to the boundary, information relating to the reflected waves may be used to produce a time-amplitude display indicative of the structure reflecting the waves. This technique, however, suffers from the problem that the observer cannot be sure exactly which point on the structure the return is being measured from, and the transmitter and receiver must also be at 90° to the interface. Therefore, when trying to characterise speech disorders by structural anomalies, it may be particularly difficult to identify the point on the structure being monitored.
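The pulse-echo timing relation described above reduces to simple arithmetic: the one-way distance is half the round-trip time multiplied by the propagation speed. A minimal sketch, in which the 1540 m/s soft-tissue speed and the function name are illustrative assumptions, not values taken from the source:

```python
# Pulse-echo ranging: the wave travels to the reflecting interface
# and back, so the one-way distance is (speed * round_trip_time) / 2.
SPEED_OF_SOUND_TISSUE = 1540.0  # m/s, a typical soft-tissue value (assumed)

def echo_distance(round_trip_time_s: float) -> float:
    """One-way distance from the transducer to the reflecting interface."""
    return SPEED_OF_SOUND_TISSUE * round_trip_time_s / 2.0

# A 100-microsecond round trip corresponds to about 7.7 cm.
print(echo_distance(100e-6))
```

This linearity is what makes the time-amplitude display directly interpretable as depth along the beam.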
A technique for extracting articulatory information from a speech signal has been proposed in “A linguistic feature representation of the speech waveform” by Ellen Eide, J Robin Rohlicek, Herbert Gish and Sanjoy Mitter; International Conference on Acoustics, Speech and Signal Processing, April 1993, Minneapolis, USA, Vol. 2, pages 483-486. In this technique, a whole speech utterance, for example a sentence, is input into the speech analysis apparatus, the utterance then being segmented. This segmentation process uses a computationally intensive dynamic programming method that determines the most likely broad phonetic sequence within the utterance. Consequently, whilst this system allows analysis of the input speech to produce some indication of the positions of some of the articulators, delays are produced due to the necessity of inputting whole speech utterances before any analysis takes place.
U.S. Pat. No. 4,980,917 discloses an apparatus and method for determining the instantaneous values of a set of articulatory parameters. It achieves this by monitoring the incoming speech and selecting a frame of speech for further processing when the monitoring identifies a significant change in the energy of the input speech signal. The further processing includes a spectral analysis and a linear mapping function which maps the spectral coefficients from the spectral analysis into articulatory parameters. However, the system described in U.S. Pat. No. 4,980,917 does not process all the input speech, and those frames of input speech that are processed are treated as separate entities. In other words, the system does not use context information, i.e. it does not consider neighbouring frames, when it determines the articulatory parameter values.
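The energy-triggered frame selection described for U.S. Pat. No. 4,980,917 can be illustrated with a simplified stand-in: compute the short-time energy of each fixed-length frame and flag frames whose energy jumps by more than a threshold. The function name, framing scheme and threshold test here are assumptions for illustration, not the patent's actual procedure:

```python
from typing import List

def select_frames(signal: List[float], frame_len: int, threshold: float) -> List[int]:
    """Indices of frames whose short-time energy differs from the
    previous frame's by more than `threshold` (a simplified stand-in
    for an energy-change trigger)."""
    # Split the signal into consecutive non-overlapping frames.
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    # Short-time energy: sum of squared samples per frame.
    energies = [sum(s * s for s in f) for f in frames]
    selected = []
    for idx in range(1, len(energies)):
        if abs(energies[idx] - energies[idx - 1]) > threshold:
            selected.append(idx)
    return selected

# Silence followed by a louder stretch: only the transition frame is flagged.
print(select_frames([0.0] * 8 + [1.0] * 8, frame_len=4, threshold=1.0))
```

Note how each frame is judged only against its immediate predecessor; this mirrors the criticism above that such frames are treated as separate entities, without wider context.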
SUMMARY OF THE INVENTION
An object of the present invention is to provide an alternative method and apparatus for determining articulatory information from the speech signal of a speaker.
According to a first aspect of the present invention there is provided an apparatus for continuously determining information representative of features of a speech production system from an input speech signal as it arrives.
According to a second aspect of the present invention there is provided an apparatus for extracting, from an input speech signal, information representative of features of the speech production system that generated the input speech signal, the apparatus comprising: memory means arranged to store preprogrammable information representative of training speech signals produced during a training session; dividing means arranged to divide the input speech signal into a succession of frames; defining means arranged to define a succession of segments by grouping consecutive frames having similar acoustic properties of interest into each segment; and extracting means arranged to extract, for each segment, said information representative of features of the speech production system in dependence upon the input speech signal within that segment and upon said preprogrammable information.
According to a third aspect of the present invention, there is provided a method for extracting, from an input speech signal, information representative of features of the speech production system that generated the input speech signal, the method comprising: the step of storing in a memory preprogrammable information representative of training speech signals produced during a training session; the step of dividing the input speech signal into a succession of frames; the step of defining a succession of segments by grouping consecutive frames having similar acoustic properties of interest into each segment; and the step of extracting, for each segment, said information representative of features of the speech production system in dependence upon the input speech signal within that segment and upon said preprogrammable information.
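The claimed pipeline — dividing the signal into frames, then grouping consecutive acoustically similar frames into segments — can be sketched as follows. Frame energy stands in for the unspecified “acoustic properties of interest”, and all names and the tolerance test are illustrative assumptions rather than the patent's implementation:

```python
from typing import List

def divide_into_frames(samples: List[float], frame_len: int) -> List[List[float]]:
    """Divide the input signal into a succession of fixed-length frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def group_into_segments(frames: List[List[float]], tol: float) -> List[List[List[float]]]:
    """Group consecutive frames whose energy (the stand-in acoustic
    property) stays within `tol` of the previous frame's energy."""
    def energy(frame: List[float]) -> float:
        return sum(s * s for s in frame)

    segments = [[frames[0]]]
    for prev, cur in zip(frames, frames[1:]):
        if abs(energy(cur) - energy(prev)) <= tol:
            segments[-1].append(cur)   # similar energy: same segment
        else:
            segments.append([cur])     # energy jump: start a new segment
    return segments

# Silence then a louder stretch yields two segments of two frames each.
frames = divide_into_frames([0.0] * 8 + [1.0] * 8, frame_len=4)
print([len(seg) for seg in group_into_segments(frames, tol=0.5)])
```

Because frames are grouped as they arrive, a sketch like this can run incrementally on the input stream, consistent with the first aspect's goal of processing the speech signal as it arrives.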
Canon Kabushiki Kaisha
Fitzpatrick, Cella, Harper & Scinto
Opsasnick Michael N.
Apparatus and method for normalizing an input speech signal