Speech processing system using format analysis

Data processing: speech signal processing – linguistics – language – Speech signal processing – For storage or transmission

Reexamination Certificate

Details

C704S256000, C704S268000

active

06292775

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to a speech processing system, and more particularly to such a system which makes use of the resonant modes of the human vocal tract associated with speech sounds, these being known as the formant frequencies.
2. Discussion of Prior Art
Formant frequencies usually appear as peaks in the short-term spectrum of speech signals. For many years it has been recognised that they are closely related to the phonetic significance of the associated speech sounds. This relationship means that there are many applications in automatic processing of speech signals for which an effective method of formant frequency measurement would be useful, such as:
(a) Formant vocoders, i.e. devices for coding low-bit-rate speech transmissions;
(b) Visual display of formant frequency variation with time, to aid the deaf to interpret speech, or to assist in their speech training;
(c) Automatic authentication of identity from an individual's speech; and
(d) Speech signal analysis for input to an automatic speech recognition system.
The requirements of these applications could be met by determining the formant frequencies from a succession of spectral cross-sections at regular time intervals. In addition, it is also useful to determine the associated formant amplitudes because the phonetic quality of speech sounds depends on both. For some sounds (vowels in particular) the relative formant amplitudes are determined largely by the pattern of formant frequencies. However, the relative amplitudes for most consonants will be very different from those typical of vowels, and even for vowels they will vary with vocal effort and from speaker to speaker.
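As a rough illustration of taking spectral cross-sections at regular time intervals and reading candidate peak frequencies and amplitudes from each one, a minimal Python sketch follows. It is a naive peak-picking baseline and not the method of the invention. The frame length, hop size, Hann window, 1% amplitude floor, function names and the two-sinusoid test signal are all assumptions chosen for the example.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitude of a Hann-windowed frame, bins 0 .. N/2 - 1."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2)]


def pick_peaks(mags, sample_rate, frame_len, max_peaks=3, min_rel=0.01):
    """Strongest local maxima as (frequency_hz, amplitude), sorted by frequency."""
    floor = min_rel * max(mags)  # assumed threshold to ignore numerical noise
    peaks = [(k * sample_rate / frame_len, mags[k])
             for k in range(1, len(mags) - 1)
             if mags[k] > floor and mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]]
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:max_peaks]
    return sorted(strongest)


def spectral_cross_sections(signal, sample_rate, frame_len=256, hop=128):
    """One list of candidate peaks per frame, taken at regular time intervals."""
    return [pick_peaks(magnitude_spectrum(signal[start:start + frame_len]),
                       sample_rate, frame_len)
            for start in range(0, len(signal) - frame_len + 1, hop)]


# Hypothetical test signal: two steady sinusoids standing in for two formants.
SAMPLE_RATE = 8000
signal = [math.sin(2 * math.pi * 500 * t / SAMPLE_RATE)
          + 0.5 * math.sin(2 * math.pi * 1500 * t / SAMPLE_RATE)
          for t in range(512)]

cross_sections = spectral_cross_sections(signal, SAMPLE_RATE)
for index, peaks in enumerate(cross_sections):
    print(index, [(round(f), round(a, 1)) for f, a in peaks])
```

On this artificial signal each frame yields a peak at 500 Hz and a weaker one at 1500 Hz. Real speech is far less cooperative, which is why simple peak picking of this kind is unreliable there.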
Unfortunately, in spite of the usefulness of formant information, automatic formant-frequency measurement is notoriously difficult. The primary difficulty is that speech processing involves analysis of sounds of short duration to produce short-term spectral cross-sections, but the spectral peaks which define the formants are not necessarily clearly apparent in such a cross-section. The acoustic theory of speech production shows that under ideal conditions the human vocal tract has a series of resonant modes at an average frequency spacing of about 1 kHz, the actual frequencies of the resonances being determined by the precise positions of the jaw, tongue, lips and other articulators at any particular time. The fact that the formants are inherently associated with acoustic resonances of the human vocal system means that their frequencies will normally change smoothly with time as the articulatory organs move to produce different speech sounds.
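The figure of about 1 kHz between resonances can be checked against the standard textbook idealization of the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances are F_n = (2n - 1)c / (4L). The sketch below assumes a tube length of 17.5 cm and a speed of sound of 350 m/s; neither value comes from the text above, and the uniform tube is only one reading of "ideal conditions".

```python
def tube_resonances(length_m=0.175, speed_of_sound=350.0, count=4):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    length_m and speed_of_sound are assumed illustrative values.
    """
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, count + 1)]


formants = tube_resonances()
spacings = [upper - lower for lower, upper in zip(formants, formants[1:])]
print([round(f) for f in formants])
print([round(s) for s in spacings])
```

With these assumed values the resonances fall near 500, 1500, 2500 and 3500 Hz, a spacing of roughly 1 kHz. Movement of the articulators shifts the individual resonances away from this uniform pattern.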
The influence of the formant frequencies in determining the phonetic properties of speech almost entirely relates to only the lowest three of these resonances (usually referred to as F1, F2 and F3), and resonances above the third are of little importance. In fact resonances above F4 are often not detectable in speech signals because of bandwidth limitation. In the case of telephone bandwidth signals even F4 is often not present in the available signal.
There are many reasons why the elegant theory about speech production often does not yield a clear picture of the theoretical formants during real speech sounds. First, the theory treats the response of the vocal tract, and takes no account of the spectral properties of the sound sources which excite the tract. The main sound sources are air flow between the vibrating vocal folds, and turbulent noise caused by flow through a constriction in the vocal tract. Most of the time these sources have a spectral structure that is not likely to obscure the resonant pattern of the vocal tract response. The spectral trends of these sources as a function of frequency are either fairly flat (in the case of turbulent noise) or have a general decrease in intensity as frequency increases (in the case of flow between the vocal folds). However, in the latter case, particularly for some speakers, there will be occasions where the generally smooth spectral trend will be disturbed at some frequencies, sometimes with minor spectral peaks, but more usually with pronounced dips in the spectrum. If such a dip coincides with a vocal tract resonance, the expected spectral peak of that formant may be almost completely obscured.
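The effect of a source dip coinciding with a vocal tract resonance can be sketched numerically. The example below models one formant as a simple second-order resonance and the source dip as a second-order antiresonance at the same frequency, then compares how far the spectrum at the formant frequency stands above its surroundings with and without the dip. The resonance model, the 1500 Hz centre frequency, the 100 Hz and 120 Hz bandwidths, and the prominence measure are all assumptions made for illustration; the smooth overall source trend is left out.

```python
import math


def resonance_db(freq_hz, centre_hz, bandwidth_hz):
    """Level (dB) of a second-order resonance with unit gain at 0 Hz."""
    denom = math.sqrt((centre_hz ** 2 - freq_hz ** 2) ** 2
                      + (bandwidth_hz * freq_hz) ** 2)
    return 20.0 * math.log10(centre_hz ** 2 / denom)


def source_dip_db(freq_hz, centre_hz, bandwidth_hz):
    """Level (dB) of an antiresonance: the inverse of the resonance shape."""
    return -resonance_db(freq_hz, centre_hz, bandwidth_hz)


def peak_prominence_db(level_db, centre_hz, offset_hz=300.0):
    """Level at the centre minus the mean level offset_hz either side of it."""
    sides = (level_db(centre_hz - offset_hz) + level_db(centre_hz + offset_hz)) / 2.0
    return level_db(centre_hz) - sides


FORMANT_HZ = 1500.0  # assumed formant frequency


def tract_only(f):
    return resonance_db(f, FORMANT_HZ, 100.0)


def tract_with_dip(f):
    # Source dip assumed to sit exactly on the formant frequency.
    return tract_only(f) + source_dip_db(f, FORMANT_HZ, 120.0)


prominence_clean = peak_prominence_db(tract_only, FORMANT_HZ)
prominence_dipped = peak_prominence_db(tract_with_dip, FORMANT_HZ)
print(round(prominence_clean, 1), round(prominence_dipped, 1))
```

Under these assumptions the formant peak stands well over 10 dB above its surroundings when the source is smooth, but only a decibel or two when the dip coincides with it.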
The second reason for the difficulty of identifying formant peaks, particularly during some consonant sounds, is that there can be a severe constriction of the vocal tract at some intermediate point so that it is acoustically almost completely separated into two substantially independent sections. For these types of speech sound, the sound source is normally caused by air turbulence generated at the constriction. The sound radiated from the mouth in these circumstances is then influenced mainly by the resonant structure of the tract forward from the constriction, and the formants associated with the back cavity (notably F1) are so weakly excited that they are often not apparent at all in the radiated speech spectrum. In these cases F1 has no perceptual significance, but it is advantageous to associate other resonances with appropriate higher formant numbers from continuity considerations. The behaviour of formant frequencies as a function of time is described in terms of formant trajectories; each formant trajectory is a series of successive values of a respective individual formant frequency such as F1 as a function of time. There is therefore a set of three formant trajectories for the formant frequencies F1, F2 and F3. Continuity considerations imply continuity of formant trajectories across vowel/consonant boundaries.
Turbulence-excited consonant sounds have a further difficulty for formant analysis because during these sounds the glottis (the space between the vocal folds, in the larynx) is open wide, so causing more damping of the formant resonances because of coupling into the sub-glottal system (the bronchi and lungs).
The third difficulty of formant analysis applies specifically to high-pitched speakers for which the frequency of vibration of the vocal folds may be fairly high, perhaps 400 Hz or even higher. This high frequency yields harmonics for which the spacing may be larger than the spectral bandwidth of the formant resonances. Thus a formant peak may lie between two harmonics and therefore not be obvious, and spectral peaks caused by harmonics may be mistaken for formants.
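The harmonic-spacing problem can be made concrete with a small calculation. The sketch below finds the harmonics of the vocal fold frequency nearest to a formant and reports whether the formant falls between them, in the sense that no harmonic lies within half a formant bandwidth of its centre. The 600 Hz formant, the 80 Hz bandwidth and the half-bandwidth criterion are assumptions chosen for the example.

```python
def nearest_harmonic_distance(formant_hz, f0_hz):
    """Distance (Hz) from a formant to the closest harmonic of f0."""
    k = int(formant_hz // f0_hz)
    # Harmonic number 0 (0 Hz) is not a real harmonic, so it is excluded.
    candidates = [h for h in (k * f0_hz, (k + 1) * f0_hz) if h > 0]
    return min(abs(formant_hz - h) for h in candidates)


def formant_falls_between_harmonics(formant_hz, f0_hz, bandwidth_hz):
    """True when no harmonic lies within half a bandwidth of the formant centre."""
    return nearest_harmonic_distance(formant_hz, f0_hz) > bandwidth_hz / 2.0


# Assumed example values: a formant at 600 Hz with an 80 Hz bandwidth.
high_pitch = formant_falls_between_harmonics(600.0, 400.0, 80.0)
low_pitch = formant_falls_between_harmonics(600.0, 100.0, 80.0)
print(high_pitch, low_pitch)
```

At 400 Hz the nearest harmonics are 400 Hz and 800 Hz, each 200 Hz away from the formant, so the resonance is sampled only on its skirts. At 100 Hz a harmonic lands on the formant itself.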
The fourth difficulty of formant analysis applies to nasalized sounds. The basic speech production theory does not apply to these sounds, because it is based on the response of an unbranched acoustic tube. In the presence of nasalization (either nasal consonants or nasalized vowels) the soft palate is lowered, and the nasal cavities become coupled with the vocal tract. The acoustic system then has a side branch, which introduces a complicated set of additional resonances and antiresonances into the response of the system. In these cases the simple description of a speech signal in terms of the three most important formants no longer strictly applies. However, some of the resonances of the vocal tract with nasal coupling are more prominent than others, and it is often possible to trace temporal continuity of these resonances into adjacent periods when nasalization is absent. It can therefore still be useful to describe nasal sounds also in terms of F1, F2 and F3. Although the three-formant concept is still useful for nasal sounds, the more complicated acoustic system usually causes the resonances to be less prominent than in non-nasal sounds. It is thus often extremely difficult to decide, when looking at a spectral cross-section, what the formant frequencies should be.
Determination of the formant frequencies of speech sounds, particularly as features to use in automatic speech recognition, has been described by M. J. Hunt in “Delayed decisions in speech recognition—the case of formants”, Pattern Recognition Letters 6, 1987, pp. 121-137. Here initial speech signal processing was
