Multiple voice tracking system and method

Data processing: speech signal processing – linguistics – language – Speech signal processing – For storage or transmission

Reexamination Certificate


Details

C704S202000, C704S207000, C704S206000, C381S094300

Reexamination Certificate

active

06453284

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention relates to a system and method for tracking individual voices in a group of voices through time, so that the spoken message of an individual may be selected and extracted from the sounds of competing talkers' voices.
When listeners (whether human or machine) attempt to identify a single talker's speech sounds that are embedded in a mixture of sounds spoken by other talkers, it is often very difficult to identify the specific sounds produced by the target talker. In this instance, the signal that the listener is trying to identify and the “noise” the listener is trying to ignore have very similar spectral and temporal properties. Thus, simple filtering techniques cannot remove only the unwanted noise without also removing the intended signal.
Examples of situations where this poses a significant problem include operation of voice recognition software and hearing aids in noisy environments where multiple voices are present. Both hearing-impaired human listeners and machine speech recognition systems exhibit considerable speech identification difficulty in this type of multi-talker environment. Unfortunately, the only way to improve speech understanding performance for these listeners is to identify the talker of interest and isolate just this voice from the mixture of competing voices. For stationary sounds, this may be possible. However, fluent speech exhibits rapid changes over relatively short time periods. To separate a single talker's voice from the background mixture, there must therefore exist a mechanism that tracks each individual voice through time so that the unique sounds and properties of that voice may be reconstructed and presented to the listener. While several models and mechanisms for speech extraction are currently available, none of these systems specifically attempts to put together the speech sounds of each individual talker as they occur through time.
SUMMARY OF THE INVENTION
To solve the foregoing problem, the present invention provides a system and method for tracking each of the individual voices in a multi-talker environment so that any of the individual voices may be selected for additional processing. The solution that has been developed is to estimate the fundamental frequency of each of the voices present using a conventional analysis method, and then to follow the trajectory of each individual voice through time using a neural network prediction technique. The result of this method is a time-series prediction model that is capable of tracking multiple voices through time, even if the pitch trajectories of the voices cross over one another, or appear to merge and then diverge.
In a preferred embodiment of the invention, the acoustic speech waveform comprised of the multiple voices to be identified is first analyzed to identify and estimate the fundamental frequency of each voice present in the waveform. Although this analysis can be carried out using a frequency domain technique, such as a Fast Fourier Transform (FFT), it is preferable to use a time domain technique to increase processing speed, and to decrease the complexity and cost of the hardware or software employed to implement the invention. More preferably, the waveform is submitted to an average magnitude difference function (AMDF) calculation, which subtracts successive time-shifted segments of the waveform from the waveform itself. As a person speaks, the amplitude of their voice oscillates at a fundamental frequency. Because the AMDF calculation is subtractive, the pitch period of a particular voice will produce a small value at the lag corresponding to the fundamental frequency F0 of the voice, since the AMDF at that point is effectively subtracting a value from itself. After the AMDF is calculated, the F0 of each voice present can then be estimated as the inverse of the period at each AMDF minimum.
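The patent does not give an implementation; a minimal single-voice sketch of the AMDF step in Python (function and parameter names are illustrative, not from the patent) might look like this:

```python
import numpy as np

def amdf(frame, min_lag, max_lag):
    """Average magnitude difference: subtract each time-shifted copy of the
    frame from the frame itself and average the absolute residual per lag."""
    n = len(frame)
    lags = np.arange(min_lag, max_lag)
    d = np.array([np.mean(np.abs(frame[: n - k] - frame[k:])) for k in lags])
    return d, lags

def estimate_f0(frame, fs, f0_lo=70.0, f0_hi=400.0):
    """Estimate F0 as the inverse of the period at the AMDF minimum."""
    d, lags = amdf(frame, int(fs / f0_hi), int(fs / f0_lo))
    best_lag = lags[np.argmin(d)]   # lag (samples) where the frame best matches itself
    return fs / best_lag            # period in samples -> frequency in Hz

# A synthetic 200 Hz tone should yield an AMDF minimum at a 40-sample lag
# (8000 / 200), i.e. an estimate of 200 Hz.
fs = 8000
t = np.arange(int(0.04 * fs)) / fs  # 40 ms analysis window
f0 = estimate_f0(np.sin(2 * np.pi * 200 * t), fs)
```

Extending this to multiple simultaneous voices would require picking several distinct AMDF minima per frame rather than the single global minimum.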
Once the fundamental frequencies of the individual voices have been identified and estimated, the next step implemented by the system is to track the voices through time. This would be a simple matter if each voice were of a constant pitch; however, the pitch of an individual's voice changes slowly over time as they speak. In addition, when multiple people are speaking simultaneously, it is quite common for the pitches of their voices to cross over each other in frequency as one person's voice pitch is rising while another's is falling. This makes it extremely difficult to track the individual voices accurately.
To solve this problem, the present invention tracks the voices through use of a recursive neural network that predicts how each voice's pitch will change in the future, based on past behavior. The recursive neural network predicts the F0 value for each voice at the next windowed segment. Because the predicted values are constrained by the frequency values of prior analysis frames, the F0 tracks tend to change smoothly, with no abrupt discontinuities in the trajectories. This follows what is normally observed with natural speech: the F0 contours of natural speech do not change abruptly, but vary smoothly over time. In this manner, the neural network predicts the next time value of the F0 for each talker's F0 track.
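The patent specifies a recursive neural network for this prediction step; as a stand-in that preserves the same property (the next value is constrained by prior frames, so tracks evolve smoothly), a simple linear extrapolation over recent frames can be sketched as follows. All names here are hypothetical:

```python
import numpy as np

def predict_next_f0(history, order=2):
    """Predict one voice's F0 at the next analysis frame from its recent track.

    Stand-in for the patent's recursive neural network: fit a line to the
    last `order`+1 frames and evaluate it one frame ahead, so the prediction
    continues the observed trajectory without abrupt discontinuities.
    """
    recent = np.asarray(history[-(order + 1):], dtype=float)
    if len(recent) < 2:
        return float(recent[-1])     # not enough history: predict no change
    t = np.arange(len(recent))
    slope, intercept = np.polyfit(t, recent, 1)
    return slope * len(recent) + intercept

track = [200.0, 202.0, 204.0]        # a rising pitch contour, in Hz
pred = predict_next_f0(track)        # continues the smooth upward trajectory
```

A trained recurrent network would play the same role but could capture nonlinear pitch dynamics that a linear fit cannot.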
The output from the neural network thus comprises tracking information for each of the voices present in the analyzed waveform. This information can either be stored for future analysis, or used directly in real time by any suitable type of voice filtering or separating system for selective processing of the individual speech signals. For example, the system can be implemented in a digital signal processing chip within a hearing aid for selective amplification of an individual's voice. Although the neural network output can be used directly for tracking of the individual voices, the system can also use the AMDF calculation circuit to estimate the F0 for each of the voices, and then use the neural network output to assign each of the AMDF-estimated F0's to the correct voice.
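The patent does not detail how the assignment is performed; one plausible sketch is to match each frame's AMDF-estimated F0's to the per-voice predictions by nearest frequency, resolving the closest pairings first. Function names and the greedy strategy are illustrative assumptions:

```python
import numpy as np

def assign_f0s(predicted, measured):
    """Match AMDF-estimated F0s to voice tracks via the predicted values.

    `predicted` holds each track's predicted next F0; `measured` holds the
    unordered F0 estimates from the current frame's AMDF minima. Returns, for
    each track, the index of the measurement assigned to it. Greedy
    nearest-frequency matching is an illustrative choice, not the patent's.
    """
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    assignment = [-1] * len(predicted)
    free = set(range(len(measured)))
    # Resolve the smallest prediction-to-measurement distances first.
    pairs = sorted((abs(p - m), i, j)
                   for i, p in enumerate(predicted)
                   for j, m in enumerate(measured))
    for _, i, j in pairs:
        if assignment[i] == -1 and j in free:
            assignment[i] = j
            free.discard(j)
    return assignment

# Two crossing voices: the predictions disambiguate which estimate is whose.
pred = [210.0, 190.0]            # track A predicted rising, track B falling
meas = [191.0, 209.0]            # this frame's AMDF estimates, unordered
idx = assign_f0s(pred, meas)     # track A gets 209 Hz, track B gets 191 Hz
```

For many simultaneous voices, an optimal assignment (e.g. the Hungarian algorithm) would avoid the occasional mismatches a greedy pass can make.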


REFERENCES:
patent: 4292469 (1981-09-01), Scott et al.
patent: 4424415 (1984-01-01), Lin
patent: 4922538 (1990-05-01), Tchorzewski
patent: 5093855 (1992-03-01), Vollert et al.
patent: 5175793 (1992-12-01), Sakamoto et al.
patent: 5181256 (1993-01-01), Kamiya
patent: 5182765 (1993-01-01), Ishii et al.
patent: 5384833 (1995-01-01), Cameron
patent: 5394475 (1995-02-01), Ribic
patent: 5404422 (1995-04-01), Sakamoto et al.
patent: 5475759 (1995-12-01), Engebretson
patent: 5521635 (1996-05-01), Mitsuhashi et al.
patent: 5539806 (1996-07-01), Allen et al.
patent: 5581620 (1996-12-01), Brandstein et al.
patent: 5604812 (1997-02-01), Meyer
patent: 5636285 (1997-06-01), Sauer
patent: 5712437 (1998-01-01), Kageyama
patent: 5737716 (1998-04-01), Bergstrom et al.
patent: 5764779 (1998-06-01), Haranishi
patent: 5809462 (1998-09-01), Nussman
patent: 5812970 (1998-09-01), Chan et al.
patent: 5838806 (1998-11-01), Sigwanz et al.
patent: 5864807 (1999-01-01), Campbell et al.
patent: 6006175 (1999-12-01), Holzrichter
patent: 6130949 (2000-10-01), Aoki et al.
patent: 6192134 (2001-02-01), White et al.
