Reexamination Certificate
1999-08-30
2003-09-09
Banks-Harold, Marsha D. (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
For storage or transmission
C704S220000, C704S221000
active
06618699
ABSTRACT:
FIELD OF THE INVENTION
The invention relates generally to the field of speech signal processing, and more particularly, concerns formant tracking based on phoneme information in speech analysis.
BACKGROUND OF THE INVENTION
Various speech analysis methods are available in the field of speech signal processing. A particular method in the art is to analyze the spectrograms of particular segments of input speech. The spectrogram of a speech signal is a two-dimensional representation (time vs. frequency), where the color or darkness of each point indicates the amplitude of the corresponding frequency component. At a given time point, a cross section of the spectrogram along the frequency axis (the spectrum) generally has a profile that is characteristic of the sound in question. In particular, each voiced sound, such as a vowel or vowel-like sound, has characteristic frequency values for several spectral peaks in the spectrum. For example, the vowel in the word “beak” is signified by spectral peaks at around 200 Hz and 2300 Hz. The spectral peaks are called the formants of the vowel and the corresponding frequency values are called the formant frequencies of the vowel.
A “phoneme” corresponds to the smallest unit of speech sound that serves to distinguish one utterance from another. For instance, in the English language, the phoneme /i/ corresponds to the sound for the “ea” in “beat.” It is widely accepted that the first two or three formant frequencies characterize the corresponding phoneme of the speech segment.
A “formant trajectory” is the variation or path of particular formant frequencies as a function of time. When the formant frequencies are plotted as a function of time, their formant trajectories usually change smoothly inside phonemes corresponding to a vowel sound or between phonemes corresponding to such vowel sounds. This data is useful for applications such as text-to-speech generation (“TTS”), where formant trajectories are used to determine the best speech fragments to assemble together to produce speech from text input.
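The idea of reading formants off a single cross section of the spectrogram can be illustrated with a minimal sketch. It is not taken from the patent: the function names, the Hann window, the naive DFT, and the relative peak threshold are all assumptions chosen for brevity, and the input is a synthetic frame made of two sinusoids at the 200 Hz and 2300 Hz values mentioned above, standing in for a real voiced frame.

```python
import cmath
import math


def magnitude_spectrum(frame, sample_rate):
    """Return (frequency_hz, magnitude) pairs for one Hann-windowed frame.

    Uses a naive DFT over the non-negative frequencies; adequate for a
    short illustrative frame, not for production use.
    """
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    windowed = [s * w for s, w in zip(frame, window)]
    spectrum = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append((k * sample_rate / n, abs(acc)))
    return spectrum


def spectral_peaks(spectrum, rel_threshold=0.1):
    """Return frequencies of local maxima above rel_threshold * global max.

    rel_threshold is a hypothetical tuning parameter, not from the source.
    """
    floor = rel_threshold * max(m for _, m in spectrum)
    peaks = []
    for k in range(1, len(spectrum) - 1):
        freq, mag = spectrum[k]
        if mag >= floor and mag > spectrum[k - 1][1] and mag >= spectrum[k + 1][1]:
            peaks.append(freq)
    return peaks


# Synthetic stand-in for a voiced frame: two sinusoids at 200 Hz and 2300 Hz.
SAMPLE_RATE = 8000
FRAME = [math.sin(2 * math.pi * 200 * i / SAMPLE_RATE)
         + 0.5 * math.sin(2 * math.pi * 2300 * i / SAMPLE_RATE)
         for i in range(256)]
PEAKS = spectral_peaks(magnitude_spectrum(FRAME, SAMPLE_RATE))
```

With a 256-sample frame at 8000 Hz the frequency resolution is 31.25 Hz per bin, so the detected peaks can only approximate the true component frequencies to within about one bin.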
FIG. 1 is a diagram illustrating a conventional formant tracking method in which input speech 102 is first processed to generate formant trajectories for subsequent use in applications such as TTS. First, a spectral analysis is performed on input speech 102 (Step 104) using techniques such as linear predictive coding (LPC) to extract formant candidates 106 by solving for the roots of a linear prediction polynomial. A candidate selection process 108 is then used to choose which of the possible formant candidates are the best to save as the final formant trajectories 110. Candidate selection 108 is based on various criteria, such as formant frequency continuity.
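The step from polynomial roots to formant candidates can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular system: it assumes the complex roots of the linear prediction polynomial have already been found by a numerical root solver, and it uses the standard relations between a pole's angle and radius and the corresponding resonance frequency and bandwidth. The function name and the `max_bandwidth` cutoff are hypothetical.

```python
import cmath
import math


def roots_to_formant_candidates(roots, sample_rate, max_bandwidth=400.0):
    """Convert LPC polynomial roots into (frequency_hz, bandwidth_hz) pairs.

    For a pole z = r * exp(j * theta):
        frequency = theta * sample_rate / (2 * pi)
        bandwidth = -ln(r) * sample_rate / pi
    Only roots with a positive imaginary part are kept, so each
    complex-conjugate pair yields one candidate and real roots are skipped.
    max_bandwidth is an assumed cutoff for discarding very broad resonances.
    """
    candidates = []
    for z in roots:
        if z.imag <= 0:
            continue
        radius = abs(z)
        if radius >= 1.0:
            continue  # not a stable resonance; skip
        frequency = cmath.phase(z) * sample_rate / (2 * math.pi)
        bandwidth = -math.log(radius) * sample_rate / math.pi
        if bandwidth <= max_bandwidth:
            candidates.append((frequency, bandwidth))
    return sorted(candidates)


def make_pole(frequency, bandwidth, sample_rate):
    """Build a pole with the given resonance frequency and bandwidth (for demo)."""
    radius = math.exp(-math.pi * bandwidth / sample_rate)
    return cmath.rect(radius, 2 * math.pi * frequency / sample_rate)


# Demo: one narrow resonance at 500 Hz and one very broad one at 1500 Hz.
FS = 10000
narrow = make_pole(500.0, 80.0, FS)
broad = make_pole(1500.0, 1500.0, FS)
CANDIDATES = roots_to_formant_candidates(
    [narrow, narrow.conjugate(), broad, broad.conjugate()], FS)
```

In this demo only the narrow 500 Hz resonance survives the bandwidth cutoff. A real tracker would typically pass several such candidates per frame to the selection stage.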
Regardless of the particular criteria, conventional selection processes operate without reference to text data associated with the input speech. Only after candidate selection is complete are the final formant trajectories 110 correlated with input text 112 and processed (formant data processing step 114) to generate, e.g., an acoustic database that contains the processed results associating the final formant data with text phoneme information for later use in another application, such as TTS or voice recognition.
Conventional formant tracking techniques are prone to tracking errors and are not sufficiently reliable for unsupervised and automatic usage. Thus, human supervision is needed to monitor the tracking performance of the system by viewing the formant tracks in a larger time context with the aid of a spectrogram. Nonetheless, when only limited information is provided, even human-supervised systems can be as unreliable as conventional automatic formant tracking.
Accordingly, it would be advantageous to provide an improved formant tracking method that significantly reduces tracking errors and can operate reliably without the need for human intervention.
SUMMARY OF THE INVENTION
The invention provides an improved formant tracking method and system for selecting formant trajectories by making use of information derived from the text data that corresponds to the processed speech before final formant trajectories are selected. According to the invention, the input speech is analyzed in a plurality of time frames to obtain formant candidates for each time frame. The text data corresponding to the input speech is converted into a sequence of phonemes. The input speech is segmented by putting in temporal boundaries. The sequence of phonemes is aligned with a corresponding segment of the input speech. Predefined nominal formant frequencies are then assigned to a center point of each phoneme and this data is interpolated to provide target formant trajectories for each time frame. For each time frame, the formant candidates are compared with the target formant trajectories and candidates are selected according to one or more cost factors. The selected formant candidates are then output for storage or further processing in subsequent speech applications.
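The last steps of the summary, interpolating nominal formant frequencies into per-frame targets and then choosing candidates by cost, can be sketched as below. This is an illustrative reading only. The summary does not specify the interpolation scheme or the cost factors, so linear interpolation and a single absolute-deviation cost are assumptions, as are all function names and the example values (the first nominal pair reuses the 200 Hz and 2300 Hz figures from the background; the second pair is made up).

```python
from bisect import bisect_right


def interpolate_targets(center_times, nominal_formants, frame_times):
    """Linearly interpolate nominal formants placed at phoneme center points.

    center_times: sorted center time (s) of each phoneme.
    nominal_formants: one tuple of nominal formant frequencies per phoneme.
    Frames before the first or after the last center hold the edge value.
    """
    targets = []
    for t in frame_times:
        i = bisect_right(center_times, t)
        if i == 0:
            targets.append(tuple(nominal_formants[0]))
        elif i == len(center_times):
            targets.append(tuple(nominal_formants[-1]))
        else:
            t0, t1 = center_times[i - 1], center_times[i]
            w = (t - t0) / (t1 - t0)
            left, right = nominal_formants[i - 1], nominal_formants[i]
            targets.append(tuple((1 - w) * a + w * b for a, b in zip(left, right)))
    return targets


def select_candidates(candidates_per_frame, targets):
    """For each frame and each target formant, pick the closest candidate.

    The cost here is only |candidate - target|; a fuller system could add
    further cost factors (for example, frame-to-frame continuity).
    """
    selected = []
    for candidates, target in zip(candidates_per_frame, targets):
        selected.append(tuple(min(candidates, key=lambda c: abs(c - f))
                              for f in target))
    return selected


# Two phonemes with hypothetical nominal (F1, F2) values, three frames.
CENTERS = [0.05, 0.15]
NOMINAL = [(200.0, 2300.0), (700.0, 1200.0)]
FRAMES = [0.05, 0.10, 0.15]
CANDIDATE_FREQS = [
    [250.0, 900.0, 2250.0],
    [480.0, 1700.0, 3000.0],
    [650.0, 1250.0, 2500.0],
]
TARGETS = interpolate_targets(CENTERS, NOMINAL, FRAMES)
SELECTED = select_candidates(CANDIDATE_FREQS, TARGETS)
```

Because each formant is matched independently, this sketch does not prevent two formants from choosing the same candidate or enforce smoothness across frames. Those are exactly the kinds of constraints that additional cost factors would address.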
REFERENCES:
patent: 4424415 (1984-01-01), Lin
patent: 5204905 (1993-04-01), Mitome
patent: 5751907 (1998-05-01), Moebius et al.
patent: 2001/0021904 (2001-09-01), Plumpe
Hunt, “A Robust Formant-Based Speech Spectrum Comparison Measure,” Proceedings of ICASSP, pp. 1117-1120, 1985, vol. 3.*
Laprie et al., “A new paradigm for reliable automatic formant tracking,” Proceedings of ICASSP, pp. 19-22, Apr. 1994, vol. 2.*
Rabiner, “Fundamentals of Speech Recognition,” Prentice Hall, 1993, pp. 95-97.*
Schmid, “Explicit N-Best Formant Features for Segment-Based Speech Recognition,” a dissertation submitted to the Oregon Graduate Institute of Science & Technology, Oct. 1996.*
Sun, “Robust Estimation of Spectral Center-of-Gravity Trajectories Using Mixture Spline Models,” Proceedings of the 4th European Conference on Speech Communication and Technology Madrid, Spain, pp. 749-752, 1995.*
Lee, Minkyu et al., “Formant Tracking Using Segmental Phonemic Information”, Presentation given at Eurospeech '99, Budapest, Hungary, Sep. 9, 1999.
Lee Minkyu
Moebius Bernd
Olive Joseph Philip
Van Santen Jan Pieter
Banks-Harold Marsha D.
Harper V Paul
Lucent Technologies - Inc.