Title: System and method for segmentation and recognition of speech...
Type: Reexamination Certificate
Filed: 1999-01-04
Issued: 2001-08-21
Examiner: Dorvil, Richemond (Department: 2741)
Classification: Data processing: speech signal processing, linguistics, language – Speech signal processing – Recognition
U.S. Classes: C704S254000, C704S253000
Status: active
Patent Number: 06278972
ABSTRACT:
BACKGROUND OF THE INVENTION
I. Field of the Invention
The present invention relates generally to speech recognition. More particularly, the present invention relates to a system and method for segmentation of speech signals for purposes of speech recognition.
II. Description of the Related Art
Pattern recognition techniques have been widely used in speech recognition. The basic idea of this technique is to compare the input speech pattern with a set of templates, each of which represents a pre-recorded speech pattern from a vocabulary. The recognition result is the word in the vocabulary associated with the template whose speech pattern is most similar to the input speech pattern.
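A minimal sketch of this template-matching idea follows (the function name, the whole-pattern Euclidean distance, and the assumption that the input and every template have already been normalized to the same length are illustrative simplifications; practical systems typically score with dynamic time warping or hidden Markov models):

```python
import numpy as np

def recognize(input_pattern, templates):
    """Return the vocabulary word whose template is closest to the input.

    input_pattern: (T, D) array of spectral frames.
    templates: dict mapping word -> (T, D) array of pre-recorded frames.
    Assumes all patterns have been normalized to the same length T.
    """
    best_word, best_dist = None, float("inf")
    for word, template in templates.items():
        # Frame-by-frame Euclidean distance between the two patterns.
        dist = np.linalg.norm(input_pattern - template)
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word
```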
For human beings, it is usually not necessary to hear all the detail in an utterance (e.g., a word) in order to recognize the utterance. This fact shows that there are some natural redundancies inherent in speech. Many techniques have been developed to recognize speech by taking advantage of such redundancies. For example, U.S. Pat. No. 5,056,150 to Yu et al. discloses a real-time speech recognition system wherein a nonlinear time-normalization method is used to normalize a speech pattern to a predetermined length by only keeping spectra with significant time-dynamic attributes. Using this method, the speech pattern is compressed significantly, although it may occasionally keep the same spectrum repeatedly.
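One plausible reading of such a compression step is sketched below; this is not the exact method of U.S. Pat. No. 5,056,150, and ranking frames by their frame-to-frame spectral change is an assumption made for illustration:

```python
import numpy as np

def compress_pattern(frames, target_len):
    """Keep the target_len frames with the largest spectral change.

    frames: (T, D) array of spectral frames, with T >= target_len.
    Returns a (target_len, D) array in original time order.
    """
    # Spectral change of each frame relative to its predecessor
    # (the first frame is always treated as significant).
    deltas = np.linalg.norm(np.diff(frames, axis=0), axis=1)
    deltas = np.concatenate(([np.inf], deltas))
    keep = np.sort(np.argsort(-deltas)[:target_len])
    return frames[keep]
```

Because only the most dynamic frames survive, steady-state stretches of the utterance collapse, which is what allows the pattern to be compressed to a fixed length.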
Another technique for speech recognition employs a sequence of acoustic segments, which represent a sequence of spectral frames. The segments are the basic speech units upon which speech recognition is based. One procedure for generating the acoustic segments, or performing segmentation, is to search for the most probable discontinuity points in the spectral sequence using a dynamic programming method. These selected points are used as the segment boundaries. See J. Cohen, “Segmenting Speech Using Dynamic Programming,” J. Acoustic Soc. of America, May 1981, vol. 69(5), pp. 1430-1438. This technique, like the technique of U.S. Pat. No. 5,056,150 described above, is based on searching for significant time-dynamic attributes in the speech pattern.
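A generic dynamic-programming segmentation of this flavor is sketched below, under the assumption that the “most probable discontinuity points” are the boundaries that minimize total within-segment spectral distortion; Cohen's exact cost function may differ:

```python
import numpy as np

def dp_segment(frames, num_segments):
    """Place segment boundaries by dynamic programming.

    Chooses boundaries that minimize total within-segment distortion
    (sum of squared deviations from each segment's mean spectrum).
    Returns a list of (start, end) frame-index pairs covering all frames.
    """
    T = len(frames)
    # cost[i][j] = distortion of the candidate segment frames[i:j]
    cost = np.full((T + 1, T + 1), np.inf)
    for i in range(T):
        for j in range(i + 1, T + 1):
            seg = frames[i:j]
            cost[i][j] = ((seg - seg.mean(axis=0)) ** 2).sum()
    # best[k][j] = minimal cost of splitting frames[:j] into k segments
    best = np.full((num_segments + 1, T + 1), np.inf)
    back = np.zeros((num_segments + 1, T + 1), dtype=int)
    best[0][0] = 0.0
    for k in range(1, num_segments + 1):
        for j in range(k, T + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost[i][j]
                if c < best[k][j]:
                    best[k][j], back[k][j] = c, i
    # Recover the boundaries by walking back through the table.
    bounds, j = [], T
    for k in range(num_segments, 0, -1):
        i = back[k][j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]
```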
Another technique used to segment speech is based on the segmental K-means training procedure. See L. R. Rabiner et al., “A Segmental K-means Training Procedure for Connected Word Recognition,” AT&T Technical Journal, May/June 1986, vol. 65(3), pp. 21-31. Using an iterative training procedure, an utterance is segmented into words or subword units, and each of these units is then used as a speech template in a speech recognition system. The iterative training procedure requires many steps of computation and therefore cannot be implemented in real time.
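A simplified sketch of a segmental K-means style loop is given below (uniform initial boundaries, mean-spectrum templates, and a per-frame Euclidean cost are assumptions made for illustration, not Rabiner's full procedure); even this stripped-down version alternates a full dynamic-programming alignment with a template re-estimation pass, which illustrates why the procedure is computationally heavy:

```python
import numpy as np

def segmental_kmeans(frames, num_words, num_iters=10):
    """Alternate between segmenting an utterance into contiguous
    word-sized pieces and re-estimating one template per piece.

    frames: (T, D) array of spectral frames, with T well above num_words.
    Returns the list of (start, end) segments and their mean-spectrum templates.
    """
    T = len(frames)
    # Start from a uniform segmentation of the utterance.
    bounds = np.linspace(0, T, num_words + 1).astype(int)
    for _ in range(num_iters):
        templates = [frames[bounds[k]:bounds[k + 1]].mean(axis=0)
                     for k in range(num_words)]
        # Re-segment: dynamic programming that picks, for each unit,
        # the boundaries minimizing total distance to the templates.
        best = np.full((num_words + 1, T + 1), np.inf)
        back = np.zeros((num_words + 1, T + 1), dtype=int)
        best[0][0] = 0.0
        for k in range(1, num_words + 1):
            for j in range(k, T + 1):
                for i in range(k - 1, j):
                    seg_cost = np.linalg.norm(
                        frames[i:j] - templates[k - 1], axis=1).sum()
                    c = best[k - 1][i] + seg_cost
                    if c < best[k][j]:
                        best[k][j], back[k][j] = c, i
        new_bounds, j = [T], T
        for k in range(num_words, 0, -1):
            j = back[k][j]
            new_bounds.append(j)
        bounds = np.array(new_bounds[::-1])
    templates = [frames[bounds[k]:bounds[k + 1]].mean(axis=0)
                 for k in range(num_words)]
    return list(zip(bounds[:-1], bounds[1:])), templates
```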
These problems and deficiencies are recognized and solved by the present invention in the manner described below.
SUMMARY OF THE INVENTION
The present invention is directed to a system and method for forming a segmented speech signal from an input speech signal having a plurality of frames. The segmented speech signal provides a template upon which speech recognition is based.

First, the input speech signal is converted to a frequency domain signal having a plurality of speech frames, wherein each speech frame of the frequency domain signal is represented by at least one, but usually multiple, spectral values associated with the speech frame. The spectral values are generally chosen to encapsulate the acoustic content of the speech frame. A spectral difference value is then determined for each pair of adjacent frames of the frequency domain signal. The spectral difference value represents a difference between the spectral values for the pair of adjacent frames and is indicative of the time-dynamic attributes between the frames.

An initial cluster boundary is set between each pair of adjacent frames in the frequency domain signal, and a variance value is assigned to each single-frame cluster in the frequency domain signal, wherein the variance value for each single-frame cluster is equal to the corresponding spectral difference value. Next, a cluster merge parameter is calculated for each pair of adjacent clusters, based on the spectral difference values of the adjacent clusters. A minimum cluster merge parameter is selected from the plurality of cluster merge parameters; the minimum merge parameter is indicative of the most insignificant time-dynamic attribute.

A merged cluster is then formed by canceling the cluster boundary between the clusters associated with the minimum merge parameter and assigning a merged variance value to the merged cluster, wherein the merged variance value is representative of the variance values assigned to the clusters associated with the minimum merge parameter. The process is repeated in order to form a plurality of merged clusters, and the segmented speech signal is formed in accordance with the plurality of merged clusters.
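The following sketch walks through the steps of the summary; the exact merge-parameter and merged-variance formulas are not specified above, so the sums used here are assumptions, as is the use of each cluster's mean spectrum to form the final template:

```python
import numpy as np

def segment_frames(spectra, num_segments):
    """Greedy pairwise cluster merging, following the summary above.

    spectra: (N, D) array, one row of spectral values per speech frame.
    num_segments: number of clusters to keep in the segmented signal.

    Assumptions: the merge parameter of two adjacent clusters is taken
    as the sum of their variance values, and the merged cluster
    inherits that sum as its own variance.
    """
    # Spectral difference value between each pair of adjacent frames.
    diffs = np.linalg.norm(np.diff(spectra, axis=0), axis=1)

    # Start with one single-frame cluster per frame; each cluster keeps
    # its frame indices and a variance seeded from the spectral
    # difference to its neighbor (the last frame reuses the final
    # difference value -- another assumption).
    clusters = [{"frames": [i], "var": diffs[min(i, len(diffs) - 1)]}
                for i in range(len(spectra))]

    while len(clusters) > num_segments:
        # Cluster merge parameter for every pair of adjacent clusters.
        params = [clusters[i]["var"] + clusters[i + 1]["var"]
                  for i in range(len(clusters) - 1)]
        i = int(np.argmin(params))          # most insignificant boundary
        merged = {"frames": clusters[i]["frames"] + clusters[i + 1]["frames"],
                  "var": params[i]}
        clusters[i:i + 2] = [merged]        # cancel the cluster boundary

    # Represent each cluster by its mean spectrum to form the template.
    return np.array([spectra[c["frames"]].mean(axis=0) for c in clusters])
```

Each pass removes the boundary carrying the least time-dynamic information, so the clusters that survive correspond to acoustically homogeneous stretches of the utterance.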
REFERENCES:
patent: 4813074 (1989-03-01), Marcus
patent: 5056150 (1991-10-01), Yu et al.
patent: 5832425 (1998-11-01), Mead
patent: 0831455 (1998-03-01), None
Zue, V. et al., “Acoustic Segmentation and Phonetic Classification in the SUMMIT System,” IEEE, 1989, pp. 389-392.
Glass, J. R. et al., “Multi-Level Acoustic Segmentation of Continuous Speech,” IEEE, Apr. 1988, pp. 429-432.
Wilcox, L. et al., “Segmentation of Speech Using Speaker Identification,” IEEE, Apr. 1994, pp. 161-164.
Rabiner, L. R., Fundamentals of Speech Recognition, Prentice Hall, 1993, pp. 242-250, 263.
Cohen, J., “Segmenting Speech Using Dynamic Programming,” J. Acoustic Soc. of America, vol. 69(5), 1981, pp. 1430-1438.
Rabiner, L. R. et al., “A Segmental K-Means Training Procedure for Connected Word Recognition,” AT&T Technical Journal, vol. 65(3), 1986, pp. 21-31.
Pauws et al., “A Hierarchical Method of Automatic Speech Segmentation for Synthesis Applications,” Speech Communications 19:207-220 (1996).
Inventors: Ning Bi; Chienchung Chang
Assignee: Qualcomm Incorporated
Attorneys/Agents: Kent D. Baker; Thomas R. Rouse; Philip R. Wadsworth