Reexamination Certificate
1998-09-08
2001-07-03
Korzuch, William R. (Department: 2741)
Data processing: speech signal processing, linguistics, language
Speech signal processing
For storage or transmission
C704S229000, C704S240000, C704S243000
active
06256607
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention is related to the field of efficient numerical encoding of physical data for use in an automatic recognition system. A particular application is the field of speech encoding for storage or transmission and for recognition. The invention addresses problems of efficient numerical encoding of physically derived data and efficient computation of likelihood scores during automatic recognition.
A high level of detailed technical and mathematical skill is common among practitioners in the art. This application presumes familiarity with known techniques of speech recognition and related techniques of numerically encoding physical data, including physical waveform data. This application briefly reviews some basic types of prior art encoding and recognition schemes in order to make the description of the invention understandable. This review should not be seen as comprehensive, and the reader is referred to the references cited herein as well as to other prior art documents. This review also should not be seen as limiting the invention to the particular examples and techniques described herein, and in no case should the invention be limited except as described in the attached claims and all allowable equivalents.
Two earlier co-assigned U.S. applications, 08/276,742, now U.S. Pat. No. 5,825,978, issued Oct. 20, 1998, entitled METHOD AND APPARATUS FOR SPEECH RECOGNITION USING OPTIMIZED PARTIAL MIXTURE TYING (287-41), and 08/375,908, now U.S. Pat. No. 5,864,810, issued Jan. 26, 1999, entitled METHOD AND APPARATUS FOR ADAPTING A SPEECH RECOGNIZER TO A PARTICULAR SPEAKER (287-40), discuss techniques useful in speech encoding and recognition and are fully incorporated herein by reference.
For purposes of clarity, this discussion refers to devices, concepts, and methods in terms of specific examples. However, the method and apparatus of the present invention may operate with a wide variety of types of digital devices including devices different from the specific examples described below. It is therefore not intended that the invention be limited except as provided in the attached claims.
Basics of Encoding and Recognition
FIGS. 1A and 1B illustrate a basic process for encoding physical data, such as speech waveform data, into numerical values and then performing vector quantization (VQ) on those values. A physical signal 2 is sampled at some interval. For speech data, the interval is generally defined by a unit of time t, which in an example system is 10 milliseconds (ms). A signal processor 5 receives the physical data and generates a set of numerical values representing that data. In some known speech recognition systems, a cepstral analysis is performed and an observed vector Xt (10) consisting of a set of cepstral values (C1 to C13) is generated for each interval of time t. In one system, each of the 13 values is a real number and may be represented in a digital computer as 32 bits. Thus, in this specific example, each 10 ms interval of speech (sometimes referred to as a frame) is encoded as an observed vector X of thirteen 32-bit values or 416 bits of data. Other types of signal processing are possible, such as, for example, where the measured interval does not represent time, where the measurement of the interval is different, where more or fewer values or different values are used to represent the physical data, where cepstral coefficients are not used, where cepstral values and their first and/or second derivatives are also encoded, or where different numbers of bits are used to encode values. In speech encoded for audio playback, rather than recognition, different coefficients are typically used.
In some systems, the Xt vectors may be used directly to transmit or store the physical data, or to perform recognition or other types of processing. In the system just described, continuous transmission would require 416*100 bits per second (bps), or 41.6 kbps. Recognition systems based on original full cepstral vectors often employ Continuous Density Hidden Markov Models (CDHMMs), possibly with the probability functions of each model approximated by mixtures of Gaussians. The 08/276,742 application, incorporated above, discusses a method for sharing mixtures in such a system to enhance performance.
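The frame size and bit rate quoted above can be checked with a short sketch. The Python below is illustrative only: the constant and function names are assumptions introduced here, and packing the frame as IEEE 754 single-precision floats is just one possible 32-bit representation, not something the description specifies.

```python
import struct

FRAME_MS = 10          # frame interval from the example system
NUM_CEPSTRA = 13       # cepstral values C1..C13 per frame
BITS_PER_VALUE = 32    # each value stored in 32 bits


def bits_per_frame(num_values=NUM_CEPSTRA, bits=BITS_PER_VALUE):
    """Size of one observed vector in bits."""
    return num_values * bits


def bit_rate_bps(frame_ms=FRAME_MS):
    """Bits per second needed to send every frame unquantized."""
    frames_per_second = 1000 // frame_ms
    return bits_per_frame() * frames_per_second


def pack_frame(values):
    """Pack one observed vector as 32-bit floats (an assumed layout)."""
    if len(values) != NUM_CEPSTRA:
        raise ValueError("expected %d values" % NUM_CEPSTRA)
    return struct.pack("<%df" % NUM_CEPSTRA, *values)
```

With these assumptions, bits_per_frame() gives 416 and bit_rate_bps() gives 41600, matching the 41.6 kbps figure in the text.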
However, often it is desirable to perform further encoding of the vectors in order to reduce the number of bits needed to represent the vectors and in order to simplify further processing. One known method for doing this is called vector quantization, a type of which is shown in FIG. 1B.
Vector quantization (VQ) takes advantage of the fact that in most physical systems of interest, the values (C1 to C13) that make up a particular vector Xt are not independent but instead have a relationship one to another, such that the individual value of C3, for example, will have some non-random correlation to other values in that vector.
VQ also takes advantage of the fact that in most physical systems of interest, not all possible vectors will be observed. When encoding human speech in a particular language for example, many ranges of vector values (representing sounds that are not part of human speech) will never be observed, while other ranges of vectors will be common. Such relationships can be understood geometrically by imagining a continuous 13-dimensional space, which, though hard to visualize, shares many properties with real 3-dimensional space. In this continuous 13-dimensional space, every possible vector X will represent a point in the space. If one were to measure a large number of X vectors for a physical system of interest, such as human speech in a particular language, and plot a point for each measured X, the points plotted in space would not be evenly or randomly distributed, but would instead form distinct clusters. Areas of space that represented common sounds in human speech would have many points while areas of space that represented sounds that were never part of human speech would have no points.
In standard VQ, an analogous procedure is used to plot clusters and use those clusters to divide the space into a finite number of volumes. In the 13-dimensional example described above, a sample of human speech data is gathered, processed, and plotted in the 13-dimensional space and 13-dimensional volumes are drawn around dense clusters of points. The size and shape of a particular volume may be determined by the density of points in a particular region. In many systems, a predetermined number of volumes, such as 256, are drawn in the space in such a way as to completely fill the space. Each volume is assigned an index number (also referred to as a codeword) and a “central” point (or centroid) is computed for each volume, either geometrically from the volume or taking into account the actual points plotted and finding a central point. The codewords, the descriptions of the volumes to which they relate, and the centroids to which they are mapped, are sometimes referred to in the art as a codebook. Some systems use multiple codebooks, using a separate codebook for each feature that is quantized. Some systems also use different codebooks for different speakers or groups of speakers, for example using one codebook or set of codebooks for male speakers and another for female speakers.
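One way to picture how a codebook might be built from training data is the sketch below. It uses a plain k-means (Lloyd) iteration, which is a commonly used clustering method chosen here for illustration; the description above does not name a specific algorithm, and all function names and parameters are hypothetical. Each volume is represented implicitly as the set of points nearest a given centroid, and the position of a centroid in the list serves as its codeword.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Return the index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def train_codebook(points, num_codewords, iterations=20, seed=0):
    """Build a list of centroids; the list index serves as the codeword."""
    rng = random.Random(seed)
    # Start from randomly chosen training vectors (an assumed initialization).
    centroids = [list(p) for p in rng.sample(points, num_codewords)]
    for _ in range(iterations):
        # Assign every training vector to its nearest centroid.
        members = [[] for _ in centroids]
        for p in points:
            members[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its members; leave empty ones alone.
        for i, group in enumerate(members):
            if group:
                dim = len(group[0])
                centroids[i] = [sum(p[d] for p in group) / len(group)
                                for d in range(dim)]
    return centroids
```

A system of the kind described would call something like train_codebook(training_vectors, 256) on 13-dimensional vectors; production systems generally use more careful initialization and stopping criteria than this sketch.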
Once the volumes are determined from training data, new speech data may be encoded by mathematically plotting the 13-value vector in the 13-dimensional space, determining which volume the point falls in (or which centroid the point is closest to), and storing for that point the VQ index value (in one example, simply an 8-bit number from 0 to 255) for that volume. Xt is thus encoded as VQt. When it is time to decode the data, the 8-bit VQ index is used to look up the centroid for that volume, and the (416-bit) value of the centroid can be used as an approximation of the actual observed vector Xt. First and second derivatives can be computed from these decoded centroids, or those values can initially be encoded and stored similarly to the centroids.
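The encode and decode steps can be sketched as a nearest-centroid search followed by a table lookup. This is a minimal illustration under stated assumptions: the function names are hypothetical, squared Euclidean distance is assumed as the closeness measure, and the toy codebook values are invented purely so the example runs.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vq_encode(vector, codebook):
    """Return the index (codeword) of the centroid nearest to vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def vq_decode(index, codebook):
    """Look up the centroid for a codeword; it approximates the original."""
    return codebook[index]


def index_bits(codebook_size):
    """Bits needed to store one codeword for a codebook of this size."""
    return max(1, (codebook_size - 1).bit_length())


# Toy codebook with 256 entries in 13 dimensions (hypothetical values).
codebook = [[float(i)] * 13 for i in range(256)]
observed = [41.8] * 13
code = vq_encode(observed, codebook)     # index of the nearest centroid
approx = vq_decode(code, codebook)       # centroid used in place of observed
```

With a 256-entry codebook, each frame needs 8 bits instead of 416, so at 100 frames per second the rate drops from 41,600 bps to 800 bps, a factor of 52, at the cost of replacing each observed vector with its centroid.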
Digalakis Vassilios
Neumeyer Leonardo
Perakakis Manolis
Tsakalidis Stavros
Allen Kenneth R.
Korzuch William R.
Lerner Martin
SRI - International
Townsend and Townsend / and Crew LLP