Speech coding

Type: Reexamination Certificate (active)
Filed: 1999-03-04
Issued: 2002-10-22
Examiner: Knepper, David D. (Department: 2654)
Classification: Data processing: speech signal processing, linguistics, language – Speech signal processing – For storage or transmission
US Class: C704S222000
Patent Number: 06470313
FIELD OF THE INVENTION
The present invention relates to speech coding and more particularly to the coding of speech signals in discrete time subframes containing digitised speech samples. The present invention is applicable in particular, though not necessarily, to variable bit-rate speech coding.
BACKGROUND OF THE INVENTION
In Europe, the accepted standard for digital cellular telephony is known under the acronym GSM (Global System for Mobile communications). A recent revision of the GSM standard has resulted in the specification of a new speech coding algorithm (or codec) known as Enhanced Full Rate (EFR). As with conventional speech codecs, EFR is designed to reduce the bit-rate required for an individual voice or data communication. By minimising this rate, the number of separate calls which can be multiplexed onto a given signal bandwidth is increased.
A very general illustration of the structure of a speech encoder similar to that used in EFR is shown in FIG. 1. A sampled speech signal is divided into 20 ms frames x, each containing 160 samples. Each sample is represented digitally by 16 bits. The frames are encoded in turn by first applying them to a linear predictive coder (LPC) 1 which generates for each frame a set of LPC coefficients a. These coefficients are representative of the short term redundancy in the frame.
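For concreteness, a minimal sketch of this LPC analysis step (autocorrelation method plus Levinson-Durbin recursion, with the 10th order model typical of EFR) is given below; the windowing, lag windowing and bandwidth expansion of a real codec are omitted, and all names are our own:

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve for LPC coefficients a = [1, a1, ..., ap] from
    autocorrelations r[0..order] by Levinson-Durbin recursion."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]   # symmetric coefficient update
        a[i] = k
        err *= (1.0 - k * k)                  # residual prediction error
    return a, err

# One 20 ms frame of 160 samples (as in the document); random data
# stands in for speech here.
rng = np.random.default_rng(0)
frame = rng.standard_normal(160)
r = np.array([np.dot(frame[:160 - k], frame[k:]) for k in range(11)])
a, prediction_error = levinson_durbin(r, order=10)

# The residual r1 is the frame filtered through the analysis filter A(z)
residual = np.convolve(frame, a)[:160]
```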
The output from the LPC 1 comprises the LPC coefficients a and a residual signal r1 produced by removing the short term redundancy from the input speech frame using an LPC analysis filter. The residual signal is then provided to a long term predictor (LTP) 2 which generates a set of LTP parameters b which are representative of the long term redundancy in the residual signal r1, and also a residual signal s from which the long term redundancy is removed. In practice, long term prediction is a two stage process, involving (1) a first open loop estimate of a set of LTP parameters for the entire frame (a simple form of this open loop search is sketched below) and (2) a second closed loop refinement of the estimated parameters to generate a set of LTP parameters for each 40 sample subframe of the frame. The residual signal s provided by the LTP 2 is in turn filtered through filters 1/A(z) and W(z) (shown commonly as block 2a in FIG. 1) to provide a weighted residual signal s̃. The first of these filters is an LPC synthesis filter whilst the second is a perceptual weighting filter emphasising the "formant" structure of the spectrum. Parameters for both filters are provided by the LPC analysis stage (block 1).
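A simple form of the stage (1) open loop lag search might look as follows; the lag search range is an assumption for illustration, and production codecs search a decimated, weighted signal rather than the raw residual:

```python
import numpy as np

def open_loop_lag(res, lag_min=18, lag_max=143):
    """First-stage (open loop) LTP estimate: choose the lag that
    maximises the normalised autocorrelation of the LPC residual.
    `res` must be longer than lag_max (e.g. a 160 sample frame)."""
    best_lag, best_score = lag_min, -np.inf
    for lag in range(lag_min, lag_max + 1):
        x, y = res[lag:], res[:len(res) - lag]   # signal vs. delayed signal
        energy = np.dot(y, y)
        score = np.dot(x, y) ** 2 / energy if energy > 0.0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```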
An algebraic excitation codebook 3 is used to generate excitation (or innovation) vectors c. For each 40 sample subframe (four subframes per frame), a number of different "candidate" excitation vectors are applied in turn, via a scaling unit 4, to an LTP synthesis filter 5. This filter 5 receives the LTP parameters for the current subframe and introduces into the excitation vector the long term redundancy predicted by the LTP parameters. The resulting signal is then provided to an LPC synthesis filter 6 which receives the LPC coefficients for successive frames. For a given subframe, a set of LPC coefficients is generated using frame to frame interpolation and the generated coefficients are in turn applied to generate a synthesized signal ss.
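The filter chain from scaling unit 4 through filters 5 and 6 can be sketched as below; the single-tap LTP predictor and the short lag (so that no past-excitation buffer is needed) are simplifications for illustration:

```python
import numpy as np
from scipy.signal import lfilter

def synthesize(c, g_c, lag, b, a):
    """Pass a scaled candidate excitation through the LTP synthesis
    filter (here a single-tap predictor 1/(1 - b*z^-lag), a
    simplification) and then the LPC synthesis filter 1/A(z) to get
    a candidate synthesized subframe ss."""
    out = g_c * np.asarray(c, dtype=float)
    for n in range(len(out)):           # LTP synthesis: restore long term redundancy
        if n - lag >= 0:
            out[n] += b * out[n - lag]
    return lfilter([1.0], a, out)       # LPC synthesis: restore short term redundancy

# Illustrative call: a 40 sample subframe with a lag shorter than the
# subframe; a real codec keeps a buffer of past excitation instead.
c = np.zeros(40); c[[0, 11, 22, 33]] = [1, -1, 1, -1]
ss = synthesize(c, g_c=1.5, lag=25, b=0.8, a=[1.0, -0.9])
```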
The encoder of FIG. 1 differs from earlier Code Excited Linear Prediction (CELP) encoders which utilise a codebook containing a predefined set of excitation vectors. The former type of encoder instead relies upon the algebraic generation and specification of excitation vectors (see for example WO9624925) and is sometimes referred to as an Algebraic CELP or ACELP. More particularly, quantised vectors d(i) are defined which contain 10 non-zero pulses. All pulses can have the amplitudes +1 or −1. The 40 sample positions (i=0 to 39) in a subframe are divided into 5 "tracks", where each track contains two pulses (i.e. at two of the eight possible positions), as shown in the following table.
TABLE 1
Potential positions of individual pulses in the algebraic codebook.

Track    Pulses      Positions
1        i0, i5      0, 5, 10, 15, 20, 25, 30, 35
2        i1, i6      1, 6, 11, 16, 21, 26, 31, 36
3        i2, i7      2, 7, 12, 17, 22, 27, 32, 37
4        i3, i8      3, 8, 13, 18, 23, 28, 33, 38
5        i4, i9      4, 9, 14, 19, 24, 29, 34, 39
Each pair of pulse positions in a given track is encoded with 6 bits (i.e. 3 bits for each pulse, giving a total of 30 bits), whilst the sign of the first pulse in the track is encoded with 1 bit (a total of 5 bits). The sign of the second pulse is not specifically encoded but rather is derived from its position relative to the first pulse. If the sample position of the second pulse is prior to that of the first pulse, then the second pulse is defined as having the opposite sign to the first pulse, otherwise both pulses are defined as having the same sign. All of the 3-bit pulse positions are Gray coded in order to improve robustness against channel errors, allowing the quantised vectors to be encoded with a 35-bit algebraic code u.
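The pulse encoding just described can be illustrated as follows; the bit ordering within the 35-bit word is our own choice, not the EFR wire format:

```python
def gray3(n):
    """3-bit binary-reflected Gray code of n in 0..7."""
    return n ^ (n >> 1)

def encode_algebraic_code(pulses):
    """Pack ten pulses into a 35-bit integer: per track, one sign bit for
    the first pulse plus two Gray-coded 3-bit position indices.
    `pulses` is a list of five (pos1, sign1, pos2) tuples, one per track
    0..4 (numbered 1..5 in Table 1); positions are the absolute sample
    indices from that track's row. The second pulse's sign is implicit
    in the relative order of the two positions, so it is not stored."""
    code = 0
    for track, (pos1, sign1, pos2) in enumerate(pulses):
        i1 = (pos1 - track) // 5           # index 0..7 within the track
        i2 = (pos2 - track) // 5
        sign_bit = 0 if sign1 > 0 else 1
        word = (sign_bit << 6) | (gray3(i1) << 3) | gray3(i2)
        code = (code << 7) | word          # 5 tracks x 7 bits = 35 bits
    return code

# Example: track 1 has its first pulse at position 10 with sign +1 and its
# second at position 35 (same sign, since 35 > 10), and so on.
example = [(10, +1, 35), (6, -1, 1), (2, +1, 37), (38, -1, 3), (19, +1, 24)]
print(f"{encode_algebraic_code(example):035b}")
```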
In order to generate the excitation vector c(i), the quantised vector d(i) defined by the algebraic code u is filtered through a pre-filter F_E(z) which enhances special spectral components in order to improve synthesized speech quality. The pre-filter (sometimes known as a "colouring" filter) is defined in terms of certain of the LTP parameters generated for the subframe.
As with the conventional CELP encoder, a difference unit 7 determines the error between the synthesized signal and the input signal on a sample by sample basis (and subframe by subframe). A weighting filter 8 is then used to weight the error signal to take account of human audio perception. For a given subframe, a search unit 9 selects a suitable excitation vector {c(i), i=0 to 39} from the set of candidate vectors generated by the algebraic codebook 3, by identifying the vector which minimises the weighted mean square error. This process is commonly known as "vector quantisation".
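The search criterion can be written out directly, though a real search unit uses fast correlation techniques rather than this exhaustive loop; the function interface here is illustrative:

```python
import numpy as np

def search_codebook(candidates, synth, target, weight):
    """Exhaustive form of the vector quantisation step: return the
    candidate excitation whose synthesized output is closest to the
    target under the perceptual weighting. `synth` maps an excitation
    to a synthesized subframe and `weight` applies the perceptual
    weighting filter to an error signal."""
    best_c, best_err = None, np.inf
    for c in candidates:
        e = weight(target - synth(c))      # weighted error signal
        mse = np.dot(e, e)                 # its energy (weighted MSE up to 1/N)
        if mse < best_err:
            best_c, best_err = c, mse
    return best_c
```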
As already noted, the excitation vectors are multiplied at the scaling unit 4 by a gain g_c. A gain value is selected which results in the scaled excitation vector having an energy equal to the energy of the weighted residual signal s̃ provided by the LTP 2. The gain is given by:
g_c = (s̃^T H c(i)) / (c(i)^T H^T H c(i))    (1)
where H is the linear prediction model (LTP and LPC) impulse response matrix.
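Equation (1) translates directly into code; the helper that builds H as a lower triangular Toeplitz matrix from the combined impulse response is a standard construction, shown here for completeness:

```python
import numpy as np

def impulse_response_matrix(h, n):
    """Lower triangular Toeplitz matrix built from the first n samples
    of the combined LTP+LPC impulse response h, so that H @ c filters
    the excitation c through the linear prediction model."""
    H = np.zeros((n, n))
    for i in range(n):
        H[i, : i + 1] = h[i::-1]
    return H

def optimal_gain(s_tilde, H, c):
    """Equation (1): the least squares scaling of the filtered
    excitation Hc onto the weighted residual s~."""
    z = H @ c
    return float(np.dot(s_tilde, z) / np.dot(z, z))
```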
It is necessary to incorporate gain information into the encoded speech subframe, together with the algebraic code defining the excitation vector, to enable the subframe to be accurately reconstructed. However, rather than incorporating the gain g_c directly, a predicted gain ĝ_c is generated in a processing unit 10 from previous speech subframes, and a correction factor determined in a unit 11, i.e.:
γ_gc = g_c / ĝ_c    (2)
The correction factor is then quantised using vector quantisation with a gain correction factor codebook comprising 5-bit code vectors. It is the index vector v_γ identifying the quantised gain correction factor γ̂_gc which is incorporated into the encoded frame. Assuming that the gain g_c varies little from frame to frame, γ_gc ≅ 1 and can be accurately quantised with a relatively short codebook.
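A sketch of this correction factor quantisation follows; the 32-entry codebook values below are illustrative placeholders, not the EFR tables:

```python
import numpy as np

# An illustrative 5-bit (32 entry) correction factor codebook clustered
# around 1.0; the real EFR table values are not reproduced here.
GAMMA_CODEBOOK = np.linspace(0.2, 2.0, 32)

def quantise_gain_correction(g_c, g_hat, codebook=GAMMA_CODEBOOK):
    """Equation (2): form gamma = g_c / g_hat and return the index of
    the nearest codebook entry, which is what gets transmitted."""
    gamma = g_c / g_hat
    idx = int(np.argmin(np.abs(codebook - gamma)))
    return idx, float(codebook[idx])

idx, gamma_q = quantise_gain_correction(g_c=1.30, g_hat=1.22)
```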
In practice, the predicted gain ĝ_c is derived using a moving average (MA) prediction with fixed coefficients. A 4th order MA prediction is performed on the excitation energy as follows. Let E(n) be the mean-removed excitation energy (in dB) at subframe n, given by:
E(n) = 10 log[ (1/N) g_c² Σ_{i=0}^{N−1} c²(i) ] − Ē    (3)
where N=40 is the subframe size, c(i) is the excitation vector (including pre-filtering), and Ē = 36 dB is a predetermined mean of the typical excitation energy.
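Equation (3) and the 4th order MA prediction can be sketched as follows; the MA coefficients shown are those commonly quoted for EFR-style gain predictors but should be treated as illustrative here, and the inversion back to a predicted gain ĝ_c is our own addition for completeness:

```python
import numpy as np

E_MEAN = 36.0   # dB, the predetermined mean excitation energy from the text
N = 40          # subframe size

def mean_removed_energy(g_c, c):
    """Equation (3): mean-removed excitation energy (dB) of a subframe."""
    return 10.0 * np.log10((g_c ** 2 / N) * np.dot(c, c)) - E_MEAN

def predicted_energy(past_errors, b=(0.68, 0.58, 0.34, 0.19)):
    """4th order MA prediction of E(n) from the four most recent
    (quantised) energy prediction errors, newest first."""
    return float(np.dot(b, past_errors[:4]))

def predicted_gain(e_pred, c):
    """Invert equation (3) to turn a predicted energy back into a
    predicted gain g_hat for the current excitation vector."""
    return float(np.sqrt(10.0 ** ((e_pred + E_MEAN) / 10.0) * N / np.dot(c, c)))
```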
The energy for the subframe n
Examiner: Knepper, David D.
Assignee: Nokia Mobile Phones Ltd.
Agent: Perman & Green, LLP