Reexamination Certificate
2000-02-16
2004-05-04
Dorvil, Richemond (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
For storage or transmission
C704S220000
active
06732070
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to the field of coding and decoding synthesized speech. More particularly, the present invention relates to such coding and decoding of wideband speech.
BACKGROUND OF THE INVENTION
Abbreviations
A-b-S: Analysis-by-synthesis
CELP: Code excited linear prediction
HB: Higher band
LB: Lower band
LP: Linear prediction
LPC: Linear predictive coding
WB: Wideband
LSP: Line spectral pair
Definitions and Terminology
wideband signal: Signal that has a sampling rate of $F_s^{\mathrm{wide}}$, often having a value of 16 kHz.
lower band signal: Signal that contains frequencies from 0.0 Hz to $0.5F_s^{\mathrm{lower}}$ from the corresponding wideband signal and has the sampling rate of $F_s^{\mathrm{lower}}$, for example 12 kHz, which is smaller than $F_s^{\mathrm{wide}}$.
higher band signal: Signal that contains frequencies from $0.5F_s^{\mathrm{lower}}$ to $0.5F_s^{\mathrm{wide}}$ from the corresponding wideband signal and has the sampling rate of $F_s^{\mathrm{higher}}$, for example 4 kHz, and usually $F_s^{\mathrm{wide}} = F_s^{\mathrm{lower}} + F_s^{\mathrm{higher}}$.
residual: The output signal resulting from an inverse filtering operation.
excitation search: A search of codebooks for an excitation signal or a set of excitation signals that substantially matches a given residual. The outputs of an excitation search process, conducted by an analysis-by-synthesis module, are parameters (codewords) that describe the excitation signal or set of excitation signals found to match the residual. The parameters include two code vectors: one from an adaptive codebook, which includes excitations that are adapted for every subframe, and one from a fixed codebook, which includes a fixed (i.e. non-adapted) set of excitations.
x(n) A residual signal (innovation), i.e. a target signal for adaptive codebook search.
exc(n) An excitation signal intended to match the residual x(n).
A(z) The inverse filter with unquantized coefficients. The inverse filter removes short-term correlation from a speech signal. It models an inverse frequency response of the vocal tract of a (real or imagined) speaker.
Â(z) The inverse filter with quantified (quantized) coefficients.
H(z)=1/Â(z) A speech synthesis filter with quantified coefficients.
frame: A time interval usually equal to 20 ms (corresponding to 160 samples at an 8 kHz sampling rate). LP analysis is performed frame by frame.
subframe: A time interval usually equal to 5 ms (corresponding to 40 samples at an 8 kHz sampling rate). Excitation searching is performed subframe by subframe.
s(n) An original speech signal (to be encoded).
s′(n) A windowed speech signal.
ŝ(n) A reconstructed (by a decoder) speech signal.
h(n) The impulse response of an LP synthesis filter.
LSP A line spectral pair, i.e. a transformation of the LPC parameters. Line spectral pairs are obtained by decomposing the inverse filter transfer function A(z) into a set of two transfer functions, each a polynomial, one having even symmetry and the other having odd symmetry. The line spectral pairs are the roots of these polynomials on the unit circle in the z-plane. A set of LSP indices is used as one representation of an LP filter.
$T_{ol}$ Open-loop lag (associated with a pitch period, or a multiple or sub-multiple of a pitch period).
$R_w[\,]$ Correlation coefficients that are used as a representation of an LP filter.
LP coefficients: Generic term for describing short-term synthesis filter coefficients.
short-term synthesis filter: A filter that adds to an excitation signal a short-term correlation that models the impulse response of a vocal tract.
perceptual weighting filter: A filter used in an analysis-by-synthesis search of codebooks. It exploits the noise-masking properties of formants (vocal tract resonances) by weighting the error less near the formant frequencies.
zero-input response: The output of a synthesis filter due to past inputs but no present input, i.e. due solely to the present state of a filter resulting from past inputs.
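As a rough illustration of the synthesis filter and the zero-input response defined above, the following sketch implements an all-pole filter 1/A(z) in plain Python. The sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, the function names `synthesize` and `zero_input_response`, and the example coefficient are assumptions made for illustration; none of them is taken from the patent text.

```python
def synthesize(excitation, a, past_outputs=()):
    """All-pole filter 1/A(z), assuming A(z) = 1 + sum_k a[k-1] * z^-k.

    past_outputs holds earlier output samples (most recent last) and
    represents the filter state left over from past inputs.
    """
    order = len(a)
    # Pad with zeros so that exactly `order` past samples are always available.
    history = ([0.0] * order + list(past_outputs))[-order:] if order else []
    out = []
    for e in excitation:
        sample = e - sum(ak * history[-k] for k, ak in enumerate(a, start=1))
        out.append(sample)
        history.append(sample)
    return out


def zero_input_response(a, past_outputs, length):
    """Filter output caused only by the stored state: the input is all zeros."""
    return synthesize([0.0] * length, a, past_outputs)


# Hypothetical first-order filter whose last output sample was 1.0.
print(zero_input_response([-0.5], [1.0], 3))  # [0.5, 0.25, 0.125]
```

Because the filter is linear, the response to an excitation applied with a nonzero starting state equals the zero-state response plus the zero-input response, so the two contributions can be computed and handled separately.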
Discussion
Many methods of coding speech today are based upon linear predictive (LP) coding, which extracts perceptually significant features of a speech signal directly from the time waveform rather than from the frequency spectrum of the speech signal (as a channel vocoder or a formant vocoder does). In LP coding, a speech waveform is first analyzed (LP analysis) to determine a time-varying model of the vocal tract excitation that caused the speech signal, and also a transfer function. A decoder (in a receiving terminal when the coded speech signal is transmitted) then recreates the original speech using a synthesizer (for performing LP synthesis) that passes the excitation through a parameterized system that models the vocal tract. The parameters of the vocal tract model and the excitation of the model are both periodically updated to adapt to corresponding changes that occurred in the speaker as the speaker produced the speech signal. Between updates, i.e. during any specification interval, however, the excitation and parameters of the system are held constant, and so the process executed by the model is a linear time-invariant process. The overall (distributed) coding and decoding system is called a codec.
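To make the LP analysis step more concrete, here is a minimal sketch that derives predictor coefficients for one analysis interval from its autocorrelation using the Levinson-Durbin recursion. The patent text above does not name a specific analysis method, so the choice of this recursion, the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, and the function names are assumptions for illustration only.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of one (already windowed) frame."""
    return [
        sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Return (a, err): coefficients a[0..order-1] of A(z) = 1 + sum_k a[k-1] z^-k
    and the final prediction error energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop refining
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


# Autocorrelation of an ideal first-order process s(n) = 0.9 s(n-1) + e(n).
coeffs, err = levinson_durbin([1.0, 0.9, 0.81], order=2)
print(coeffs, err)
```

In a frame-based codec this computation would be repeated for every frame, which is what makes the resulting vocal tract model time-varying.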
In a codec using LP coding, to generate speech, the decoder needs the coder to provide three inputs: a pitch period if the excitation is voiced; a gain factor; and predictor coefficients. (In some codecs, the nature of the excitation, i.e. whether it is voiced or unvoiced, is also provided, but this is not normally needed in, for example, an ACELP codec.) LP coding is predictive in that it uses prediction parameters based on the actual input segments of the speech waveform (during a specification interval) to which the parameters are applied, in a process of forward estimation.
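The three decoder inputs listed above can be tied together in a small sketch of a basic LP decoder. It assumes the classic two-state excitation model (an impulse train at the pitch period for voiced speech, white noise for unvoiced speech), which the paragraph implies but does not spell out. The function names, the sign convention A(z) = 1 + a_1 z^-1 + ..., and the fixed random seed are hypothetical choices for illustration.

```python
import random


def make_excitation(length, gain, pitch_period=None, seed=0):
    """Impulse train when voiced (pitch_period in samples, at least 1),
    uniform white noise when unvoiced (pitch_period is None)."""
    if pitch_period is not None:
        return [gain if n % pitch_period == 0 else 0.0 for n in range(length)]
    rng = random.Random(seed)
    return [gain * rng.uniform(-1.0, 1.0) for _ in range(length)]


def lp_decode(length, gain, a, pitch_period=None):
    """Drive the all-pole synthesis filter 1/A(z) with the simple excitation."""
    excitation = make_excitation(length, gain, pitch_period)
    out = []
    for n, e in enumerate(excitation):
        feedback = sum(ak * out[n - k] for k, ak in enumerate(a, start=1) if n - k >= 0)
        out.append(e - feedback)
    return out


# Voiced example: pitch period of 2 samples, one predictor coefficient.
print(lp_decode(4, 1.0, [-0.5], pitch_period=2))  # [1.0, 0.5, 1.25, 0.625]
```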
Basic LP coding and decoding can be used to digitally communicate speech with a relatively low data rate, but it produces synthetic sounding speech because it uses a very simple system of excitation. A so-called code excited linear predictive (CELP) codec is an enhanced excitation codec. It is based on “residual” encoding. The modeling of the vocal tract is in terms of digital filters whose parameters are encoded in the compressed speech. These filters are driven, i.e. “excited,” by a signal that represents the vibration of the original speaker's vocal cords. A residual of an audio speech signal is the (original) audio speech signal less the digitally filtered audio speech signal. A CELP codec encodes the residual and uses it as a basis for excitation, in what is known as “residual pulse excitation.” However, instead of encoding the residual waveforms on a sample-by-sample basis, CELP uses a waveform template selected from a predetermined set of waveform templates in order to represent a block of residual samples. A codeword is determined by the coder and provided to the decoder, which then uses the codeword to select a residual sequence to represent the original residual samples.
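The template selection described above can be sketched as a small codebook search. This simplified version matches templates directly against a block of residual samples with a least-squares gain. A real CELP coder compares candidates after synthesis filtering and perceptual weighting (analysis-by-synthesis), so this is only an illustration under that simplifying assumption. The codebook contents and the function names are hypothetical.

```python
def search_codebook(target, codebook):
    """Return (index, gain, error) of the template that best matches target.

    For each template c the optimal gain is <target, c> / <c, c>, and the
    error is the squared distance between target and gain * c.
    Returns None if every template has zero energy.
    """
    best = None
    for index, code in enumerate(codebook):
        energy = sum(c * c for c in code)
        if energy == 0.0:
            continue  # an all-zero template cannot represent anything
        gain = sum(t * c for t, c in zip(target, code)) / energy
        error = sum((t - gain * c) ** 2 for t, c in zip(target, code))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


def decode_block(index, gain, codebook):
    """Decoder side: rebuild the block from the codeword (index) and gain."""
    return [gain * c for c in codebook[index]]


# Hypothetical 3-entry codebook of 4-sample templates.
codebook = [[1.0, 0.0, 0.0, 0.0], [1.0, -1.0, 1.0, -1.0], [1.0, 1.0, 1.0, 1.0]]
residual_block = [0.5, -0.5, 0.5, -0.5]
index, gain, error = search_codebook(residual_block, codebook)
print(index, gain, error)  # 1 0.5 0.0
```

Only the index and the gain need to be sent, which is why representing a block by a template costs far fewer bits than coding it sample by sample.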
FIG. 1A shows elements of a transmitter/encoder system and elements of a receiver/decoder system, the overall system serving as a codec, and based on an LP codec, which could be a CELP-type codec. The transmitter accepts a sampled speech signal s(n) and provides it to an analyzer that determines LP parameters (inverse filter and synthesis filter) for a codec. The signal s(n) is inverse filtered to determine the residual x(n). The excitation search module encodes for transmission both the residual x(n), as a quantified or quantized error $x_q(n)$, and the synthesizer parameters and applies them to a communication channel leading to the receiver. On the receiver (decoder system) side, a decoder module extracts the synthesizer parameters from the transmitted signal and provides them to a synthesizer. The decoder module also determines the quantified error $x_q(n)$ from the transmitted signal. The output from the synthesizer is combined with the quantified error $x_q(n)$ to produce a quantified value $s_q(n)$ representing the original speech signal s(n).
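The signal flow of FIG. 1A can be sketched end to end as follows. The sketch makes several assumptions that are not in the patent text: a uniform scalar quantizer stands in for the excitation search, the LP coefficients are passed to the decoder unquantized, A(z) is taken as 1 + a_1 z^-1 + ..., and all function names are hypothetical.

```python
def lp_residual(speech, a):
    """Transmitter: inverse filter, x(n) = s(n) + sum_k a[k-1] * s(n-k)."""
    return [
        s + sum(ak * speech[n - k] for k, ak in enumerate(a, start=1) if n - k >= 0)
        for n, s in enumerate(speech)
    ]


def quantize(values, step):
    """Uniform scalar quantizer (an assumed stand-in for the excitation search)."""
    return [step * round(v / step) for v in values]


def decode(residual_q, a):
    """Receiver: add each quantized residual sample to the synthesizer's
    prediction from previously reconstructed samples."""
    speech_q = []
    for n, xq in enumerate(residual_q):
        prediction = -sum(
            ak * speech_q[n - k] for k, ak in enumerate(a, start=1) if n - k >= 0
        )
        speech_q.append(prediction + xq)
    return speech_q


a = [-0.5]                      # hypothetical predictor coefficient
s = [1.0, 0.5, 0.25, 0.0]       # toy "speech" samples
x = lp_residual(s, a)           # residual x(n)
x_q = quantize(x, step=0.125)   # quantized residual x_q(n)
s_q = decode(x_q, a)            # reconstructed s_q(n)
print(x, x_q, s_q)
```

In this toy case the residual happens to fall exactly on the quantizer grid, so the reconstruction equals the input. With a coarser step, the quantization error would pass through the synthesis filter and appear in the reconstructed signal.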
A transmitter and receiver using a CELP-type codec function
Mikkola Hannu
Rotola-Pukkila Jani
Vainio Janne
Azad Abul K.
Dorvil Richemond
Nokia Mobile Phones Ltd.
Ware Fressola Van Der Sluys & Adolphson LLP