Systems and methods for encoding and decoding speech for...

Multiplex communications – Pathfinding or routing – Combined circuit switching and packet switching


Details

C704S207000

Reexamination Certificate

active

06389006

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to systems and methods for transmitting speech and voice over a packet data network.
BACKGROUND OF THE INVENTION
Packet data networks send packets of data from one computer to another. They can be configured as local area networks (LANs) or as wide area networks (WANs). One example of the latter is the Internet.
Each packet of data is separately addressed and sent by the transmitting computer. The network routes each packet separately and thus, each packet might take a different amount of time to arrive at the destination. When the data being sent is part of a file which will not be touched until it has completely arrived, the varying delays are of no concern.
However, files and email messages are not the only type of data sent on packet data networks. Recently, it has become possible to also send real-time voice signals, thereby providing the ability to have voice conversations over the networks. In a voice conversation, each voice data packet must be played shortly after it is received, which becomes difficult if a data packet is significantly delayed; a packet which arrives very late is equivalent to a lost packet. On the Internet, 5%-25% of the packets are lost and, as a result, Internet phone conversations are often very choppy.
One solution is to increase the delay between receiving a packet and playing it, thereby allowing late packets to be received. However, if the delay is too large, the phone conversation becomes awkward.
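By way of illustration only, this trade-off can be sketched as a fixed playout deadline; the constant, the names and the 80 ms figure below are hypothetical assumptions, not values from the patent or any standard:

```python
# A minimal playout-deadline sketch: each packet must arrive within a
# fixed playout delay of its send time, otherwise the receiver treats it
# exactly like a lost packet. The 80 ms figure is illustrative only.

PLAYOUT_DELAY = 0.08  # seconds; larger absorbs more network jitter, but
                      # too large makes the conversation awkward

def arrived_in_time(send_time: float, arrival_time: float) -> bool:
    return arrival_time - send_time <= PLAYOUT_DELAY

print(arrived_in_time(0.0, 0.05))  # True  -> play normally
print(arrived_in_time(0.0, 0.12))  # False -> conceal as if lost
```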
Standards for compressing voice signals exist which define how to compress (or encode) and decompress (or decode) the voice signal and how to create the packet of compressed data. The standards also define how to function in the presence of packet loss.
Most vocoders (systems which encode and decode voice signals) utilize already stored information regarding previous voice packets to interpolate what the lost packet might sound like. For example, FIGS. 1A, 1B and 1C illustrate a typical vocoder and its operation, where FIG. 1A illustrates the encoder 10, FIG. 1B illustrates the operation of a pitch processor and FIG. 1C illustrates the decoder 12. Examples of many commonly utilized methods are described in the book by Sadaoki Furui, Digital Speech Processing, Synthesis and Recognition, Marcel Dekker Inc., New York, N.Y., 1989. This book and the articles in its bibliography are incorporated herein by reference.
The encoder 10 receives a digitized frame of speech data and includes a short term component analyzer 14, such as a linear prediction coding (LPC) processor, a long term component analyzer 16, such as a pitch processor, a history buffer 18, a remnant excitation processor 20 and a packet creator 17. The LPC processor 14 determines the spectral coefficients (e.g. the LPC coefficients) which define the spectral envelope of each frame and, using the spectral coefficients, creates a noise shaping filter with which to filter the frame. Thus, the speech signal output of the LPC processor 14, a “residual signal”, is generally devoid of the spectral information of the frame. An LPC converter 19 converts the LPC coefficients to a more transmittable form, known as “LSP” coefficients.
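The short-term analysis step may be sketched as follows, assuming the common autocorrelation method with a Levinson-Durbin recursion; the text does not mandate this particular algorithm, and the analysis order of 10 is a typical choice rather than one taken from the patent:

```python
import numpy as np
from scipy.signal import lfilter

# Sketch of the short-term (LPC) analysis: compute predictor coefficients
# with the autocorrelation method, then inverse-filter the frame with
# A(z) so the output lacks the spectral envelope (the "residual signal").

def lpc_coefficients(frame, order=10):
    """Return a[0..order] with a[0] = 1, defining A(z) = 1 + a1*z^-1 + ..."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff
        a[1:i + 1] += k * np.concatenate((a[i - 1:0:-1], [1.0]))
        err *= (1.0 - k * k)
    return a

def residual(frame, a):
    """Inverse-filter the frame with A(z)."""
    return lfilter(a, [1.0], frame)

frame = np.random.randn(160)   # stand-in for one digitized speech frame
a = lpc_coefficients(frame)
res = residual(frame, a)       # the residual fed to the pitch stage
```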
The pitch processor 16 analyses the residual signal, which includes therein periodic spikes which define the pitch of the signal. To determine the pitch, pitch processor 16 correlates the residual signal of the current frame to residual signals of previous frames, produced as described hereinbelow with respect to FIG. 1B. The offset at which the correlation signal has the highest value is the pitch value for the frame. In other words, the pitch value is the number of samples prior to the start of the current frame at which the current frame best matches previous frame data. Pitch processor 16 then determines a long-term prediction which models the fine structure in the spectra of the speech in a subframe, typically of 40-80 samples. The resultant modeled waveform is subtracted from the signal in the subframe, thereby producing a “remnant” signal which is provided to remnant excitation processor 20 and is stored in the history buffer 18.
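A rough sketch of this open-loop correlation search follows; the residual history is assumed to be a one-dimensional array ending just before the current frame, and the lag range (20-147 samples, roughly 2.5-18 ms at 8 kHz) is an illustrative assumption, not a value from the patent:

```python
import numpy as np

# Open-loop pitch search: correlate the current residual frame against
# the residual history at each candidate lag and keep the lag with the
# highest normalized correlation.

def estimate_pitch(history, current, min_lag=20, max_lag=147):
    n = len(current)
    best_lag, best_score = min_lag, -np.inf
    for lag in range(min_lag, max_lag + 1):
        start = len(history) - lag
        past = history[start:start + n]
        if len(past) < n:                           # lag shorter than frame:
            past = np.tile(past, n // lag + 1)[:n]  # repeat (discussed below)
        norm = np.sqrt(np.dot(past, past) * np.dot(current, current))
        score = np.dot(past, current) / norm if norm > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Toy check: a residual with a 50-sample period yields a lag of 50.
sig = np.sin(2 * np.pi * np.arange(400) / 50)
print(estimate_pitch(sig[:240], sig[240:300]))  # -> 50
```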
FIG. 1B schematically illustrates the operation of pitch processor 16, where the residual signal of the current frame is shown to the right of a line 11 and data in the history buffer is shown to its left. Pitch processor 16 takes a window 13 of data which is the same length as the current frame and which begins P samples before line 11, where P is the current pitch value to be tested, and provides window 13 to an LPC synthesizer 15.
If the pitch value P is less than the size of a frame, there will not be enough history data to fill a frame. In this case, pitch processor 16 creates window 13 by repeating the data from the history buffer until the window is full.
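A minimal sketch of this repetition, with a hypothetical helper name and toy values:

```python
import numpy as np

# Short-pitch case: the P available history samples are simply repeated
# until the window reaches one frame length.

def build_window(history, pitch, frame_len):
    tail = history[-pitch:]              # last P samples of the history
    reps = -(-frame_len // pitch)        # ceiling division
    return np.tile(tail, reps)[:frame_len]

print(build_window(np.arange(10.0), pitch=4, frame_len=10))
# -> [6. 7. 8. 9. 6. 7. 8. 9. 6. 7.]
```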
Synthesizer 15 then synthesizes the residual signal associated with the window 13 of data by utilizing the LPC coefficients. Typically, synthesizer 15 also includes a formant perceptual weighting filter which aids in the synthesis operation. The synthesized signal, shown at 21, is then compared to the current frame and the quality of the difference signal is noted. The process is repeated for a multiplicity of values of the pitch P, and the selected pitch P is the one whose synthesized signal is closest to the current residual signal (i.e. the one which has the smallest difference signal).
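A condensed sketch of this closed-loop selection, assuming LPC coefficients `a` from the analysis step above and omitting the perceptual weighting filter for brevity:

```python
import numpy as np
from scipy.signal import lfilter

# Closed-loop selection: pass each candidate window through the LPC
# synthesis filter 1/A(z); the lag whose output is closest to the current
# frame (smallest difference-signal energy) wins.

def select_pitch(history, frame, a, lags):
    n = len(frame)
    best_lag, best_err = None, np.inf
    for lag in lags:
        window = history[len(history) - lag:][:n]
        if len(window) < n:                         # short pitch: repeat
            window = np.tile(window, n // lag + 1)[:n]
        synth = lfilter([1.0], a, window)           # synthesize via 1/A(z)
        err = np.sum((frame - synth) ** 2)          # difference energy
        if err < best_err:
            best_lag, best_err = lag, err
    return best_lag
```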
The remnant excitation processor 20 characterizes the shape of the remnant signal, and the characterization is provided to packet creator 17. Packet creator 17 combines the LPC spectral coefficients, the pitch value and the remnant characterization into a packet of data and sends it to decoder 12 (FIG. 1C), which includes a packet receiver 25, a selector 22, an LSP converter 24, a history buffer 26, a summer 28, an LPC synthesizer 30 and a post-filter 32.
Packet receiver 25 receives the packet and separates the packet data into the pitch value, the remnant signal and the LSP coefficients. LSP converter 24 converts the LSP coefficients to LPC coefficients.
History buffer 26 stores previous residual signals up to the present moment, and selector 22 utilizes the pitch value to select a relevant window of the data from history buffer 26. The selected window of the data is added to the remnant signal (by summer 28) and the result is stored in history buffer 26 as a new signal. The new signal is also provided to LPC synthesis unit 30 which, using the LPC coefficients, produces a speech waveform. Post-filter 32 then distorts the waveform, also using the LPC coefficients, to reproduce the input speech signal in a way which is pleasing to the human ear.
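The decoder path just described may be sketched end to end as follows; the names are illustrative and the post-filter is omitted:

```python
import numpy as np
from scipy.signal import lfilter

# Decoder sketch: select a history window by the received pitch, add the
# remnant (the summer), refresh the history with the result, then run
# LPC synthesis.

def decode_frame(history, pitch, remnant, a):
    n = len(remnant)
    window = history[len(history) - pitch:][:n]
    if len(window) < n:                           # short pitch: repeat
        window = np.tile(window, n // pitch + 1)[:n]
    excitation = window + remnant                 # summer output
    history = np.concatenate((history, excitation))  # store as new signal
    speech = lfilter([1.0], a, excitation)        # 1/A(z) synthesis
    return speech, history
```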
In the G.723 vocoder standard of the International Telecommunication Union (ITU), remnants are interpolated in order to reproduce a lost packet. The remnant interpolation is performed in two different ways, depending on the state of the last good frame prior to the lost, or erased, frame. The state of the last good frame is checked with a voiced/unvoiced classifier.
The classifier is based on a cross-correlation maximization function. The last 120 samples of the last good frame (the “vector”) are cross-correlated with a drift of up to three samples. The index which reaches the maximum correlation value is chosen as the interpolation index candidate. Then, the prediction gain of the best vector is tested. If its gain is more than 2 dB, the frame is declared voiced; otherwise, the frame is declared unvoiced.
The classifier returns 0 for the unvoiced case and the estimated pitch value for the voiced case. If the frame was declared unvoiced, an average gain is saved. If the current frame is marked as erased and the previous frame is classified as unvoiced, the remnant signal for the current frame is generated using a uniform random number generator. The random number generator output is scaled using the previously computed gain value.
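A hedged sketch of this classifier, mirroring the description above rather than the exact G.723 procedure; the prior pitch estimate is assumed to be larger than three samples:

```python
import numpy as np

# Voiced/unvoiced test: cross-correlate the last 120 samples at lags
# within +/-3 samples ("drift") of a prior pitch estimate, and keep the
# winning lag only if its prediction gain exceeds 2 dB.

def classify(signal, prev_pitch, vector_len=120):
    vector = signal[-vector_len:]
    best_lag, best_corr = 0, -np.inf
    for drift in range(-3, 4):
        lag = prev_pitch + drift
        past = signal[-vector_len - lag:-lag]
        corr = np.dot(vector, past)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    past = signal[-vector_len - best_lag:-best_lag]
    pred = (np.dot(vector, past) / np.dot(past, past)) * past  # LS predictor
    err = vector - pred
    gain_db = 10 * np.log10(np.dot(vector, vector) / np.dot(err, err))
    return best_lag if gain_db > 2.0 else 0       # 0 marks unvoiced
```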
In the voiced case, the current frame is regenerated with periodic excitation at the estimated pitch value.
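The two concealment branches may be sketched together as follows; this is a plausible reading of the description, not the standard's exact procedure, and the names are illustrative:

```python
import numpy as np

# Erasure concealment: an unvoiced previous frame yields gain-scaled
# uniform noise; a voiced one yields periodic repetition at the
# estimated pitch.

def conceal_erased_frame(history, classifier_pitch, unvoiced_gain,
                         frame_len, rng=None):
    if classifier_pitch == 0:                 # previous frame was unvoiced
        rng = rng if rng is not None else np.random.default_rng()
        return unvoiced_gain * rng.uniform(-1.0, 1.0, frame_len)
    tail = history[-classifier_pitch:]        # voiced: repeat one period
    return np.tile(tail, frame_len // classifier_pitch + 1)[:frame_len]
```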
