Data processing: speech signal processing – linguistics – language – Audio signal time compression or expansion
Reexamination Certificate
1998-09-26
2001-07-24
Dorvil, Richemond (Department: 2741)
Data processing: speech signal processing, linguistics, language
Audio signal time compression or expansion
C704S200100, C704S500000
Reexamination Certificate
active
06266644
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to audio encoding and decoding and more particularly, to apparatus and methods for encoding, storing, transferring, receiving, modifying and decoding audio data.
BACKGROUND OF THE INVENTION
Audio encoding is a process by which a typically digitized audible sound or “audio source” is converted (“encoded”) into an “encoded audio” form for storage, transfer and/or manipulation. In a complimentary fashion, audio decoding converts encoded audio (typically received from storage and/or via data transfer) into decoded audio, which can then be rendered and played back as audible sound. An audio encoding system typically includes at least one encoder and one decoder as integrated elements within one or more host processing systems.
Particularly problematic in encoding system design have been conflicting requirements of providing high fidelity audio, in a desired form, and using a minimally complex system. More specifically, an audio encoder will ideally deliver perceptually “lossless” encoded audio. That is, the encoded audio, when decoded and rendered, should sound identical to the source audio (i.e. with no audible artifacts or other perceivable loss of fidelity). On the other hand, it is also desirable to minimize the amount of encoded data in order to preserve available storage, throughput and bandwidth for other uses. To pursue these requirements, encoding system designers have relied largely on established data reduction methods which should theoretically preserve audio fidelity. However, to achieve high fidelity, such methods have failed to provide sufficiently low bit rates. Conversely, these methods, particularly when merely approximated to reduce bit rates and/or complexity, have not assured high fidelity.
Adding to these problems is the manner in which “processed” encoded audio is typically provided using conventional low bit-rate encoding methods. Processing, such as time and frequency modification, is conducted on non-encoded audio. Audio is typically stored and/or transferred in encoded form (thereby conserving storage or bandwidth), then decoded, then time or frequency modified, then re-encoded, and then once again transferred and/or stored (again conserving system resources). However, given the complexity of conventional encoding methods, decoding and re-encoding has become computationally expensive.
For example, in transform coding, a digital audio source is broken down into frames (typically about 5 to 50 milliseconds long). Each frame is then converted into spectral coefficients using a time-domain aliasing cancellation filter bank. Finally, the spectral coefficients are quantized according to a psychoacoustic model. During decoding, the quantized spectral coefficients are used to re-synthesize the encoded audio.
Advantageously, transform coding is relatively computationally efficient and is capable of producing perceptually lossless encoding. It is therefore preferred where high fidelity encoding of an audio source is critical. Unfortunately, such high fidelity comes at the cost of a large amount of encoded data or “high bit rate”. For example, a one minute audio would produce 480 kilobytes of transform-coded audio data, resulting in a compression ratio of only 11 to 1. Thus, transform coding is considered inappropriate for high compression applications. In addition, conventional methods used to quantize transform coded audio data can result in a substantial loss of its high fidelity benefits. Broadly stated, quantization is a form of data reduction in which approximations, which are considered substantially representative of actual data, are substituted for the actual data. Conventionally, transform coded data is quantized by encoding less than the complete frequency range of the audio source. Yet another disadvantage is that transform coding is that it is not considered amenable to time or frequency modification. With time modification, audio data is modified to playback faster or slower at the same pitch. Conversely, frequency modification alters the playback pitch without affecting playback speed.
Another example, conventional sinusoidal modeling, is used as an alternative to transform coding. In sinusoidal modeling, an entire audio data stream is analyzed in time increments or “windows”. For each window, a fast Fourier transform (“FFT”) is used to determine the primary audio frequency components or “spectral peaks” of the source audio. The spectral peaks are then modeled as a number of sine waves, each sine wave having specific amplitude, frequency and phase characteristics. Next, these characteristics or “sinusoidal parameter triads” are quantized. The resultant encoded audio is then typically stored and/or transferred. During decoding, each of the representative sine waves is synthesized from a corresponding set of sinusoidal parameters.
An advantage of sinusoidal modeling is that it tends to represent an audio source using a relatively small amount of data. For example, the above 1 minute audio source can be represented using only 120 kilobytes of encoded audio. Comparing the encoded audio to the audio source, this represents a data reduction or “compression ratio” of 44 to 1. (In practice, encoder designs are generally targeted at achieving compression ratio of 10 to 1 or more.)
Sinusoidal modeling is also generally well-suited to such audio data modifications as time and frequency scaling. In sinusoidal modeling, time scaling is conventionally achieved by altering the decoder's window length relative to the window length in the encoder. Frequency modification is further conventionally achieved by scaling the frequency information in the “parameter triads”. Both are well established methods.
Unfortunately, conventional sinusoidal modeling also has certain disadvantages. First, sinusoidal modeling poorly models certain audio components. Sound can be viewed as comprising a combination of short tonal or atonal attacks or “transients” (e.g. striking a drum), as well as relatively stable tonal components (“steady-state”) and noise components. While sinusoidal modeling represents steady-state portions relatively well, it does a relatively poor job of representing transients and noise. Transients, many of which are crisp combinations of tone and noise, tend to become muddled. Further, an attempt to represent transients or noise using sinusoids requires an large number of short sine waves, thereby increasing bit rate. In addition, while sinusoidal encoding is generally well suited to data modification or “audio processing”, it tends to exaggerate the above deficiencies with regard to transients. For example, time compression and expansion tend to unnaturally sharpen or muddle transients, and frequency modification tends to unnaturally color the tone quality of transients.
Another sinusoidal modeling disadvantage is that conventional methods used to quantize sinusoidal models tend to cause a readily perceived degradation of audio fidelity. One approach to sinusoidal model quantization is based on established human hearing limitations. It is well-known that, where a listener is presented with two sound components that are close in frequency, a lower energy component can be masked by a higher energy component. More simply, louder audio components can mask softer ones. Thus, an analysis of an audio source can be conducted according to a “psychoacoustic model” of human hearing. Then, those frequency components which the model suggests would be masked (i.e. would not be heard by a listener) are discarded.
This first approach is typically implemented according to one of the following methods. In a first method, masking is presumed. That is, all sinusoids measured as being below a predetermined threshold energy level are summarily presumed to be masked and are therefore discarded. In a second method, a psychoacoustic analysis is performed on each frame of sinusoids and those sinusoids which are deemed inaudible due to determined masking effects are discarded. A third method adds an iterative aspect to the
Dorvil Richemond
Ivey James D.
Liquid Audio Inc.
LandOfFree
Audio encoding apparatus and methods does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Audio encoding apparatus and methods, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Audio encoding apparatus and methods will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2525293