Data processing: speech signal processing – linguistics – language – Audio signal time compression or expansion
Reexamination Certificate
2000-07-26
2004-04-06
Dorvil, Richemond (Department: 2654)
C341S061000
active
06718309
ABSTRACT:
FIELD OF THE INVENTION
This invention relates generally to digital audio signal processing. More particularly, it relates to a method for modifying the output rate of audio signals without changing the pitch, using an improved synchronized overlap-and-add (SOLA) algorithm.
BACKGROUND ART
A variety of applications require modification of the playback rate of audio signals. Techniques falling within the category of Time Scale Modification (TSM) include both compression (i.e., speeding up) and expansion (i.e., slowing down). Audio compression applications include speeding up radio talk shows to permit more commercials, allowing users or disc jockeys to select a tempo for dance music, speeding up playback rates of dictation material, speeding up playback rates of voicemail messages, and synchronizing audio and video playback rates. Regardless of the type of input signal—speech, music, or combined speech and music—the goal of TSM is to preserve the pitch of the input signal while changing its tempo. Clearly, simply increasing or decreasing the playing rate necessarily changes pitch.
The synchronized overlap-and-add technique was introduced in 1985 by S. Roucos and A. M. Wilgus in “High Quality Time Scale Modification for Speech,” IEEE Int. Conf. ASSP, 493-496, and is still the foundation for many recently developed techniques. The method is illustrated schematically in FIG. 1A. A digital input signal 10 is obtained by digitally sampling an analog audio signal to obtain a series of time domain samples x(t). Input signal 10 is divided into overlapping windows, blocks, or frames 12, each containing N samples and offset from one another by S_a samples (“a” for analysis). Scaled output 14 contains samples y(t) of the same overlapping windows, offset from each other by a different number of samples, S_s (“s” for synthesized). Output 14 is generated by successively overlapping input windows 12 with a different time lag than is present in input 10. The time scale ratio α is defined as S_a/S_s; α>1 for compression and α<1 for expansion. A weighting function, such as a linear cross-fade, illustrated in FIG. 1B, is used to combine overlapped windows. To overlap an input block 16 with an output block 18, samples in the overlapped regions of input block 16 are scaled by a linearly increasing function, while samples in output block 18 are scaled by a linearly decreasing function, to generate new output signal 20. Note that the SOLA method changes the overall rate of the signal without changing the rates of individual windows, thereby preserving pitch.
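The linear cross-fade described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names `linear_crossfade` and `overlap_add`, the use of plain lists of samples, and the particular ramp (i + 1)/(n + 1) are assumptions made for the example. The only property taken from the text is that the incoming input block is weighted by a linearly increasing function and the existing output by a linearly decreasing one.

```python
def linear_crossfade(output_tail, input_head):
    """Blend two equal-length overlapping regions with linear weights.

    output_tail: samples already in the output (decreasing weight).
    input_head:  first samples of the new input block (increasing weight).
    The ramp shape (i + 1) / (n + 1) is an assumption of this sketch.
    """
    if len(output_tail) != len(input_head):
        raise ValueError("overlap regions must have equal length")
    n = len(output_tail)
    blended = []
    for i in range(n):
        w_in = (i + 1) / (n + 1)  # rises across the overlap
        w_out = 1.0 - w_in        # falls across the overlap
        blended.append(w_out * output_tail[i] + w_in * input_head[i])
    return blended


def overlap_add(output, block, position):
    """Overlap `block` onto the end of `output`, starting at index `position`."""
    overlap = len(output) - position
    if overlap <= 0 or overlap > len(block):
        raise ValueError("block must overlap the end of the output")
    mixed = linear_crossfade(output[position:], block[:overlap])
    # Keep the untouched output, then the blended region, then the rest of the block.
    return output[:position] + mixed + block[overlap:]
```

Because the two weights sum to one at every sample, a region in which input and output already agree passes through the cross-fade unchanged.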
To maximize quality of the resulting signal 14, frames are not overlapped at a predefined separation distance. The actual offset is chosen, typically within a given range, to maximize a similarity measure between the two overlapped frames, ensuring optimal sound quality. For each potential overlap offset within a predefined search range, the similarity measure is calculated, and the chosen offset is the one with the highest value of the similarity measure. For example, a correlation function between the two frames may be computed by multiplying x(t) and y(t) at each offset. This technique produces a signal of high quality, i.e., one that sounds natural to a listener, and high intelligibility, i.e., one that can be understood easily by a listener. A variety of quality and intelligibility measures are known in the art, such as total harmonic distortion (THD).
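The offset search described above can be sketched as follows. The names `best_offset_by_correlation`, `nominal`, and `search` are hypothetical, and normalizing the correlation by the energy of the two overlapped regions is an assumption of this sketch (it keeps longer overlaps from scoring higher merely because they contain more samples); the text itself specifies only that corresponding samples are multiplied at each candidate offset and that the offset with the highest similarity is chosen.

```python
import math


def normalized_correlation(a, b):
    """Sum of products of corresponding samples, scaled by the two energies."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0


def best_offset_by_correlation(output, block, nominal, search):
    """Return the index in `output` where overlapping `block` is most similar.

    Candidate positions run from nominal - search to nominal + search,
    clipped to the valid range of `output`.
    """
    lo = max(0, nominal - search)
    hi = min(len(output) - 1, nominal + search)
    best_k, best_score = lo, -math.inf
    for k in range(lo, hi + 1):
        overlap = min(len(output) - k, len(block))
        score = normalized_correlation(output[k:k + overlap], block[:overlap])
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

For a periodic signal, the search tends to land on the candidate position at which the new block is in phase with the existing output, which is what allows the subsequent cross-fade to avoid audible discontinuities.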
The basic SOLA framework permits a variety of modifications in window size selection, similarity measure, computation methods, and search range for overlap offset. U.S. Pat. No. 5,479,564, issued to Vogten et al., discloses a method for selecting the window of the input signal based on a local pitch period. A speaker-dependent method known as WSOLA-SD is disclosed in U.S. Pat. No. 5,828,995, issued to Satyamurti et al. WSOLA-SD selects the frame size of the input signal based on the pitch period. A drawback of these and other pitch-dependent methods is that they can only be used with speech signals, and not with music. Furthermore, they require the additional steps of determining whether the signal is voiced or unvoiced, which can change for different portions of the signal, and for voiced signals, determining the pitch. The pitch of speech signals is often not constant, varying in multiples of a fundamental pitch period. Resulting pitch estimates require artificial smoothing to move continuously between such multiples, introducing artifacts into the final output signal.
Typically, the location within an existing output frame at which a new input frame is overlapped is selected based on the calculated similarity measure. However, some SOLA methods use the similarity measure to select overlap locations of input blocks. U.S. Pat. No. 5,175,769, issued to Hejna, Jr. et al., discloses a method for selecting the location of input blocks within a predefined range. The method of Hejna, Jr. requires fewer computational steps than does the original SOLA method. However, it introduces the possibility of skipping completely over portions of the input signal, particularly at high compression ratios (i.e., α≥2). A speech rate modification method described in U.S. Pat. Nos. 5,341,432 and 5,630,013, both issued to Suzuki et al., determines the optimal overlap of two successive input frames that are then overlapped to produce an output signal. In the traditional SOLA method, in which input frames are successively overlapped onto output frames, each output frame can be a sum of all previously overlapped frames. With the method of Suzuki et al., however, input frames are overlapped only onto each other, preventing the overlap of multiple frames. In some cases, this limited overlap may decrease the quality of the resultant signal. Thus selecting the offset within the output signal is the most reliable method, particularly at high compression ratios.
Computational cost of the method varies with the input sampling rate and compression ratios. High sampling rates are desirable because they produce higher quality output signals. In addition, high compression ratios require high processing rates of input samples. For example, CD quality audio corresponds to a 44.1 kHz sampling rate; at a compression ratio of α=4, approximately 176,000 input samples must be processed each second to generate CD quality output. In order to process signals at high input sampling rates and high compression ratios, computational efficiency of the method is essential. Calculating the similarity measure between overlapping input and output sample blocks is the most computationally demanding part of the algorithm. A correlation function, one potential similarity measure, is calculated by multiplying corresponding samples of input and output blocks for every possible offset of the two blocks. For an input frame containing N samples, N^2 multiplication operations are required. At high input sampling rates, for N on the order of 1000, performing N^2 operations for each input frame is unfeasible.
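The figures quoted above follow from simple arithmetic, reproduced here as a short Python sketch. The function names are hypothetical, and the analysis offset `s_a` passed to `multiplies_per_second` is an assumed parameter, since the text does not give a value for it.

```python
def input_samples_per_second(sampling_rate_hz, alpha):
    """Input samples consumed per second of output at time scale ratio alpha."""
    return sampling_rate_hz * alpha


def full_correlation_multiplies(n):
    """Multiplications to correlate an N-sample frame at every possible offset."""
    return n * n


def multiplies_per_second(sampling_rate_hz, alpha, n, s_a):
    """Multiplications per second if one frame is processed every s_a input samples.

    s_a is an assumption of this sketch; the text gives no value for it.
    """
    frames_per_second = input_samples_per_second(sampling_rate_hz, alpha) / s_a
    return frames_per_second * full_correlation_multiplies(n)


print(input_samples_per_second(44100, 4))   # 176400
print(full_correlation_multiplies(1000))    # 1000000
```

The first value matches the "approximately 176,000" samples per second cited for CD quality audio at α=4, and the second shows why a full correlation over every offset becomes costly when N is on the order of 1000.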
As a result, the trend in SOLA is to simplify the computation to reduce the number of operations performed. One solution is to use an absolute error metric, which requires only subtraction operations, rather than a correlation function, which requires multiplication. U.S. Pat. No. 4,864,620, issued to Bialick, discloses a method that uses an Average Magnitude Difference Function (AMDF) to select the optimal overlap. The AMDF averages the absolute value of the difference between the input and output samples for each possible offset, and selects the offset with the lowest value. U.S. Pat. No. 5,832,442, issued to Lin et al., discloses a method employing an equivalent mean absolute error in overlap. While absolute error methods are significantly less computationally demanding, they are not as reliable or as well accepted as correlation functions in locating optimal offsets. A level of accuracy is sacrificed for the sake of computational efficiency.
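An average magnitude difference search of the kind described above can be sketched as follows. This is an illustration of the general AMDF idea only and does not reproduce the method of any of the cited patents; the names `amdf`, `best_offset_by_amdf`, `nominal`, and `search` are hypothetical.

```python
def amdf(a, b):
    """Average absolute difference between corresponding samples of a and b."""
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("regions must be non-empty")
    # Only subtraction, absolute value, and addition occur per sample.
    return sum(abs(x - y) for x, y in zip(a, b)) / n


def best_offset_by_amdf(output, block, nominal, search):
    """Return the index in `output` where overlapping `block` gives the lowest AMDF.

    Candidate positions run from nominal - search to nominal + search,
    clipped to the valid range of `output`.
    """
    lo = max(0, nominal - search)
    hi = min(len(output) - 1, nominal + search)
    best_k, best_score = lo, float("inf")
    for k in range(lo, hi + 1):
        overlap = min(len(output) - k, len(block))
        score = amdf(output[k:k + overlap], block[:overlap])
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```

Unlike a correlation search, the best offset here is the one with the lowest score, and the inner loop contains no sample-by-sample multiplications.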
The overwhelming majority of existin
Dorvil Richemond
Lerner Martin
Millers David T.
SSI Corporation