Fast frequency-domain pitch estimation

Data processing: speech signal processing – linguistics – language – Speech signal processing – For storage or transmission

Reexamination Certificate


Filing date: 2000-07-14
Issue date: 2003-07-01
Examiner: Banks-Harold, Marsha D. (Department: 2654)
Classification: Data processing: speech signal processing, linguistics, language; Speech signal processing; For storage or transmission

US class: C704S204000
Document type: Reexamination Certificate
Status: active
Patent number: 06587816
: ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to methods and apparatus for processing of audio signals, and specifically to methods for estimating the pitch of a speech signal.
BACKGROUND OF THE INVENTION
Speech sounds are produced by modulating air flow in the vocal tract. Voiceless sounds originate from turbulent noise created at a constriction somewhere in the vocal tract, while voiced sounds are excited in the larynx by periodic vibrations of the vocal cords. Roughly speaking, the variable period of the laryngeal vibrations gives rise to the pitch of the speech sounds. Low-bit-rate speech coding schemes typically separate the modulation from the speech source (voiced or unvoiced), and code these two elements separately. In order to enable the speech to be properly reconstructed, it is necessary to accurately estimate the pitch of the voiced parts of the speech at the time of coding. A variety of techniques have been developed for this purpose, including both time- and frequency-domain methods. A number of these techniques are surveyed by Hess in Pitch Determination of Speech Signals (Springer-Verlag, 1983), which is incorporated herein by reference.
The Fourier transform of a periodic signal, such as voiced speech, has the form of a train of impulses, or peaks, in the frequency domain. This impulse train corresponds to the line spectrum of the signal, which can be represented as a sequence $\{(a_i, \theta_i)\}$, wherein $\theta_i$ are the frequencies of the peaks, and $a_i$ are the respective complex-valued line spectral amplitudes. To determine whether a given segment of a speech signal is voiced or unvoiced, and to calculate the pitch if the segment is voiced, the time-domain signal is first multiplied by a finite smooth window. The Fourier transform of the windowed signal is then given by:
$$X(\theta) = \sum_k a_k \, W(\theta - \theta_k) \qquad (1)$$
wherein $W(\theta)$ is the Fourier transform of the window.
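As a rough illustration of equation (1), the following sketch windows a synthetic periodic signal and locates the peaks of its transform. The Hann window, the 8 kHz sample rate, the 200 Hz test signal, and all function and variable names are assumptions chosen for the example; none of them comes from the patent text. A naive DFT is used only to keep the sketch free of external libraries.

```python
import cmath
import math

def windowed_spectrum(signal):
    """Magnitude of the DFT of a Hann-windowed signal (naive O(N^2) DFT)."""
    n = len(signal)
    # Periodic Hann window: a finite smooth window (assumed choice).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    x = [s * w for s, w in zip(signal, window)]
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]

SAMPLE_RATE = 8000   # Hz (assumed)
N = 400              # 50 ms analysis window, 20 Hz bin spacing
F0 = 200.0           # true pitch of the synthetic signal (assumed)

# Periodic test signal: three harmonics with amplitudes 1, 1/2, 1/3.
signal = [
    sum(math.cos(2 * math.pi * F0 * h * t / SAMPLE_RATE) / h for h in (1, 2, 3))
    for t in range(N)
]

mag = windowed_spectrum(signal)
threshold = 0.1 * max(mag)
# Local maxima above the threshold approximate the line frequencies theta_k.
peaks = [
    k * SAMPLE_RATE / N
    for k in range(1, len(mag) - 1)
    if mag[k] > threshold and mag[k] > mag[k - 1] and mag[k] > mag[k + 1]
]
print(peaks)
```

Each peak is a copy of the window transform centered on one harmonic, which is what the sum in equation (1) expresses.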
Given any pitch frequency, the line spectrum corresponding to that pitch frequency could contain line spectral components at all multiples of that frequency. It therefore follows that any frequency appearing in the line spectrum may be a multiple of a number of different candidate pitch frequencies. Consequently, for any peak appearing in the transformed signal, there will be a sequence of candidate pitch frequencies that could give rise to that particular peak, wherein each of the candidate frequencies is obtained by dividing the frequency of the peak by an integer. This ambiguity is present whether the spectrum is analyzed in the frequency domain, or whether it is transformed back to the time domain for further analysis.
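The ambiguity described above can be made concrete with a short sketch that lists the candidate pitch frequencies for a single spectral peak. The function name and the 50 to 500 Hz search range are assumptions made for illustration only.

```python
def candidate_pitches(peak_hz, min_pitch_hz=50.0, max_pitch_hz=500.0):
    """Return peak_hz / n for every integer n >= 1 that lands in the search range."""
    candidates = []
    n = 1
    while peak_hz / n >= min_pitch_hz:
        if peak_hz / n <= max_pitch_hz:
            candidates.append(peak_hz / n)
        n += 1
    return candidates

# A single peak at 600 Hz is compatible with many different pitches.
print(candidate_pitches(600.0))
```

With these assumed limits, one peak at 600 Hz already yields eleven candidates (300 Hz, 200 Hz, 150 Hz, and so on down to 50 Hz), which is why a single peak cannot determine the pitch.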
Frequency-domain pitch estimation is typically based on analyzing the locations and amplitudes of the peaks in the transformed signal $X(\theta)$. For example, a method based on correlating the spectrum with the “teeth” of a prototypical spectral comb is described by Martin in an article entitled “Comparison of Pitch Detection by Cepstrum and Spectral Comb Analysis,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 180-183 (1982), which is incorporated herein by reference. The pitch frequency is given by the comb frequency that maximizes the correlation of the comb function with the transformed speech signal.
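A minimal sketch of the comb idea follows, operating on a line spectrum given as (amplitude, frequency) pairs. It is not a reproduction of Martin's method: the triangular tooth shape, the 10 Hz tooth width, the 1/sqrt(h) tooth weighting (used here to penalize sub-harmonic candidates), and all names are assumptions made for illustration.

```python
def comb_score(line_spectrum, f0_hz, tol_hz=10.0):
    """Correlate a line spectrum with a comb whose teeth sit at multiples of f0_hz."""
    top = max(freq for _, freq in line_spectrum)
    score = 0.0
    h = 1
    while h * f0_hz <= top + tol_hz:
        tooth = h * f0_hz
        for amp, freq in line_spectrum:
            # Triangular tooth: 1 at the tooth center, 0 beyond tol_hz away.
            closeness = max(0.0, 1.0 - abs(freq - tooth) / tol_hz)
            # Higher teeth count less, so sub-harmonics of the pitch score lower.
            score += amp * closeness / h ** 0.5
        h += 1
    return score

def estimate_pitch(line_spectrum, candidates_hz):
    """Pick the candidate comb frequency with the highest score."""
    return max(candidates_hz, key=lambda f0: comb_score(line_spectrum, f0))

# Hypothetical line spectrum of a 200 Hz voiced sound.
spectrum = [(1.0, 200.0), (0.5, 400.0), (0.33, 600.0)]
pitch = estimate_pitch(spectrum, [float(f) for f in range(50, 501)])
print(pitch)
```

Without the decaying tooth weights, a comb at 100 Hz would match every peak just as well as the comb at 200 Hz, which is the ambiguity discussed earlier.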
A related class of schemes for pitch estimation is that of “cepstral” schemes, as described, for example, on pages 396-408 of the above-mentioned book by Hess. In this technique, a log operation is applied to the frequency spectrum of the speech signal, and the log spectrum is then transformed back to the time domain to generate the cepstral signal. The pitch period is given by the location of the first peak of the time-domain cepstral signal. This corresponds precisely to maximizing, over the period T, the correlation of the log of the amplitudes corresponding to the line frequencies z(i) with $\cos(\omega(i)T)$. For each guess of the pitch period T, the function $\cos(\omega T)$ is a periodic function of $\omega$. It has peaks at frequencies corresponding to multiples of the pitch frequency 1/T. If those peaks happen to coincide with the line frequencies, then 1/T is a good candidate to be the pitch frequency, or some multiple thereof.
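The cosine-correlation view of the cepstrum can be sketched directly on a line spectrum, without computing a full cepstrum. Everything here is an illustrative assumption rather than the procedure from Hess: log-amplitudes are measured relative to a hypothetical noise floor so that the weights are positive, periods are searched on an 8 kHz sample grid, and the "first peak" is approximated by the shortest period whose score is within 1% of the best score.

```python
import math

def cosine_correlation(line_spectrum, period_s, floor=1e-3):
    """Sum of log-amplitudes (relative to an assumed floor) times cos(2*pi*f*T)."""
    return sum(
        math.log(amp / floor) * math.cos(2 * math.pi * freq * period_s)
        for amp, freq in line_spectrum
    )

def first_peak_period(line_spectrum, periods_s, rel_tol=0.01):
    """Shortest candidate period whose score is within rel_tol of the best score."""
    scores = [cosine_correlation(line_spectrum, t) for t in periods_s]
    best = max(scores)
    for t, s in zip(periods_s, scores):
        if s >= (1.0 - rel_tol) * best:
            return t

SAMPLE_RATE = 8000  # Hz (assumed)
# Hypothetical line spectrum of a 200 Hz voiced sound: (amplitude, frequency).
lines = [(1.0, 200.0), (0.5, 400.0), (0.33, 600.0), (0.25, 800.0)]
# Candidate periods from 2 ms to 20 ms (pitch between 500 Hz and 50 Hz).
periods = [k / SAMPLE_RATE for k in range(16, 161)]

period = first_peak_period(lines, periods)
pitch_hz = 1.0 / period
print(period, pitch_hz)
```

Periods of 10 ms, 15 ms and 20 ms score essentially the same as 5 ms in this example, since every cosine also peaks at multiples of the true period; taking the shortest such period resolves the tie.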
In another vein, common methods for time-domain pitch estimation use correlation-type schemes, which search for a pitch period T that maximizes the cross-correlation of a signal segment centered at time t and one centered at time t-T. The pitch frequency is the inverse of T. A method of this sort is described, for example, by Medan et al., in “Super Resolution Pitch Determination of Speech Signals,” published in IEEE Transactions on Signal Processing 39(1), pages 41-48 (1991), which is incorporated herein by reference.
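A plain integer-lag version of such a correlation search might look like the sketch below. It is not the super-resolution method of Medan et al.; the normalization, the segment width, the lag range, the 8 kHz sample rate, and all names are assumptions made for illustration.

```python
import math

def normalized_xcorr(signal, t, lag, half_width):
    """Normalized cross-correlation of segments centered at t and at t - lag."""
    a = signal[t - half_width : t + half_width]
    b = signal[t - lag - half_width : t - lag + half_width]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0

def estimate_period(signal, t, min_lag, max_lag, half_width):
    """Integer lag (in samples) that maximizes the normalized cross-correlation."""
    return max(
        range(min_lag, max_lag + 1),
        key=lambda lag: normalized_xcorr(signal, t, lag, half_width),
    )

SAMPLE_RATE = 8000  # Hz (assumed)
F0 = 200.0          # true pitch of the synthetic signal (assumed)
signal = [
    sum(math.cos(2 * math.pi * F0 * h * n / SAMPLE_RATE) / h for h in (1, 2, 3))
    for n in range(400)
]

# Search lags of 16..60 samples (pitch between 500 Hz and about 133 Hz).
period_samples = estimate_period(signal, t=300, min_lag=16, max_lag=60, half_width=80)
pitch_hz = SAMPLE_RATE / period_samples
print(period_samples, pitch_hz)
```

The lag range is deliberately kept below two pitch periods here; a wider range would also admit a lag of 80 samples, which correlates just as well as the true period of 40 samples.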
Both time- and frequency-domain methods of pitch determination are subject to instability and error, and accurate pitch determination is therefore computationally intensive. In time-domain analysis, for example, a high-frequency component in the line spectrum results in the addition of an oscillatory term in the cross-correlation. This term varies rapidly with the estimated pitch period T when the frequency of the component is high. In such a case, even a slight deviation of T from the true pitch period will reduce the value of the cross-correlation substantially and may lead to rejection of a correct estimate. A high-frequency component will also add a large number of peaks to the cross-correlation, which complicates the search for the true maximum. In the frequency domain, a small error in the estimation of a candidate pitch frequency will result in a major deviation in the estimated value of any spectral component that is a large integer multiple of the candidate frequency.
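Both sensitivities can be written out in a line or two. The symbols below ($A$, $\omega$, $f_0$, $\delta$, $n$) are introduced only for this illustration, and the first expression assumes a time-averaged correlation of a single sinusoidal component.

```latex
% Time domain: a component A\cos(\omega t + \phi) contributes the term
r_\omega(T) = \tfrac{A^2}{2}\cos(\omega T),
\qquad
\left| \frac{d r_\omega}{dT} \right| \le \tfrac{A^2}{2}\,\omega ,
% so the higher \omega is, the faster the term can change with T.

% Frequency domain: a candidate pitch off by \delta misplaces
% its n-th harmonic by n times as much:
n\,(f_0 + \delta) - n\,f_0 = n\,\delta .
```

For example, a candidate that is off by only 2 Hz predicts its 15th harmonic 30 Hz away from the actual spectral line.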
An exhaustive search, with high resolution, must therefore be made over all possible candidates and their multiples in order to avoid missing the best candidate pitch for a given input spectrum. It is often necessary (depending on the actual pitch frequency) to search the sampled spectrum up to high frequencies, above 1500 Hz. At the same time, the analysis interval, or window, must be long enough in time to capture at least several cycles of every conceivable pitch candidate in the spectrum, resulting in an additional increase in complexity. Analogously, in the time domain, the optimal pitch period T must be searched for over a wide range of times and with high resolution. The search in either case consumes substantial computing resources. The search criteria cannot be relaxed even during intervals that may be unvoiced, since an interval can be judged unvoiced only after all candidate pitch frequencies or periods have been ruled out. Although pitch values from previous frames are commonly used in guiding the search for the current value, the search cannot be limited to the neighborhood of the previous pitch. Otherwise, errors in one interval will be perpetuated in subsequent intervals, and voiced segments may be mistaken for unvoiced.
Various solutions have been proposed for improving the accuracy and efficiency of pitch determination. For example, McAulay et al. describe a method for tracking the line frequencies of speech signals and for reproducing the signal from these frequencies in U.S. Pat. No. 4,885,790 and in an article entitled “Speech Analysis/Synthesis Based on a Sinusoidal Representation,” in IEEE Transactions on Acoustics, Speech and Signal Processing ASSP-34(4), pages 744-754 (1986). These documents are incorporated herein by reference. The authors use a sinusoidal model for the speech waveform to analyze and synthesize speech based on the amplitudes, frequencies and phases of the component sine waves in the speech signal. Any number of methods may be used to obtain the pitch values from the line frequencies. In U.S. Pat. No. 5,054,072, whose disclosure is also incorporated herein by reference, McAulay et al. describe refinements of their method. In one of these refinements, a pitch-adaptive channel encoding technique varies the channel spacing

Affiliated with

Chazan Dan (Inventor)
Hoory Ron (Inventor)
Zibulski Meir (Inventor)

Also associated with

Banks-Harold Marsha D. (Examiner)
Darby & Darby (Law Firm)
International Business Machines - Corporation (Corporate Assignee)
Storm Donald L. (Examiner)
