Fractional pitch method

Data processing: speech signal processing – linguistics – language – Speech signal processing – For storage or transmission

Reexamination Certificate

Details

C704S216000

active

06463406

ABSTRACT:

BACKGROUND OF THE INVENTION
The invention relates to electronic devices, and, more particularly, to speech coding, transmission, storage, and synthesis circuitry and methods.
Human speech consists of a stream of acoustic signals with frequencies ranging up to roughly 20 KHz; however, the band of about 100 Hz to 5 KHz contains the bulk of the acoustic energy. Telephone transmission of human speech originally consisted of conversion of the analog acoustic signal stream into an analog voltage signal stream (e.g., by a microphone) for transmission and reconversion to an acoustic signal stream (e.g., by a loudspeaker). The electrical signals would be bandpass filtered to retain only the 300 Hz to 4 KHz band to limit bandwidth and avoid low frequency problems. However, the advantages of digital electrical signal transmission have inspired a conversion to digital telephone transmission beginning in the 1960s. Typically, digital telephone signals derive from sampling analog signals at 8 KHz and nonlinearly quantizing the samples with 8-bit codes according to the μ-law (pulse code modulation, or PCM). A clocked digital-to-analog converter and companding amplifier reconstruct an analog electric signal stream from the stream of 8-bit samples. Such signals require transmission rates of 64 Kbps (kilobits per second) and this exceeds the former analog signal transmission bandwidth.
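The nonlinear quantization mentioned above can be sketched with the continuous μ-law curve. This is an illustration only: it assumes μ = 255 and a uniform quantizer on the compressed value, and it is not the bit-exact segmented encoder used in telephone equipment. The function names are hypothetical.

```python
import math

MU = 255.0  # assumed mu value for 8-bit companding

def mu_compress(x):
    """Map x in [-1, 1] to [-1, 1] with the continuous mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def encode(x, bits=8):
    """Compress, then quantize uniformly to a signed integer code."""
    levels = 2 ** (bits - 1) - 1  # 127 positive levels for 8 bits
    return round(mu_compress(x) * levels)

def decode(code, bits=8):
    """Map an integer code back to an amplitude in [-1, 1]."""
    levels = 2 ** (bits - 1) - 1
    return mu_expand(code / levels)
```

Because the curve is logarithmic, small amplitudes receive finer quantization steps than large ones, which is the point of companding.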
The storage of speech information in analog format (for example, on magnetic tape in a telephone answering machine) can likewise be replaced with digital storage. However, the memory demands can become overwhelming: 10 minutes of 8-bit PCM sampled at 8 KHz would require about 5 MB (megabytes) of storage.
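The figures quoted for PCM can be checked with a few lines of arithmetic. The variable names are illustrative.

```python
SAMPLE_RATE_HZ = 8000   # sampling rate stated in the text
BITS_PER_SAMPLE = 8     # 8-bit PCM codes

# Transmission rate: samples per second times bits per sample.
bit_rate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE

# Storage for 10 minutes of speech, in bytes.
seconds = 10 * 60
storage_bytes = bit_rate_bps * seconds // 8

print(bit_rate_bps, storage_bytes)
```

The result is 64,000 bits per second and 4,800,000 bytes, consistent with the 64 Kbps rate and the "about 5 MB" estimate.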
The demand for lower transmission rates and storage requirements has led to development of compression for speech signals. One approach to speech compression models the physiological generation of speech and thereby reduces the necessary information to be transmitted or stored. In particular, the linear speech production model presumes excitation of a variable filter (which roughly represents the vocal tract) by either a pulse train with pitch period P (for voiced sounds) or white noise (for unvoiced sounds) followed by amplification to adjust the loudness. 1/A(z) traditionally denotes the z transform of the filter's transfer function. The model produces a stream of sounds simply by periodically making a voiced/unvoiced decision plus adjusting the filter coefficients and the gain. Generally, see Markel and Gray, Linear Prediction of Speech (Springer-Verlag 1976).
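A minimal sketch of the linear speech production model follows, assuming a unit pulse train for voiced excitation, Gaussian noise for unvoiced excitation, and a coefficient list a with a[0] = 1 describing A(z). The function names and the seed parameter are hypothetical.

```python
import random

def excitation(n_samples, voiced, pitch_period=None, seed=0):
    """Pulse train with the given pitch period (voiced) or white noise (unvoiced)."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def synthesize(e, a, gain=1.0):
    """All-pole filter 1/A(z): s(n) = gain*e(n) - sum over j >= 1 of a[j]*s(n-j)."""
    s = []
    for n, e_n in enumerate(e):
        acc = gain * e_n
        for j in range(1, len(a)):
            if n - j >= 0:  # samples before the start are taken as zero
                acc -= a[j] * s[n - j]
        s.append(acc)
    return s
```

For example, `synthesize(excitation(4, True, pitch_period=4), [1.0, -0.5])` yields the decaying response 1, 0.5, 0.25, 0.125 of a single-pole filter.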
FIG. 1 illustrates the model, and FIGS. 2a-3b illustrate sounds. In particular, FIG. 2a shows the waveform for the voiced sound /ae/ and FIG. 2b its Fourier transform; and FIG. 3a shows the unvoiced sound /sh/ and FIG. 3b its Fourier transform.
The filter coefficients may be derived as follows. First, let s′(t) be the analog speech waveform as a function of time, and e′(t) be the analog speech excitation (pulse train or white noise). Take the sampling frequency f_s to have period T (so f_s=1/T), and set s(n)=s′(nT) (so . . . s(n−1), s(n), s(n+1), . . . is the stream of speech samples), and set e(n)=e′(nT) (so . . . e(n−1), e(n), e(n+1), . . . are the samples of the excitation). Then taking z transforms yields S(z)=E(z)/A(z) or, equivalently, E(z)=A(z)S(z) where 1/A(z) is the z transform of the transfer function of the filter. A(z) is an all-zero filter and 1/A(z) is an all-pole filter. Deriving the excitation, gain, and filter coefficients from speech samples is an analysis or coding of the samples, and reconstructing the speech from the excitation, gain, and filter coefficients is a decoding or synthesis of speech. The peaks in 1/A(z) correspond to resonances of the vocal tract and are termed “formants”.
FIG. 4 heuristically shows the relations between voiced speech and voiced excitation with a particular filter A(z).
With A(z) taken as a finite impulse response filter of order M, the equation E(z)=A(z)S(z) in the time domain becomes, with a(0)=1 for normalization:
$$e(n) = \sum_{0 \le j \le M} a(j)\, s(n-j) = s(n) + \sum_{1 \le j \le M} a(j)\, s(n-j)$$
Thus by deeming e(n) a “linear prediction error” between the actual sample s(n) and the “linear prediction” $-\sum_{1 \le j \le M} a(j)\, s(n-j)$, the filter coefficients a(j) can be determined from a set of samples s(n) by minimizing the prediction “error” $\sum_n e(n)^2$.
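One common way to carry out this minimization is the autocorrelation method solved by the Levinson-Durbin recursion. The text does not say which solver is intended, so the sketch below is an assumption. It uses the convention a(0)=1 and treats samples outside the frame as zero. The function names are hypothetical.

```python
def autocorrelation(s, max_lag):
    """r[k] = sum over n of s(n)*s(n-k) within the frame, for k = 0..max_lag."""
    return [sum(s[n] * s[n - k] for n in range(k, len(s)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (a, err): a[0..order] with a[0] = 1 minimizing the sum of e(n)^2,
    and the remaining error energy. Assumes r[0] > 0 (a non-silent frame)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err

def prediction_error(s, a):
    """e(n) = sum over 0 <= j <= M of a[j]*s(n-j), with s(n) = 0 for n < 0."""
    return [sum(a[j] * s[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(s))]
```

As a check, feeding in the impulse response of a single-pole filter, s(n) = 0.5^n, recovers a(1) close to -0.5, and the prediction error collapses to a single pulse at n = 0.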
A stream of speech samples s(n) may be partitioned into “frames” of 180 successive samples (22.5 msec intervals), and the samples in a frame provide the data for computing the filter coefficients for use in coding and synthesis of the sound associated with the frame. Typically, M is taken as 10 or 12. Encoding a frame requires bits for the LPC coefficients, the pitch, the voiced/unvoiced decision, and the gain, and so the transmission rate may be only 2.4 Kbps rather than the 64 Kbps of PCM. In practice, the filter coefficients must be quantized for transmission, and the sensitivity of the filter behavior to the quantization error has led to quantization based on the Line Spectrum Pair representation.
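The frame figures imply a fixed bit budget per frame, which the following arithmetic makes explicit. It assumes the 8 KHz sampling rate used earlier for PCM, which is consistent with 180 samples spanning 22.5 msec. The variable names are illustrative.

```python
SAMPLE_RATE_HZ = 8000      # assumed, consistent with 180 samples = 22.5 msec
FRAME_SAMPLES = 180
LPC_BIT_RATE_BPS = 2400    # 2.4 Kbps
PCM_BITS_PER_SAMPLE = 8

frame_seconds = FRAME_SAMPLES / SAMPLE_RATE_HZ

# Bits available to encode one frame at 2.4 Kbps.
bits_per_frame = LPC_BIT_RATE_BPS * FRAME_SAMPLES // SAMPLE_RATE_HZ

# Bits the same frame occupies as 8-bit PCM.
pcm_bits_per_frame = FRAME_SAMPLES * PCM_BITS_PER_SAMPLE

print(frame_seconds, bits_per_frame, pcm_bits_per_frame)
```

At 2.4 Kbps each frame has 54 bits for the coefficients, pitch, voicing decision, and gain together, against 1440 bits for the same frame in PCM.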
The pitch period P determination presents a difficult problem because 2P, 3P, . . . are also periods and the sampling quantization and the formants can distort magnitudes. In fact, W. Hess, Pitch Determination of Speech Signals (Springer, 1983) presents many different methods for pitch determination. For example, the pitch period estimation for a frame may be found by searching for maximum correlations of translates of the speech signal. Indeed, Medan et al., Super Resolution Pitch Determination of Speech Signals, 39 IEEE Tr.Sig.Proc. 40 (1991) describe a pitch period determination which first looks at correlations of two adjacent segments of speech with variable segment lengths and determines an integer pitch as the segment length which yields the maximum correlation. Then linear interpolation of correlations about the maximum correlation gives a pitch period which may be a nonintegral multiple of the sampling period.
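The integer stage of such a search can be sketched as below. This loosely follows the adjacent-segment description and is not a reproduction of the cited method; the interpolation step is omitted. The lag range of 20 to 80 samples (400 Hz down to 100 Hz at an assumed 8 KHz sampling rate), the tolerance used to prefer P over 2P, and the function names are all assumptions. The input must hold at least 2*k_max samples.

```python
import math

def normalized_correlation(s, k):
    """Correlation of the adjacent segments s[0:k] and s[k:2k], scaled to [-1, 1]."""
    x, y = s[:k], s[k:2 * k]
    xy = sum(p * q for p, q in zip(x, y))
    xx = sum(p * p for p in x)
    yy = sum(q * q for q in y)
    return xy / math.sqrt(xx * yy) if xx > 0 and yy > 0 else 0.0

def integer_pitch(s, k_min=20, k_max=80, tolerance=1e-3):
    """Return (k, c(k)) for the smallest segment length k whose correlation
    lies within `tolerance` of the maximum over k_min..k_max."""
    scores = {k: normalized_correlation(s, k) for k in range(k_min, k_max + 1)}
    best = max(scores.values())
    k = min(k for k, c in scores.items() if c >= best - tolerance)
    return k, scores[k]
```

Choosing the smallest near-maximal k is one simple guard against reporting a multiple of the true period, since 2P correlates as well as P for a periodic signal.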
The voiced/unvoiced decision for a frame may be made by comparing the maximum correlation c(k) found in the pitch search with a threshold value: if the maximum c(k) is too low, then the frame will be unvoiced, otherwise the frame is voiced and uses the pitch period found.
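The decision rule amounts to a single comparison. In this sketch the threshold of 0.5 is a placeholder rather than a value given in the text, and the function name is hypothetical.

```python
def classify_frame(max_correlation, pitch_period, threshold=0.5):
    """Return (voiced, pitch). The default threshold is a placeholder value."""
    if max_correlation < threshold:
        return False, None        # unvoiced: no pitch period is used
    return True, pitch_period     # voiced: keep the pitch found in the search
```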
The overall loudness of a frame may be estimated simply as the root-mean-square of the frame samples taking into account the gain of the LPC filtering. This provides the gain to apply in the synthesis.
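One way to read "taking into account the gain of the LPC filtering" is to scale the synthesized frame so that its RMS level matches that of the original. The sketch below adopts that reading as an assumption; the function names are hypothetical.

```python
import math

def rms(frame):
    """Root-mean-square level of a non-empty frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def synthesis_gain(speech_frame, unit_gain_output):
    """Scale factor that brings the unit-gain synthesized frame to the
    RMS level of the original speech frame."""
    ref = rms(unit_gain_output)
    return rms(speech_frame) / ref if ref > 0 else 0.0
```

Here `unit_gain_output` stands for the frame produced by driving the LPC filter with the excitation at gain 1.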
To reduce the bit rate, the coefficients for successive frames may be interpolated.
However, to improve the sound quality, further information may be extracted from the speech, compressed and transmitted or stored. For example, the codebook excitation linear prediction (CELP) method first analyzes a speech frame to find A(z) and filters the speech; next, a pitch period determination is made and a comb filter removes this periodicity to yield a noise-looking excitation signal. Then the excitation signals are encoded in a codebook. Thus CELP transmits the LPC filter coefficients, the pitch, and the codebook index of the excitation.
Another approach is to mix voiced and unvoiced excitations for the LPC filter. For example, McCree, A New LPC Vocoder Model for Low Bit Rate Speech Coding, PhD thesis, Georgia Institute of Technology, August 1992, divides the excitation frequency range into bands, makes the voiced/unvoiced mixture decision in each band separately, and combines the results for the total excitation. The pitch determination proceeds as follows. First, lowpass filter (cutoff at about 1200 Hz) the speech because the pitch frequency should fall in the range of 100 Hz to 400 Hz. Next, filter with A(z) in order to remove the formant structure and, hopefully, yield e(n). Then compute a normalized correlation for each translate k:
c
(
k
