Voice conversion system and methodology

Category: Data processing: speech signal processing, linguistics, language > Speech signal processing > Application


Details

U.S. classifications: C704S272000, C704S217000, C704S223000
Type: Reexamination Certificate (active)
Patent number: 06615174

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to voice conversion and, more particularly, to codebook-based voice conversion systems and methodologies.
BACKGROUND OF THE INVENTION
A voice conversion system receives speech from one speaker and transforms the speech to sound like the speech of another speaker. Voice conversion is useful in a variety of applications. For example, a voice recognition system may be trained to recognize a specific person's voice or a normalized composite of voices. Voice conversion as a front end to the voice recognition system allows a new person to use the system effectively by converting the new person's voice into the voice that the voice recognition system is adapted to recognize. As a post-processing step, voice conversion can change the voice of a text-to-speech synthesizer. Voice conversion also has applications in voice disguising, dialect modification, foreign-language dubbing that retains the voice of the original actor, and novelty systems such as celebrity voice impersonation, for example in karaoke machines.
In order to convert speech from a “source” voice to a “target” voice, codebooks of the source voice and target voice are typically prepared in a training phase. A codebook is a collection of “phones,” which are units of speech sounds that a person utters. For example, the spoken English word “cat” in the General American dialect comprises three phones [K], [AE], and [T], and the word “cot” comprises three phones [K], [AA], and [T]. In this example, “cat” and “cot” share the initial and final consonants but employ different vowels. Codebooks are structured to provide a one-to-one mapping between the phone entries in a source codebook and the phone entries in the target codebook.
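To make the codebook structure concrete, the following is a minimal illustrative sketch (in Python, not part of the patent text) of paired source and target codebooks: one feature-vector entry per phone, with corresponding entries at the same index to give the one-to-one mapping. The phone labels and feature values are placeholders, and the `CodebookEntry` type is hypothetical.

```python
# Illustrative sketch of paired source/target codebooks (placeholder values).
from dataclasses import dataclass
import numpy as np

@dataclass
class CodebookEntry:
    phone: str            # phone label, e.g. "K", "AE", "T"
    features: np.ndarray  # spectral feature vector for this phone (dummy values below)

source_codebook = [
    CodebookEntry("K",  np.array([0.12, 0.25, 0.47, 0.71])),
    CodebookEntry("AE", np.array([0.09, 0.31, 0.52, 0.68])),
    CodebookEntry("T",  np.array([0.15, 0.28, 0.44, 0.73])),
]
target_codebook = [
    CodebookEntry("K",  np.array([0.11, 0.27, 0.45, 0.70])),
    CodebookEntry("AE", np.array([0.10, 0.29, 0.55, 0.66])),
    CodebookEntry("T",  np.array([0.14, 0.30, 0.42, 0.75])),
]

# The one-to-one mapping: source entry i corresponds to target entry i.
assert all(s.phone == t.phone for s, t in zip(source_codebook, target_codebook))
```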
U.S. Pat. No. 5,327,521 describes a conventional voice conversion system using a codebook approach. An input signal from a source speaker is sampled and preprocessed by segmentation into “frames,” each corresponding to a speech unit. Each frame is matched to the “closest” source codebook entry and then mapped to the corresponding target codebook entry to obtain a phone in the voice of the target speaker. The mapped frames are concatenated to produce speech in the target voice. A disadvantage of this and similar conventional voice conversion systems is the introduction of artifacts at frame boundaries, leading to rather rough transitions across target frames. Furthermore, the variation between the sound of the input speech frame and the closest matching source codebook entry is discarded, resulting in a low-quality voice conversion.
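The frame-by-frame hard mapping described above can be sketched as follows. This is an illustrative reconstruction, assuming the codebook entries are stored as rows of feature-vector arrays and that Euclidean distance measures “closeness”; neither assumption comes from the patent text.

```python
import numpy as np

def convert_hard_mapping(frames, source_entries, target_entries):
    """Conventional 'hard' codebook mapping: each input frame is replaced by
    the target entry paired with its single closest source entry."""
    converted = []
    for frame in frames:
        dists = np.linalg.norm(source_entries - frame, axis=1)  # distance to every source entry
        best = int(np.argmin(dists))                            # index of the closest source entry
        converted.append(target_entries[best])                  # one-to-one mapped target entry
    # The difference between each frame and its closest source entry is simply
    # discarded here, which is the quality loss described above.
    return np.stack(converted)

# Toy usage with random feature vectors (illustrative only).
rng = np.random.default_rng(0)
src, tgt = rng.random((8, 4)), rng.random((8, 4))
frames = rng.random((3, 4))
print(convert_hard_mapping(frames, src, tgt).shape)  # (3, 4)
```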
A common cause for the variation between the sounds in speech and in the codebook is that sounds differ depending on their position in a word. For example, the /t/ phoneme has several “allophones.” At the beginning of a word, as in the General American pronunciation of the word “top,” the /t/ phoneme is an unvoiced, fortis, aspirated, alveolar stop. In an initial cluster with an /s/, as in the word “stop,” it is an unvoiced, fortis, unaspirated, alveolar stop. In the middle of a word between vowels, as in “potter,” it is an alveolar flap. At the end of a word, as in “pot,” it is an unvoiced, lenis, unaspirated, alveolar stop. Although the allophones of a consonant like /t/ are pronounced differently, a codebook with only one entry for the /t/ phoneme will produce only one kind of /t/ sound and, hence, unconvincing output. Prosody also accounts for differences in sound, since a consonant or vowel will sound somewhat different when spoken at a higher or lower pitch, more or less rapidly, and with greater or lesser emphasis.
Accordingly, one conventional attempt to improve voice conversion quality is to greatly increase the amount of training data and the number of codebook entries to account for the different allophones of the same phoneme and for different prosodic conditions. Larger codebooks, however, lead to increased storage and computational costs. Conventional voice conversion systems also suffer a loss of quality because they typically perform their codebook mapping in an acoustic space defined by linear predictive coding coefficients. Linear predictive coding is an all-pole model of speech and, hence, does not adequately represent the zeroes in a speech signal, which are more commonly found in nasal sounds and in sounds not originating at the glottis. Linear predictive coding also has difficulty with higher-pitched sounds, for example, women's voices and children's voices.
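For background, the sketch below shows the standard autocorrelation-method LPC analysis (Levinson-Durbin recursion) that the paragraph refers to. It is generic signal-processing background rather than the patent's method, and it makes the limitation concrete: the fitted model has poles only and no zeroes, so spectral nulls such as those in nasalized sounds are represented poorly.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.
    The result is an all-pole predictor polynomial: it can place spectral
    peaks (poles) but has no zeroes to model spectral nulls."""
    n = len(frame)
    # Biased autocorrelation r[0..order].
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])  # correlate current predictor with r
        k = -acc / error                            # reflection coefficient
        new_a = a.copy()
        new_a[i] = k
        new_a[1:i] = a[1:i] + k * a[i - 1:0:-1]     # update the predictor polynomial
        a = new_a
        error *= (1.0 - k * k)                      # remaining prediction-error energy
    return a, error

# Example: 10th-order all-pole fit to one frame of synthetic noise.
rng = np.random.default_rng(0)
coeffs, residual = lpc_coefficients(rng.standard_normal(400), order=10)
```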
SUMMARY OF THE INVENTION
There exists a need for a voice conversion system and methodology that produces higher-quality output while remaining computationally tractable. Differences in sound due to word position and prosody need to be addressed without increasing the size of the codebooks. Furthermore, there is a need to account for voice features that are not well supported by linear predictive coding, such as the glottal excitation, nasalized sounds, and sounds not originating at the glottis.
Accordingly, one aspect of the invention is a method and a computer-readable medium bearing instructions for transforming a source signal representing a source voice into a target signal representing a target voice. The source signal is preprocessed to produce a source signal segment, which is compared with source codebook entries to produce corresponding weights. The source signal segment is transformed into a target signal segment based on the weights and the corresponding target codebook entries, and the target signal segment is post-processed to generate the target signal. By computing a weighted average, a composite source voice can be mapped to a corresponding composite target voice, thereby reducing artifacts at frame boundaries and producing smoother transitions across frames without having to employ a large number of codebook entries.
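A minimal sketch of this weighted (soft) mapping follows, assuming Euclidean distances and normalized inverse-distance weights; the excerpt does not give the exact weight computation, so the formula shown is illustrative only.

```python
import numpy as np

def convert_weighted_mapping(frame, source_entries, target_entries, eps=1e-8):
    """Soft codebook mapping sketch: compare the frame with every source entry
    to obtain weights, then form the weighted average of the paired target
    entries (the inverse-distance weighting is an assumed placeholder)."""
    dists = np.linalg.norm(source_entries - frame, axis=1)
    weights = 1.0 / (dists + eps)   # closer source entries receive larger weights
    weights /= weights.sum()        # normalize the weights to sum to one
    converted = weights @ target_entries  # composite target-voice segment
    return converted, weights

# Toy usage (illustrative): convert one 4-dimensional segment with an 8-entry codebook.
rng = np.random.default_rng(0)
src, tgt = rng.random((8, 4)), rng.random((8, 4))
segment, w = convert_weighted_mapping(rng.random(4), src, tgt)
```

Unlike the hard mapping sketched earlier, every target entry contributes in proportion to its weight, so segments that fall between codebook entries change gradually instead of jumping from one entry to another.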
In another aspect of the invention, the source signal segment is compared with the source codebook entries as line spectral frequencies to facilitate the computation of the weighted average. In still another aspect of the invention, the weights are refined by a gradient descent analysis to further improve voice quality. In a further aspect of the invention, both vocal tract characteristics and excitation characteristics are transformed according to the weights, thereby handling excitation characteristics in a computationally tractable manner.
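The gradient-descent refinement of the weights might look like the following sketch. The reconstruction-error objective, learning rate, and projection steps are assumptions made for illustration and are not taken from the patent text.

```python
import numpy as np

def refine_weights(frame, source_entries, weights, lr=0.01, steps=200):
    """Gradient-descent refinement sketch (assumed objective): adjust the
    weights so that the weighted combination of source entries reconstructs
    the input frame more closely, keeping the weights nonnegative and
    normalized; the refined weights are then applied to the target entries."""
    w = weights.copy()
    for _ in range(steps):
        recon = w @ source_entries                      # current weighted reconstruction
        grad = 2.0 * source_entries @ (recon - frame)   # gradient of ||recon - frame||^2
        w = np.clip(w - lr * grad, 0.0, None)           # descend, keep weights nonnegative
        w /= max(w.sum(), 1e-12)                        # renormalize to sum to one
    return w

# Toy usage: start from uniform weights and refine against a random frame.
rng = np.random.default_rng(1)
src = rng.random((8, 4))
w_refined = refine_weights(rng.random(4), src, np.full(8, 1.0 / 8))
```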
Additional needs, objects, advantages, and novel features of the present invention will be set forth in part in the description that follows, and in part, will become apparent upon examination or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims.


REFERENCES:
patent: 5113449 (1992-05-01), Blanton et al.
patent: 5327521 (1994-07-01), Savic et al.
patent: 5704006 (1997-12-01), Iwahashi
patent: 6161091 (2000-12-01), Akamine et al.
