Data processing: speech signal processing – linguistics – language – Speech signal processing – Synthesis
Reexamination Certificate
2000-10-20
2004-12-28
Dorvil, Richemond (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Synthesis
C704S268000, C704S270000, C084S609000, C084S610000
Reexamination Certificate
active
06836761
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a voice converter for assimilating a user voice to be processed to a different target voice, a voice converting method, and a voice conversion dictionary generating method for generating a voice conversion dictionary corresponding to the target voice used for the voice conversion, and more particularly to a voice converter, a voice converting method, and a voice conversion dictionary generating method preferred to be used for a karaoke apparatus.
In addition, the present invention relates to a voice processing apparatus for associating in time series a target voice with an input voice for temporal alignment, and to a karaoke apparatus having the voice processing apparatus.
2. Related Background Art
There have been developed various kinds of voice converters which change frequency characteristics of an input voice before an output. For example, there are karaoke apparatuses that convert a pitch of a singing voice of a karaoke player so as to convert a male voice to a female voice or vice versa (for example, Japanese PCT Publication No. 8-508581).
In the conventional voice converters, however, the voice conversion is limited to a conversion in only a voice quality though a voice is converted (for example, a male voice to a female voice, a female voice to a male voice, etc.) and therefore they are not capable of converting a voice to another in imitation of a voice of a specific singer (for example, a professional singer).
Furthermore, a karaoke apparatus would be very entertaining if it had something like an imitative function of assimilating not only a voice quality but also a way of singing to that of the professional singer. In the conventional voice converters, however, this kind of processing is impossible.
Accordingly, the inventors suggest a voice converter for a conversion in imitation of a voice of a singer to be targeted (a target singer) by analyzing the target singer's voice so as to assimilate a voice quality of the user to the target singer's voice, retaining achieved analysis data including a sinusoidal component attribute pitch, an amplitude, a spectrum shape, and residual components as target frame data for all frames of a music piece, and performing a conversion in synchronization with the input frame data obtained by analyzing the input voice (Refer to Japanese Patent Application No. 10-183338).
While the above voice converter is capable of assimilating not only a voice quality, but also a way of singing to that of the target singer, analysis data of the target singer is required for each music piece and therefore a data amount becomes enormously large when analysis data of a plurality of music pieces are stored.
Conventionally in a technical field of karaoke or the like, there has been provided a voice processing technology of converting a singing voice of a singer to another in imitation of a singing voice of a specific singer such as a professional singer. Generally this voice processing requires an execution of alignment for associating two voice signals with each other in time series. For example, in synthesizing a target singer's voice vocalized “nakinagara (with tears)” based on a singer's voice vocalized “nakinagara” in imitation of the target, the sound “ki” may be vocalized by the target singer at a different timing from that of the user singer.
In this manner, even if each person vocalizes the same sound, the duration is not identical and the sound may be non-linearly elongated or contracted in many cases. Therefore, in a comparison of two voices, there is known a DP matching method (dynamic time warping: DTW) for time normalization by elongating and contracting a time axis non-linearly so that the phonemes correspond to each other in the two voices. In the DP matching method, a typical time series is used as a standard pattern regarding a word or a phoneme, and therefore voices can be matched in units of a phoneme against a temporal structural change of a time-series pattern.
Additionally, there is known a technique using a hidden Markov model .(HMM) having an excellent effect against a spectral fluctuation. In the hidden Markov model, a statistical fluctuation in the spectral time series can be reflected on a parameter of a model and therefore voices can be matched in units of a phoneme against a spectral fluctuation caused by individual variations of speakers.
However, the use of the above DP matching method deteriorates a precision for a spectral fluctuation and the conventional use of a hidden Markov model requires a large amount of a storage capacity and computation, and therefore both of them are unsuitable for voice process requiring real-time characteristics such as imitation in a karaoke apparatus.
SUMMARY OF THE INVENTION
Therefore, it is an object of the present invention to provide a voice converter capable of assimilating an input singer's voice to a target voice in a way of singing of a target singer and capable of reducing an analysis data amount of the target singer, voice converting method, and a voice conversion dictionary generating method.
It is another object of the present invention to provide a voice processing apparatus capable of executing real-time processing with a small storage capacity for voice processing of associating in time series a target voice with an input voice for temporal alignment, and a karaoke apparatus having the voice processing apparatus.
In one aspect of the invention, a voice converting apparatus is constructed for converting an input voice into an output voice according to a target voice. The apparatus comprises a storage section that provisionally stores source data, which is associated to and extracted from the target voice, an analyzing section that analyzes the input voice to extract therefrom a series of input data frames representing the input voice, a producing section that produces a series of target data frames representing the target voice based on the source data, while aligning the target data frames with the input data frames to secure synchronization between the target data frames and the input data frames, and a synthesizing section that synthesizes the output voice according to the target data frames and the input data frames.
Preferably, the storage section stores the source data containing pitch trajectory information representing a trajectory of a pitch of a phrase constituted by the target voice, phonetic notation information representing a sequence of phonemes with duration thereof in correspondence with the phrase of the target voice, and spectrum shape information representing a spectrum shape of each phoneme of the target voice. Further, the storage section stores the source data containing amplitude trajectory information representing a trajectory of an amplitude of the phrase constituted by the target voice.
Preferably, the producing section comprises a characteristic analyzer that extracts from the input voice a characteristic vector which is characteristic of the input voice, a memory that memorizes recognition phoneme data for use in recognition of phonemes contained in the input voice and target behavior data which is a part of the source data and which represents a behavior of the target voice, an alignment processor that determines a temporal relation between the input data frames and the target data frames according to the characteristic vector, the recognition phoneme data and the target behavior data so as to output alignment data corresponding to the determined temporal relation, and a target decoder that produces the target data frames according to the alignment data, the input data frames and the source data containing phoneme data representing phonemes of the target voice. Further, the producing section comprises a data converter that converts the target behavior data in response to parameter control data provided from an external into pitch trajectory information representing a trajectory of a pitch of the target voice, amplitude trajectory informati
Bonada Jordi
Cano Pedro
Kawashima Takahiro
Loscos Alex
Schiementz Mark
Dorvil Richemond
Han Qi
Pillsbury & Winthrop LLP
Yamaha Corporation
LandOfFree
Voice converter for assimilation by frame synthesis with... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Voice converter for assimilation by frame synthesis with..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Voice converter for assimilation by frame synthesis with... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3317368