Reexamination Certificate
1999-10-29
2002-02-19
Tsang, Fan (Department: 2645)
Data processing: speech signal processing, linguistics, language
Speech signal processing
For storage or transmission
C704S205000
active
06349277
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a method for analyzing the pitch and power of voices in detail, and to a method and a medium for synthesizing high-quality voices and for compressing and encoding voices efficiently using the analyzing method.
2. Related Art of the Invention
An object of a voice synthesizing system is to synthesize given contents of a voice as voice waveforms. Various methods for synthesizing voices have been invented so far. A representative method among them is the waveform editing and synthesizing method, which stores voice waveforms in advance in fine units (synthesis units), then selects the units appropriate to the target contents and connects them.
In such a voice synthesizing method, the sense of discontinuity and unnaturalness produced when units are connected can be reduced by changing the pitch and the time length of each unit, so that voices are synthesized smoothly. One well-known method for changing pitches and time lengths in this way is the PSOLA (Pitch Synchronous Overlap Add) method (F. Charpentier, M. Stella, “Diphone synthesis using an over-lapped technique for voice waveforms concatenation”, Proc. ICASSP, pp. 2015-2018, Tokyo, 1986). In this method, pitch marks are assigned in advance to local peak positions or glottal closures of unit waveforms, and pitch waveforms are cut out around each pitch-marked position using a window function. Voices are then synthesized by overlapping and adding these pitch waveforms at the desired pitch intervals.
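The windowing and overlap-add steps can be illustrated with a minimal sketch. It is not the cited authors' implementation: the Hann window, the function names, the toy cosine input, and the hand-picked mark positions are all assumptions made for illustration.

```python
import math

def hann(n):
    """Symmetric Hann window of odd length n; equals 1.0 at the centre sample."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def cut_pitch_waveforms(signal, marks, half_width):
    """Cut one windowed segment of 2*half_width+1 samples centred on each pitch mark."""
    win = hann(2 * half_width + 1)
    segments = []
    for m in marks:
        seg = []
        for k in range(-half_width, half_width + 1):
            idx = m + k
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0  # zero-pad at the edges
            seg.append(sample * win[k + half_width])
        segments.append(seg)
    return segments

def overlap_add(segments, new_marks, half_width, length):
    """Re-place each segment so its centre lands on the new mark, summing overlaps."""
    out = [0.0] * length
    for seg, m in zip(segments, new_marks):
        for k, value in enumerate(seg):
            idx = m - half_width + k
            if 0 <= idx < length:
                out[idx] += value
    return out

# Toy input (assumed): a cosine with a 100-sample period whose peaks serve as pitch marks.
signal = [math.cos(2.0 * math.pi * n / 100.0) for n in range(600)]
marks = [100, 200, 300, 400, 500]
segments = cut_pitch_waveforms(signal, marks, half_width=100)

# Placing the marks 80 samples apart shortens the period, i.e. raises the pitch.
new_marks = [100, 180, 260, 340, 420]
resynthesized = overlap_add(segments, new_marks, half_width=100, length=600)
```

Because each segment is centred on its pitch mark, any timing error in the marks carries over directly into the resynthesized waveform, which is why the choice of marking method matters.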
Pitch marking methods used for voice synthesis as described above include methods that assign pitch marks to local peaks of time waveforms and methods that assign them to glottal closures. An example of the method for assigning pitch marks to local peaks of time waveforms is introduced in “Constructing a Waveform Inventory for Text-to-Speech Synthesis Based on Waveform Splicing” (Proc. Autumn Meeting Acoust. Soc. Japan, 3-5-5, 1994-11). The advantage of this method is its simplicity. For complicated voice waveforms that include many high-frequency components, however, it is difficult to assign exactly one pitch mark to each pitch cycle. In addition, the position of the peak itself fluctuates in time because of such high-frequency components. Consequently, synthesized waveforms have a phase fluctuation in each pitch cycle. This gives rise to the problem of thick-sounding voices, which listeners find uncomfortable.
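The time fluctuation of local peaks can be reproduced on a toy signal. The sketch below is an illustration only, not the cited paper's procedure: it assumes the pitch period is already known, and the block-wise maximum search, the function names, and the test signals are hypothetical.

```python
import math

def peak_pitch_marks(signal, period):
    """Mark the largest sample in each consecutive block of `period` samples."""
    marks = []
    for start in range(0, len(signal) - period + 1, period):
        block = signal[start:start + period]
        offset = max(range(period), key=lambda i: block[i])
        marks.append(start + offset)
    return marks

def intervals(marks):
    """Distances between successive marks, in samples."""
    return [b - a for a, b in zip(marks, marks[1:])]

# Clean fundamental with a 100-sample period, peaking in the middle of each block.
clean = [math.cos(2.0 * math.pi * (n - 50) / 100.0) for n in range(400)]
# The same fundamental plus a high-frequency component (7.3-sample period, not a harmonic).
rich = [c + 0.5 * math.cos(2.0 * math.pi * n / 7.3) for n, c in enumerate(clean)]

clean_intervals = intervals(peak_pitch_marks(clean, 100))
rich_intervals = intervals(peak_pitch_marks(rich, 100))
```

On the clean signal every interval equals the true period. Once the high-frequency component is added, the detected peaks shift by a few samples from cycle to cycle, so the intervals no longer match the underlying pitch period.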
On the other hand, a method for assigning pitch marks to glottal closures of voice waveforms is introduced by M. Sakamoto et al.: “A New Waveform Overlap-Add Technique for Text-to-Speech Synthesis”, Technical Report of IEICE SP95-6 (1995-05) and by Y. Arai et al.: “A Study on the Optimal Window Position to Extract Pitch Waveforms Based on a Speech Signal Model”, Proc. Spring Meeting Acoust. Soc. Japan, 1-4-22, 1995-03. In this method, voice waveforms are analyzed using a wavelet transform method and a linear prediction analysis method to estimate the glottal closure timing, and a pitch mark is assigned to that position. The glottal closure extracting method has the advantage that exactly one pitch mark can be assigned accurately to each pitch cycle. Since this method is equivalent to cutting out the response waveforms corresponding to glottal closure pulses, pitch waveforms can be cut out with less spectral distortion. The method is thus favorable from the viewpoint of cutting out waveforms. Its drawback, however, is that the analysis needed to estimate glottal closures is complicated.
In addition to those methods, there is also a technology that extracts the fundamental component of a voice using an FIR linear-phase band-pass filter whose pass band is set adaptively around the voice pitch frequency, and that partitions the voice waveform into pitch cycles at zero-cross positions. The technology is introduced in “Fine Pitch Contour Extraction by Voice Fundamental Wave Filtering Method”, Journal of Acoust. Soc. Japan, Vol.51, No.7, pp.509-518, 1995. This method is used to analyze fine pitches, but it can also be used to find pitch cycles synchronized with the fundamental waveform.
A partitioning point extracted by the above method is not related directly to either the local peaks or the glottal closures of voice waveforms. It is therefore sometimes inappropriate to use such a partitioning point as a pitch mark without modification.
As described above, the method that uses a local peak of the time waveform as a pitch mark has the problem that synthesized voices sound thick, because the pitch mark includes the fluctuation that occurs around each peak of the time waveform. The method that uses a glottal closure as a pitch mark has the problem that the processing for estimating glottal closures is complicated. In addition, the method that filters out the fundamental component has the problem that a timing suitable for use as a pitch mark cannot be extracted.
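The fundamental wave filtering idea can be sketched as follows. This is a simplified illustration, not the cited paper's algorithm: it assumes the pitch frequency `f0` is already known and constant, uses a Hamming-windowed sinc design for the linear-phase band-pass filter, and all function names, the sampling rate, and the test signal are hypothetical.

```python
import math

def bandpass_fir(f_lo, f_hi, fs, num_taps):
    """Linear-phase band-pass FIR (Hamming-windowed sinc); num_taps must be odd."""
    mid = num_taps // 2
    taps = []
    for i in range(num_taps):
        n = i - mid
        if n == 0:
            h = 2.0 * (f_hi - f_lo) / fs
        else:
            h = (math.sin(2.0 * math.pi * f_hi * n / fs)
                 - math.sin(2.0 * math.pi * f_lo * n / fs)) / (math.pi * n)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * i / (num_taps - 1))
        taps.append(h * window)
    return taps

def filter_centered(signal, taps):
    """Convolve with zero padding, compensating the (num_taps - 1) / 2 sample delay."""
    mid = len(taps) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            idx = n + mid - k
            if 0 <= idx < len(signal):
                acc += h * signal[idx]
        out.append(acc)
    return out

def rising_zero_crossings(x):
    """Indices where the waveform goes from negative to non-negative."""
    return [n for n in range(1, len(x)) if x[n - 1] < 0.0 <= x[n]]

# Assumed test signal: 100 Hz fundamental plus 3rd and 7th harmonics, sampled at 8 kHz.
fs, f0 = 8000.0, 100.0
signal = [math.sin(2.0 * math.pi * f0 * n / fs)
          + 0.8 * math.sin(2.0 * math.pi * 3.0 * f0 * n / fs + 1.0)
          + 0.5 * math.sin(2.0 * math.pi * 7.0 * f0 * n / fs)
          for n in range(2000)]

# Pass band placed around the (assumed known) pitch frequency.
taps = bandpass_fir(0.5 * f0, 1.5 * f0, fs, 401)
fundamental = filter_centered(signal, taps)
crossings = rising_zero_crossings(fundamental)
```

The crossings fall one pitch period apart (80 samples here), but they sit at the zero phase of the fundamental rather than at a waveform peak or a glottal closure.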
SUMMARY OF THE INVENTION
Under these circumstances, it is an object of the present invention to provide a method for analyzing voices that can assign pitch marks more simply and more properly than the related art, and a method and a medium for synthesizing voices of higher quality than the related art.
One aspect of the method according to the invention is a method for analyzing voices that generates pitch mark information, assumed to be time reference positions corresponding to a pitch cycle of voice waveforms, by using means for storing voice waveforms; means for analyzing pitches; an adaptive filter; and means for detecting peaks, wherein
some of said voice waveforms are stored temporarily using said voice waveform storing means;
rough pitch information is generated from said voice waveforms stored temporarily, by using said pitch analyzing means;
said voice waveforms stored temporarily are entered to said adaptive filter and, by changing a cut-off frequency or a center frequency of said adaptive filter according to said rough pitch information, only the fundamental component extracted from the entered voice waveforms is passed; and
plural maximum points are detected on one side of the waveform of said fundamental component by using said peak detecting means, thereby generating a series of accurate pitch mark information for the whole voice waveforms.
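The steps of this aspect can be sketched under stated assumptions. The source does not specify the pitch analyzer, the filter design, or the cut-off rule, so the autocorrelation estimate, the windowed-sinc low-pass filter, the 1.5 × f0 cut-off, and every function name below are hypothetical. For brevity the sketch uses a single rough pitch estimate for the whole stored excerpt, whereas an adaptive filter would update its cut-off as the pitch changes.

```python
import math

def rough_pitch_hz(frame, fs, f_min=60.0, f_max=400.0):
    """Rough pitch from the autocorrelation peak over lags fs/f_max .. fs/f_min."""
    best_lag, best_val = 0, float("-inf")
    for lag in range(int(fs / f_max), int(fs / f_min) + 1):
        val = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return fs / best_lag

def lowpass_fir(cutoff, fs, num_taps):
    """Linear-phase low-pass FIR (Hamming-windowed sinc); num_taps must be odd."""
    mid = num_taps // 2
    taps = []
    for i in range(num_taps):
        n = i - mid
        h = 2.0 * cutoff / fs if n == 0 else math.sin(2.0 * math.pi * cutoff * n / fs) / (math.pi * n)
        taps.append(h * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (num_taps - 1))))
    return taps

def filter_centered(signal, taps):
    """Convolve with zero padding, compensating the filter's group delay."""
    mid = len(taps) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            idx = n + mid - k
            if 0 <= idx < len(signal):
                acc += h * signal[idx]
        out.append(acc)
    return out

def positive_maxima(x):
    """Local maxima on the positive side of the waveform only."""
    return [n for n in range(1, len(x) - 1) if x[n] > 0.0 and x[n - 1] < x[n] >= x[n + 1]]

def analyze_pitch_marks(stored, fs):
    """Rough pitch -> cut-off set from that pitch -> fundamental -> one-sided peaks."""
    f0 = rough_pitch_hz(stored, fs)
    fundamental = filter_centered(stored, lowpass_fir(1.5 * f0, fs, 401))  # 1.5 * f0 is an assumed rule
    return f0, positive_maxima(fundamental)

# Assumed test signal: 100 Hz fundamental plus a strong 3rd harmonic, sampled at 8 kHz.
fs = 8000.0
stored = [math.sin(2.0 * math.pi * 100.0 * n / fs)
          + 0.8 * math.sin(2.0 * math.pi * 300.0 * n / fs + 1.0)
          for n in range(2000)]
f0, marks = analyze_pitch_marks(stored, fs)
```

Because the peaks are taken from the filtered fundamental rather than from the raw waveform, the high-frequency components no longer move the detected positions from cycle to cycle.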
The method of claim 2 is a method for analyzing voices that generates pitch mark information, assumed to be time reference positions corresponding to a pitch cycle of voice waveforms, by using plural peak detecting channels, each of which is a set of a fixed low-pass filter and a peak detecting means, and means for selecting a channel, wherein
cut-off frequencies of said plural fixed low-pass filters are set so that at least one of said plural fixed low-pass filters passes only the fundamental component of entered voice waveforms;
each of said fixed low-pass filters is used to output waveforms of low frequency components of specified frequencies of the entered voice waveforms;
said peak detecting means is used to detect plural maximum points on one side of the waveforms of said low frequency components output from said fixed low-pass filter and to output said detected plural maximum points as peak information;
said channel selecting means is used to select a peak detecting channel every predetermined period on the basis of a specified selection reference by using all or some of the peak information output from said plural peak detecting channels; and
a series of pitch mark information is generated for the whole voice waveforms by using the peak information output from said selected peak detecting channel.
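The multi-channel arrangement can be sketched as below. The selection reference is not specified in this excerpt, so the rule used here is purely an assumption: in each period, pick the lowest cut-off channel whose output still carries at least half of the input RMS. That rule relies on output energy rather than on the peak information itself. The filter design, cut-off values, frame length, and all names are likewise hypothetical.

```python
import math

def lowpass_fir(cutoff, fs, num_taps):
    """Linear-phase low-pass FIR (Hamming-windowed sinc); num_taps must be odd."""
    mid = num_taps // 2
    taps = []
    for i in range(num_taps):
        n = i - mid
        h = 2.0 * cutoff / fs if n == 0 else math.sin(2.0 * math.pi * cutoff * n / fs) / (math.pi * n)
        taps.append(h * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (num_taps - 1))))
    return taps

def filter_centered(signal, taps):
    """Convolve with zero padding, compensating the filter's group delay."""
    mid = len(taps) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            idx = n + mid - k
            if 0 <= idx < len(signal):
                acc += h * signal[idx]
        out.append(acc)
    return out

def positive_maxima(x):
    """Local maxima on the positive side of the waveform only."""
    return [n for n in range(1, len(x) - 1) if x[n] > 0.0 and x[n - 1] < x[n] >= x[n + 1]]

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def multichannel_pitch_marks(signal, fs, cutoffs, frame_len, num_taps=201, min_ratio=0.5):
    """Run every fixed channel, then choose one channel per frame (assumed rule)."""
    channels = []
    for fc in sorted(cutoffs):
        low = filter_centered(signal, lowpass_fir(fc, fs, num_taps))
        channels.append((low, positive_maxima(low)))
    marks, chosen = [], []
    for start in range(0, len(signal), frame_len):
        end = min(start + frame_len, len(signal))
        ref = rms(signal[start:end])
        pick = len(channels) - 1  # fall back to the widest channel
        for i, (low, _) in enumerate(channels):
            if ref > 0.0 and rms(low[start:end]) >= min_ratio * ref:
                pick = i  # lowest cut-off that still passes enough energy
                break
        chosen.append(pick)
        marks.extend(p for p in channels[pick][1] if start <= p < end)
    return marks, chosen

# Assumed test signal at 4 kHz: 100 Hz pitch for 800 samples, then 250 Hz pitch.
fs = 4000.0
def tone(f0, n):
    return math.sin(2.0 * math.pi * f0 * n / fs) + 0.6 * math.sin(2.0 * math.pi * 3.0 * f0 * n / fs + 1.0)

signal = [tone(100.0, n) for n in range(800)] + [tone(250.0, n) for n in range(800)]
marks, chosen = multichannel_pitch_marks(signal, fs, [60.0, 150.0, 400.0, 1000.0], frame_len=400)
```

With these settings the 150 Hz channel serves the low-pitched half and the 400 Hz channel serves the high-pitched half. No filter ever changes its cut-off, which is the practical appeal of fixed channels over an adaptive filter.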
Still another aspect of the method according to the invention is a method for synthesizing voices in which target voice waveforms recorded in advance are analyzed to generate phoneme series information, phoneme timing information, pitch information, and amplitude information, and
voices are synthesized according to said phoneme series information, said phoneme timing information, said pitch information, and said amplitude information, wherein said phoneme series information holds types of phonemes and their appearance order in said target voice waveforms;
said pitch inform
Kamai Takahiro
Matsui Kenji
Opsasnick Michael N.
Ratner & Prestia, PC
Tsang Fan