Reexamination Certificate
1998-01-22
2002-02-19
Korzuch, William (Department: 2741)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S246000
active
06349281
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of Invention
The present invention relates to a voice recognition learning data creation method and apparatus which create the learning data used to train the voice model for unspecified speaker voice recognition.
2. Description of Related Art
One of the voice recognition technologies used for unspecified speakers employs the Dynamic Recurrent Neural Network (DRNN) voice recognition model. Applicants have already filed applications concerning voice recognition technology based on DRNN, as Japanese Laid Open Patents hei 6-4079 and hei 6-119476.
In the DRNN voice model, the characteristic vector series of a word is input as time series data. So that an appropriate output is obtained for the word, the connection strengths between units (the “buildup”) and the biases are determined in accordance with a predetermined learning procedure (the “learning precedent”). As a result, for voice data spoken by unspecified speakers, an output is obtained which is close to the taught output for the word.
For example, the time series data of the characteristic vector series of the word “ohayo” (good morning) spoken by some unspecified speaker is input. So that the output is close to the taught output, which is the ideal output for “ohayo”, the data for each respective two dimensions of the characteristic vector at each time of the word are applied to the corresponding input units and converted by the buildup and bias established through the learning precedent. Time series processing is thus performed on the time series data of the characteristic vector series of each input word. As a result, an output close to the taught output for the word is obtained for voice data spoken by an unspecified speaker.
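The time series processing described above can be illustrated with a minimal sketch of a generic discrete-time recurrent network. This is not the DRNN of the cited applications, whose exact structure is not given here; the function name `recurrent_forward`, the sigmoid units, and the weight layout are all assumptions made for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def recurrent_forward(frames, w_in, w_rec, bias, out_index=0):
    """Run a characteristic vector series through a small recurrent net.

    Hypothetical layout (not taken from the patent):
    frames : list of characteristic vectors, one per time step
    w_in   : w_in[j][i] is the weight from input dimension i to unit j
    w_rec  : w_rec[j][k] is the weight from unit k (previous step) to unit j
    bias   : bias[j] is the bias of unit j
    Returns the activation of unit `out_index` at every time step.
    """
    n_units = len(bias)
    state = [0.0] * n_units  # unit activations before the first frame
    outputs = []
    for x in frames:
        new_state = []
        for j in range(n_units):
            total = bias[j]
            total += sum(w * xi for w, xi in zip(w_in[j], x))
            total += sum(w * s for w, s in zip(w_rec[j], state))
            new_state.append(sigmoid(total))
        state = new_state
        outputs.append(state[out_index])
    return outputs
```

In such a model the weights and biases would be fixed by prior learning, and recognition compares the output series with the taught output for the word.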
With regard to the DRNN voice model prepared for each of the words which should be recognized, the learning precedent which changes the buildup to obtain an appropriate output for the respective words is described on pages 17-24 of the speech technical report published by the Institute of Electronics, Information and Communication Engineers, “Technical Report of IEICE SP92-125 (1993-01).”
The present invention is not limited to the DRNN voice model. When a voice model for unspecified speaker voice recognition is created, a database is used in which learning data has been created from the speech data of several hundred people speaking a set of words (for example, about 200 words). Ordinarily, the voice model is created by learning on the basis of the learning data included in the database.
However, there are cases in which a voice model is created for words which are not in the database and which must be obtained from the user. Prior to this invention, when a voice model was created for words not in the database, several hundred persons were asked to say the words, learning data for the words was created from that spoken data, and the voice model was then created based on that learning data.
Thus, whenever a voice model was created for new words, it was necessary to gather several hundred people in order to create the learning data for training the voice model. Consequently, creating the voice model required a great amount of time and was also costly.
SUMMARY OF THE INVENTION
In accordance with the present invention, a voice recognition model for words which are not included in the database can be created by generating learning data corresponding to several hundred people from the spoken data of one selected individual or a few people. It is thus an object of the present invention to provide a voice model learning data creation method and a voice model apparatus that can generate a voice model for new words in a short period of time and at low cost.
The voice model learning data creation method according to the present invention creates learning data used to train the voice model for voice recognition. From among the spoken data obtained from a number of speakers and held in a preestablished database, the method takes the spoken data of at least one individual as standard speaker data. In addition, learning speaker data is obtained from the database. Using the preestablished word data, a conversion coefficient is created for converting standard speaker data into learning speaker data. To create the learning data for new words, data is obtained from standard speakers who speak the new words, and that data is converted to the learning speaker data space using the conversion coefficient. Learning data is thus created for the new words.
When a voice model is created for new words which do not exist in the database, the learning data for those words can be created on the basis of the speaker data of only a few standard speakers, and the voice model can be created from that learning data. It is therefore no longer necessary, as in the conventional art, to collect the speaker data of several hundred individuals in order to create learning data for the new words, and the voice model is created in a short time and at low cost.
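The overall flow of the method (standard speaker data, learning speaker data, one conversion coefficient per learning speaker, and conversion of the new word) might be sketched as follows. All names are hypothetical, and the database layout is an assumption. The mean-offset coefficient is only a placeholder chosen to keep the sketch short; it should not be read as the conversion coefficient of the invention.

```python
def create_learning_data(database, standard_speaker, new_word_frames, fit, convert):
    """Return {learning_speaker: converted frames} for one new word.

    Assumed layout: database = {speaker: {word: list of characteristic vectors}}
    standard_speaker : key of the speaker chosen as the standard speaker
    new_word_frames  : the standard speaker's utterance of the new word
    fit(std_words, lrn_words) -> conversion coefficient (any object)
    convert(frames, coefficient) -> frames in the learning speaker data space
    """
    std_words = database[standard_speaker]
    learning_data = {}
    for speaker, lrn_words in database.items():
        if speaker == standard_speaker:
            continue
        coefficient = fit(std_words, lrn_words)
        learning_data[speaker] = convert(new_word_frames, coefficient)
    return learning_data


def fit_mean_offset(std_words, lrn_words):
    """Placeholder coefficient: mean learning frame minus mean standard frame."""
    def mean(words):
        frames = [f for w in sorted(words) for f in words[w]]
        dims = len(frames[0])
        return [sum(f[d] for f in frames) / len(frames) for d in range(dims)]

    m_std, m_lrn = mean(std_words), mean(lrn_words)
    return [l - s for l, s in zip(m_lrn, m_std)]


def apply_offset(frames, offset):
    """Shift every frame by the same offset vector."""
    return [[x + o for x, o in zip(f, offset)] for f in frames]
```

With a database of several hundred learning speakers, one utterance of a new word by the standard speaker would yield one artificial utterance per learning speaker.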
In addition, the data which exists in the standard speaker data space and the learning speaker data space is stored as characteristic vectors for the respective words, obtained by frequency analysis of the voice signals. The process for converting the data obtained from standard speakers who say new words is accomplished by using differential vectors for the characteristic vectors representing the respective new words in the standard speaker data space and in the learning speaker data space.
If a characteristic vector obtained by the frequency analysis of voice signals is used (for example, data expressed as LPC cepstrum coefficients having 10 dimensions), high precision data is obtained. Furthermore, since preobtained differential vectors are used to convert data from the standard speaker data space to the learning speaker data space, the data conversion is accomplished simply and with high precision.
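A minimal sketch of conversion by differential vectors follows. It rests on two assumptions that this text does not state: that the standard and learning speakers' characteristic vector series for a known word have already been time-aligned to the same length, and that the differential vector applied to a new-word frame is the one attached to the nearest known standard-speaker frame. The function names are hypothetical.

```python
def differential_vectors(std_frames, lrn_frames):
    """Pair each standard frame with (learning frame - standard frame).

    Assumes the two series are already time-aligned, frame by frame.
    """
    if len(std_frames) != len(lrn_frames):
        raise ValueError("series must be time-aligned to the same length")
    pairs = []
    for s, l in zip(std_frames, lrn_frames):
        pairs.append((s, [lv - sv for lv, sv in zip(l, s)]))
    return pairs


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def convert_with_differentials(new_frames, pairs):
    """Move each new-word frame into the learning speaker data space.

    Assumption: use the differential vector of the nearest known
    standard-speaker frame (nearest in squared Euclidean distance).
    """
    converted = []
    for frame in new_frames:
        _, diff = min(pairs, key=lambda p: squared_distance(frame, p[0]))
        converted.append([x + d for x, d in zip(frame, diff)])
    return converted
```

Because the differential vectors are computed once in advance, converting a new word reduces to a lookup and a vector addition per frame.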
In addition, the data existing in the standard speaker data space and the learning speaker data space is code data obtained by vector quantizing the characteristic vectors for each of the respective words, which are obtained through the frequency analysis of the voice signal. In this case, the process of converting the data obtained from the speech of standard speakers for new words to the learning speaker data space using the conversion coefficient first vector quantizes the standard speaker data for the new words to obtain code data. The code data is then converted from the standard speaker data space to the learning speaker data space by mapping it into the learning speaker data space.
In other words, the invention accomplishes processing by vector quantizing the characteristic vectors obtained through the frequency analysis of voice signals. Although the data becomes slightly coarser, the processing is simplified and the processing time is shortened.
In addition, the voice model learning data creation apparatus of the present invention creates learning data used to train the voice model for voice recognition. The apparatus is provided with a standard speaker data storage component which stores the spoken data of at least one individual selected from the spoken data obtained from many individuals and held in a preestablished database. A learning speaker data storage component stores the spoken data of speakers other than the standard speakers as a learning speaker database. An artificial learning word data creation component has a data conversion component which, using a preobtained conversion coefficient, accomplishes data conversion from the standard speaker data space to the learning speaker data space. An effective learning data component stores the data created by the artificial learning word data creation
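The vector quantized variant can be sketched in the same spirit. The codebooks, the time-aligned training frames, and the rule used to build the code mapping (each standard code maps to the learning code it co-occurs with most often) are all assumptions made for illustration; this text does not specify how the mapping is obtained.

```python
from collections import Counter


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frame, codebook):
    """Return the index of the code vector nearest to `frame`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(frame, codebook[i]))


def build_code_map(std_frames, lrn_frames, std_codebook, lrn_codebook):
    """Map each standard code to its most frequent co-occurring learning code.

    Assumes std_frames and lrn_frames are time-aligned frame by frame.
    """
    counts = {}
    for s, l in zip(std_frames, lrn_frames):
        s_code = quantize(s, std_codebook)
        l_code = quantize(l, lrn_codebook)
        counts.setdefault(s_code, Counter())[l_code] += 1
    return {s_code: c.most_common(1)[0][0] for s_code, c in counts.items()}


def convert_codes(new_frames, std_codebook, code_map):
    """Quantize new-word frames, then map the codes to the learning space.

    Raises KeyError if a frame falls on a standard code that never
    appeared while the code map was being built.
    """
    return [code_map[quantize(f, std_codebook)] for f in new_frames]
```

Each frame is reduced to a single code index, so the conversion is a table lookup; the price is the coarser representation noted above.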
Aizawa Tadashi
Hasegawa Hiroshi
Inazumi Mitsuhiro
Miyazawa Yasunaga
Abebe Daniel
Korzuch William
Oliff & Berridge
Seiko Epson Corporation
Voice model learning data creation method and its apparatus