Word-spotting speech recognition device and system

Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition

Reexamination Certificate

Details

C704S002000, C704S252000, C704S236000

active

06230126

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a speech-recognition device.
2. Description of the Related Art
One well-known method of speech recognition is based on speech-frame detection. This scheme determines a start and an end of a speech frame to be recognized by using power information of the speech or the like, and performs a recognition process based on information obtained from the speech frame.
FIG. 1 is a flowchart of a method of recognizing speech based on speech-frame detection. In the speech recognition based on the speech-frame detection, a recognition process is started (step S1), and speech frames are detected as a speaker produces a speech (step S2). Speech information obtained from a speech frame is matched against a dictionary pattern (step S3), and a recognition object (a word in the dictionary) is output as a recognition result when this object exhibits the highest similarity (step S4). At the step S2, a beginning of a speech frame can be easily detected based on power information. An end of a speech frame, however, is detected when a silence continues to be present for more than a predetermined time period. This measure is taken in order to insure that a silence before a plosive consonant and a silence of a double consonant are differentiated.
A period of silence for detecting an end of a speech frame, however, is generally as long as about 250 msec to 350 msec because of a need to differentiate a silence of a double consonant. In this scheme, therefore, a recognition result is not available until 250 msec to 350 msec after the completion of speech input. This makes the recognition system slow to respond. If the period of silence for detecting the end of a speech frame is shortened for the sake of faster response, an erroneous recognition result may be obtained because the silence of a double consonant is mistaken for the end of the speech, and a result is output before the speech has actually ended.
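The trade-off described above can be illustrated with a minimal sketch of power-based frame detection. It is not taken from the patent: the function name `detect_speech_frame`, the 10 msec analysis step, the power threshold, and the toy power sequence are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple


def detect_speech_frame(
    frame_powers: List[float],
    power_threshold: float = 0.1,   # assumed value, not from the patent
    step_ms: int = 10,              # assumed analysis step
    end_silence_ms: int = 300,      # within the 250-350 msec range cited above
) -> Optional[Tuple[int, int, int]]:
    """Return (start, end, decision) indices, or None if no frame is completed.

    start:    first index whose power exceeds the threshold
    end:      last index whose power exceeded the threshold
    decision: index at which the end of the frame can actually be declared
    """
    needed_silent = end_silence_ms // step_ms
    start: Optional[int] = None
    last_voiced: Optional[int] = None
    silent_run = 0
    for i, power in enumerate(frame_powers):
        if power > power_threshold:
            if start is None:
                start = i            # beginning is detected from power alone
            last_voiced = i
            silent_run = 0           # a short silence was not the end after all
        elif start is not None:
            silent_run += 1
            if silent_run >= needed_silent:
                return start, last_voiced, i
    return None
```

With a toy utterance containing a 100 msec internal silence (standing in for a double consonant), a 300 msec timeout finds the whole utterance but can only declare the end 300 msec after the last voiced step, whereas a 50 msec timeout responds sooner but cuts the utterance at the internal silence.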
It is often observed that a speaker makes redundant sounds irrelevant to recognition of speech as in a situation where he/she may say “ah”, “oh”, etc. Since matching with a dictionary is started at a beginning of a speech frame when the speech frame is subjected to a recognition process, such redundant voices as “ah” and “oh” hinder detection of similarities, and result in an erroneous recognition result.
A word spotting scheme is designed to counter various drawbacks described above.
FIG. 2 is a flowchart of a process of a word spotting scheme. In this scheme, a recognition process is started (step S11), and speech information is matched against a dictionary without detecting a speech frame as a speaker makes a speech (step S12). Then, a check is made as to whether a detected similarity measure exceeds a predetermined threshold value (step S13). If it does not, a procedure goes back to the step S12 to continue matching of speech information against the dictionary. If the similarity measure exceeds the threshold at the step S13, a recognition object corresponding to this similarity measure is output as a recognition result (step S14). The word spotting scheme does not require detection of a speech frame, so that it facilitates implementation of a system having a faster response. Also, the word spotting scheme takes redundant words away from a speech before outputting recognition results, thereby providing a better recognition result.
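The loop of steps S12 to S14 can be sketched as follows. This is an illustrative toy, not the matching method of the patent: features are reduced to single numbers, each dictionary entry is a fixed-length template, and `toy_similarity` and `spot_words` are hypothetical names introduced here.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def toy_similarity(window: List[float], template: List[float]) -> float:
    """Assumed similarity measure: 1.0 for an exact match, lower otherwise."""
    dist = sum(abs(a - b) for a, b in zip(window, template)) / len(template)
    return 1.0 / (1.0 + dist)


def spot_words(
    feature_stream: Iterable[float],
    dictionary: Dict[str, List[float]],
    threshold: float,
) -> Iterator[Tuple[int, str, float]]:
    """Match continuously, without frame detection, and yield
    (time index, word, similarity) whenever the best similarity
    exceeds the threshold."""
    history: List[float] = []
    for t, feature in enumerate(feature_stream):
        history.append(feature)
        best_word: Optional[str] = None
        best_score = float("-inf")
        # Step S12: match the most recent input against every dictionary entry.
        for word, template in dictionary.items():
            if len(history) < len(template):
                continue
            score = toy_similarity(history[-len(template):], template)
            if score > best_score:
                best_word, best_score = word, score
        # Step S13: threshold check. Step S14: output the recognition result.
        if best_word is not None and best_score > threshold:
            yield t, best_word, best_score
```

Because matching restarts at every time index, leading input that matches no entry (the counterpart of "ah" or "oh") simply never exceeds the threshold, and a result is emitted as soon as a word is matched, with no wait for a closing silence.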
The word spotting scheme has its own drawback as described in the following. In the word spotting scheme, no speech frame is detected, and matching against a dictionary is conducted consecutively. If a result of the matching exceeds a threshold, a recognition result is obtained. Otherwise, the matching process is continued. Since the matching process is kept running regardless of the speaker's action, the recognition result obtained from this process may be output even when the speaker is not voicing a word to be recognized. This is called fountaining. For example, fountaining is observed when the speaker is not talking to the recognition device but is talking with someone around him/her.
A method of implementing the word spotting scheme can be found, for example, in “Method of Recognizing Word Speech Using a State Transition Model of Continuous Time Control Type”, Journal of the Institute of Electronics, Information and Communication Engineers, vol. J72-D-II, No. 11, pp. 1769-1777 (1989). According to the method disclosed in this document, data indicative of a time length is attached to phonemics in a dictionary or codebook. As a result, an improved recognition performance is obtained while reducing the amount of computation. In this method, however, a dictionary of recognized words is compiled by connecting phonemics using an average time length of each phonemic. Because of this, a long word in the dictionary may not correspond to an actually spoken word in terms of the time length of the word. This is because speakers have a psychological tendency to utter a shorter word and a longer word in an equal length of time. Further, when the speaker is excited, speech may become faster, and the voice may be raised. In such situations, a speech-recognition device may experience a degradation in matched similarity measures, and may suffer a drop in recognition performance. If the speech-recognition device uses the time length as a parameter, the time length of a word as spoken by a given speaker may be far different from the time length stored in a standard dictionary.
In this manner, the related-art voice-recognition device compiles words of a dictionary by connecting phonemics using an average time length of each phonemic. Because of this, there may be a discrepancy in a time length between a word in the dictionary and an actually spoken word, resulting in a degradation in recognition performance.
Accordingly, there is a need for a speech-recognition device which can enhance a recognition performance by updating time-length parameters in a standard dictionary in accordance with a time length of an actually spoken word.
SUMMARY OF THE INVENTION
Accordingly, it is a general object of the present invention to provide a speech-recognition device which can satisfy the need described above.
It is another and more specific object of the present invention to provide a speech-recognition device which can enhance a recognition performance by updating time-length parameters in a standard dictionary in accordance with a time length of an actually spoken word.
In order to achieve the above objects according to the present invention, a device for speech recognition includes a dictionary which stores features of recognition objects, a matching unit which compares features of input speech with the features of the recognition objects, and a dictionary updating unit which updates time lengths of phonemics in the dictionary based on the input speech when the matching unit finds substantial similarities between the input speech and one of the recognition objects.
According to another aspect of the present invention, the device as described above further includes a feature-extraction unit which extracts the features of input speech from the input speech without detecting speech frames, and wherein the matching unit compares the features of input speech with the features of the recognition objects so as to produce a similarity measure continuously without breaks of speech frames, and the dictionary updating unit updates the time lengths of phonemics when the similarity measure exceeds a predetermined threshold.
According to another aspect of the present invention, the device as described above is such that the dictionary updating unit compares a sum of the time lengths of phonemics constituting the one of the recognition objects with an actual time length of the input speech corresponding to the one of the recognition objects, and updates the time lengths of the phonemics in the dictionary based on a difference between the sum and the actual time length.
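The update described in this aspect can be sketched as below. The text only states that the update is based on the difference between the summed time lengths and the actual time length; the proportional scaling, the `rate` parameter, the millisecond units, and the names `update_entry` and `update_if_recognized` are assumptions made for this sketch, not the rule used by the patent.

```python
from typing import List, Tuple

# A dictionary entry is modeled as a list of (phoneme label, time length in msec).
Entry = List[Tuple[str, float]]


def update_entry(entry: Entry, actual_length_ms: float, rate: float = 0.5) -> Entry:
    """Move the stored time lengths toward the observed word length.

    The difference between the observed length and the sum of the stored
    lengths is spread over the phonemes in proportion to their stored
    lengths; `rate` (assumed) controls how much of it is applied.
    """
    expected = sum(length for _, length in entry)
    if expected <= 0:
        raise ValueError("entry must have a positive total time length")
    difference = actual_length_ms - expected
    scale = 1.0 + rate * difference / expected
    return [(phoneme, length * scale) for phoneme, length in entry]


def update_if_recognized(
    entry: Entry,
    similarity: float,
    threshold: float,
    actual_length_ms: float,
    rate: float = 0.5,
) -> Entry:
    """Update only when the similarity measure exceeds the threshold."""
    if similarity <= threshold:
        return entry
    return update_entry(entry, actual_length_ms, rate)
```

For example, an entry whose stored lengths sum to 200 msec, matched against an utterance that lasted 160 msec, would be shortened to a total of 180 msec with `rate=0.5`, or to 160 msec with `rate=1.0`.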
According to anoth
