Reexamination Certificate
2000-10-30
2004-03-02
McFadden, Susan (Department: 2655)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S239000
active
06701292
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a speech-recognizing apparatus. More particularly, the present invention relates to improvement of a speech-recognition rate in a noisy environment and reduction of the amount of speech-recognition processing.
2. Description of the Related Art
In recent years, products with a speech-recognition function have become increasingly common. However, current speech-recognition technologies perform well only under restrictive conditions, such as use in a quiet environment. Such restrictions are a major barrier to the spread of speech recognition, creating demand for a better speech-recognition rate in noisy environments. One conventional speech-recognition method for improving the speech-recognition rate in a noisy environment is disclosed in Japanese Patent Laid-open No. Hei5-210396. This method is referred to hereafter as the first prior art. The first prior art corrects the similarity between vectors by using the maximum similarity in each frame. In detail, the characteristics of an input audio signal are first analyzed and converted into a sequence of characteristic vectors along the time axis. A similarity between vectors is then found, in accordance with a probability distribution, from the distance between the characteristic vector of one frame of that sequence and a characteristic vector of a standard pattern cataloged in advance. Then, the maximum value of the similarities between vectors is found for each frame.
Subsequently, a correction value is found from the maximum similarity of each frame. Each similarity between vectors is then corrected by using the correction value to produce a corrected similarity. The corrected similarities of the frames are accumulated into a cumulative corrected similarity, which is then compared with a predetermined threshold value. If the cumulative corrected similarity is greater than the threshold value, the speech corresponding to it is determined to have been input. Because each similarity between vectors is corrected by using the maximum similarity of its frame, the effects of noise cancel each other out, resulting in an improved speech-recognition rate.
One conventional speech-recognition method for improving the speech-recognition rate in a word-spotting process is disclosed in Japanese Patent Laid-open No. Sho63-254498. This method is referred to hereafter as the second prior art. It uses the difference between the largest and second-largest similarities, or the ratio of the largest similarity to the second-largest similarity. In detail, a characteristic parameter is first extracted from the input speech. Then, a similarity between the extracted characteristic parameter and the characteristic parameter of each standard pattern is found. A cumulative similarity is then computed for each standard pattern by accumulating these similarities. The cumulative similarity is found by word spotting, which shifts the start point and the end point of the accumulation interval little by little. Subsequently, the cumulative similarities are sorted to determine the largest and second-largest ones. Then, the difference between the largest and second-largest similarities, or the ratio of the largest similarity to the second-largest similarity, is compared with a predetermined threshold value.
If the difference or the ratio is greater than the threshold value, the input speech is determined to be the word corresponding to the largest cumulative similarity. Because the difference between the largest and second-largest similarities, or the ratio of the largest similarity to the second-largest similarity, is compared with a predetermined threshold value as described above, only a sufficiently probable recognition result is accepted as a word. As a result, the speech-recognition rate is improved.
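The acceptance rule of the second prior art can be illustrated with a minimal sketch. This is not taken from the cited publication: the function name `accept_best_word`, the dictionary layout and the example scores are hypothetical, and the cumulative similarities are assumed to be positive numbers that have already been computed by word spotting.

```python
def accept_best_word(cumulative_similarities, threshold, use_ratio=False):
    """Return the top-scoring word if it beats the runner-up by more than
    `threshold`, otherwise None (nothing is recognized).

    cumulative_similarities: dict mapping word -> cumulative similarity
    (hypothetical layout; values assumed positive when use_ratio is True).
    """
    if len(cumulative_similarities) < 2:
        raise ValueError("need at least two standard patterns to compare")
    # Sort so that the largest and second-largest similarities come first.
    ranked = sorted(cumulative_similarities.items(),
                    key=lambda item: item[1], reverse=True)
    (best_word, best), (_, second) = ranked[0], ranked[1]
    # Either the ratio or the difference of the top two scores is tested.
    margin = best / second if use_ratio else best - second
    return best_word if margin > threshold else None


scores = {"yes": 0.9, "no": 0.5, "stop": 0.4}
print(accept_best_word(scores, threshold=0.3))                  # difference test
print(accept_best_word(scores, threshold=1.5, use_ratio=True))  # ratio test
```

Raising the threshold makes the rule reject more inputs, which is the trade-off between rejecting noise and missing speech that the text goes on to discuss.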
In the first prior art, a similarity between frames found by using a probability distribution is used to compare input speech with a standard pattern. In this case, the effect of noise can be inferred to a certain degree from the maximum similarity. If a distance between vectors is used in place of the similarity between frames, however, the minimum value of the vector-to-vector distances varies depending on, among other things, the type of phoneme. It is thus difficult to infer the effect of noise from the minimum value of the vector-to-vector distances. For this reason, the method of the first prior art cannot be applied when a distance is used to compare input speech with a standard pattern. In the second prior art, on the other hand, the threshold value is set strictly so as to prevent noise from being determined to be speech. In consequence, when the similarity between input speech and a standard pattern decreases due to the effect of noise or the like, speech cannot be detected in many cases.
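To make the frame-wise correction of the first prior art concrete, the following sketch accumulates corrected similarities per standard pattern. The source does not state how the correction value is derived from the per-frame maximum, so subtracting the maximum from every similarity in the frame is purely an assumption made here for illustration. The function name and the flat `frame_similarities[t][k]` layout (frame t, pattern k, with no time alignment) are likewise hypothetical simplifications.

```python
def cumulative_corrected_similarity(frame_similarities):
    """Accumulate per-frame corrected similarities for each standard pattern.

    frame_similarities[t][k] is the similarity of input frame t to
    pattern k (for example a log-likelihood). ASSUMPTION: the correction
    value is the per-frame maximum, subtracted from every similarity.
    """
    n_patterns = len(frame_similarities[0])
    totals = [0.0] * n_patterns
    for sims in frame_similarities:
        correction = max(sims)          # maximum similarity in this frame
        for k, s in enumerate(sims):
            totals[k] += s - correction  # corrected similarity
    return totals


clean = [[-1.0, -3.0], [-2.0, -5.0]]
# Same frames, but each frame is shifted by a frame-wide offset
# standing in for the effect of noise.
noisy = [[-4.0, -6.0], [-2.5, -5.5]]
print(cumulative_corrected_similarity(clean))
print(cumulative_corrected_similarity(noisy))
```

Under this assumed correction, an offset that affects all similarities of a frame equally drops out of the cumulative value. The same trick is not available for distances when, as the text notes, the minimum distance itself depends on the type of phoneme.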
FIG. 14 is a diagram showing a problem of the word-spotting process. Notations A1, A2, A3, A4, B1, B2, B3, B4, C1, C2, C3 and C4 shown in FIG. 14 each denote a speech interval in a word-spotting process. Speech may exist in any of these speech intervals, which have different start and end edges. For each of the speech intervals, a cumulative similarity between frames or a cumulative distance between frames is found by a method such as a DP (Dynamic Programming) matching technique or an HMM technique. In the example shown in FIG. 14, the similarity of the speech interval C2, which coincides with the input speech, is the maximum. Because speech may exist in any of the speech intervals and cumulative processing is carried out for each of them, the word-spotting process has the problem of a large amount of processing. To solve this problem, an end-edge-free method has been proposed. However, the end-edge-free method has the following problem.
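The processing cost of exhaustive word spotting can be seen in a small sketch. It uses a textbook DP matching (dynamic time warping) recurrence on one-dimensional features with an absolute-difference frame distance. The function names, the interval-length bounds and the toy signal are all hypothetical, and real systems would use characteristic vectors rather than scalars.

```python
def dtw_distance(segment, template):
    """Cumulative frame-to-frame distance between two sequences (DP matching)."""
    inf = float("inf")
    n, m = len(segment), len(template)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(segment[i - 1] - template[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def word_spot(signal, template, min_len, max_len):
    """Try every (start, end) interval; return the best one and the number of DP runs."""
    best, runs = None, 0
    for start in range(len(signal)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(signal):
                break
            runs += 1  # one full DP matching per candidate interval
            dist = dtw_distance(signal[start:end], template)
            if best is None or dist < best[0]:
                best = (dist, start, end)
    return best, runs


best, runs = word_spot([0, 0, 1, 2, 3, 0, 0], [1, 2, 3], min_len=2, max_len=4)
print(best, runs)
```

Even for this seven-frame signal, a full DP matching is run once per candidate interval, which is the source of the large amount of processing described above.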
FIG. 15 is a diagram showing the problem of the end-edge-free method. In the end-edge-free method shown in FIG. 15, a start edge is identified and cumulative processing is carried out for the interval beginning at that start edge, which is treated as a speech interval. Since cumulative processing is carried out for the speech intervals A, B and C shown in FIG. 15 instead of the speech intervals A1, A2, A3, A4, B1, B2, B3, B4, C1, C2, C3 and C4 shown in FIG. 14 for the word-spotting process, the amount of processing can be reduced. Since the period between the start edge and a speech-input point in the fixed-duration speech interval is indefinite, however, the end-edge-free method has the problem that the result is delayed. In the case of the speech interval C, for example, a delay τ inevitably results.
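One common reading of an end-edge-free scheme is a single DP pass from a fixed start edge in which every later frame is a candidate end edge. The sketch below follows that reading and is not taken from the patent. The function name `end_free_dtw`, the scalar features and the absolute-difference frame distance are assumptions made for illustration.

```python
def end_free_dtw(signal, template, start):
    """Single DP pass from a fixed start edge with a free end edge.

    Returns (best cumulative distance, best end index). The row kept in
    `prev` holds the cumulative distances for the frames consumed so far,
    so the last entry of each row is the score of ending at that frame.
    """
    inf = float("inf")
    m = len(template)
    prev = [0.0] + [inf] * m      # no input frames consumed yet
    best = (inf, None)
    for i in range(start, len(signal)):
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            d = abs(signal[i] - template[j - 1])
            cur[j] = d + min(prev[j], cur[j - 1], prev[j - 1])
        if cur[m] < best[0]:
            best = (cur[m], i + 1)  # interval is signal[start:i + 1]
        prev = cur
    return best


print(end_free_dtw([0, 0, 1, 2, 3, 0, 0], [1, 2, 3], start=2))
```

One pass per start edge replaces one DP run per (start, end) pair, which is where the saving in processing comes from. In this sketch the loop still has to consume the frames after the best end edge before the minimum is known, which loosely mirrors the delay τ described above.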
SUMMARY OF THE INVENTION
It is thus an object of the present invention addressing the problems described above to provide a speech-recognizing apparatus capable of improving the speech-recognition rate by reducing the effect of a noise in a case of using a distance between frames in comparison of an input voice with a standard pattern.
It is another object of the present invention to provide a speech-recognizing apparatus capable of detecting speech even for a case in which a frame-to-frame distance between input speech and a standard pattern increases or a frame-to-frame similarity between input speech and a standard pattern decr
Katayama Hiroshi
Kawai Chiharu
Nakai Takehiro
Fujitsu Limited
Katten Muchin Zavis & Rosenman
McFadden Susan