Reexamination Certificate (active)
Patent number: 06272466
Filed: 1998-03-02
Issued: 2001-08-07
Examiner: Korzuch, William (Department 2741)
Classification: Data processing: speech signal processing, linguistics, language – Speech signal processing – Application
Cross-reference classifications: C704S271000, C704S275000
ABSTRACT:
BACKGROUND OF THE INVENTION
This invention relates to a technology used in a field wherein voice information is coded and input to an information machine such as a computer or a word processor, and in particular is appropriate for detecting voice information in a noisy environment or in a conference, etc., where many people talk at the same time. The technology can also be used as a voice input apparatus for providing barrier-free machines that enable smooth information transmission for speech-impaired persons, hard-of-hearing persons, and elderly people.
The voice input apparatus of a machine aims at enabling the user's voice to be input accurately and, moreover, at high speed in any environment. Hitherto, apparatuses that analyze voice frequency, thereby recognizing and processing speech, have been proposed. In such a speech recognition technique, however, degradation of the recognition rate in a noisy environment is a problem. To prevent this problem, it is desirable to obtain utterance information from information other than voice. The human vocal organs involved directly in producing a voice are the lungs 901 of the air stream mechanism, the larynx 902 of the voice producing mechanism, the oral cavity 903 and nasal cavity 904, which take charge of the oro-nasal process, and the lips 905 and tongue 906, which govern the articulation process, as shown in FIG. 9, although the classification varies from one technical document to another. Research on obtaining utterance information from visual information of the lips 905 has been conducted as a technology for hearing-impaired persons. Further, it has been pointed out that speech recognition accuracy is enhanced by adding visual information on the motion of the lips 905 of the speaker to voice information (C. Bregler, H. Hild, S. Manke and A. Waibel, "Improving connected letter recognition by lipreading," Proc. IEEE ICASSP, pp. 557-560, 1993, etc.).
An image processing technique using images input through a video camera is the most common speech recognition technique based on visual information of the lips. For example, in the Unexamined Japanese Patent Application Publication No. Hei 6-43897, images of ten diffuse reflective markers M0, M1, M2, M3, M4, M5, M6, M7, M8, and M9 attached to the lips 905 of a speaker and the surroundings of the lips are input to a video camera, two-dimensional motion of the markers is detected, five lip feature vector components 101, 102, 103, 104, and 105 are found, and lip motion is observed (FIG. 10). In the Unexamined Japanese Patent Application Publication No. Sho 52-112205, the positions of black markers put on the lips and their periphery are read from video camera scanning lines to improve speech recognition accuracy. Although no specific description of a marker extraction method is given, the technique requires two-dimensional image preprocessing and a feature extraction technique for discriminating the markers from density differences caused by shadows produced by the nose and lips, from mustaches, beards, and whiskers, and from skin color differences, moles, scars, etc. To solve this problem, in the Unexamined Japanese Patent Application Publication No. Sho 60-3793, a lip information analysis apparatus is proposed in which four high-brightness markers such as light emitting diodes are put on the lips to facilitate marker position detection, the motion of the markers is photographed with a video camera, and pattern recognition is executed on voltage waveforms provided by a position sensor called a high-speed multipoint X-Y tracker. However, to detect voice in a lighted room, the technique also requires means for suppressing noise from high-brightness reflected light components produced by the spectacles, gold teeth, etc., of a speaker. Thus, it too requires preprocessing and a feature extraction technique for two-dimensional images input through a television camera, but this is not covered in the Unexamined Japanese Patent Application Publication No. Sho 60-3793. Several apparatuses that input the lips and their surroundings directly into a video camera without using markers and perform image processing for feature extraction of the vocal organs have also been proposed. For example, in the Unexamined Japanese Patent Application Publication No. Hei 6-12483, an image of the lips and their surroundings is input into a camera and processed to produce a contour image, and a vocalized word is estimated from the contour image by a back propagation method.
Proposed in the Unexamined Japanese Patent Application Publication No. Sho 62-239231 is a technique that uses the lip opening area and the lip aspect ratio to simplify lip image information. Designed in the Unexamined Japanese Patent Application Publication No. Hei 3-40177 is a speech recognition apparatus that holds the correlation between utterance sound and lip motion as a database for recognizing unspecified speakers. However, the conventional methods handle only position information provided from two-dimensional images of the lips and their periphery, and are insufficient to discriminate phonemes involving delicate lip angle changes and skin contraction. The conventional two-dimensional image processing methods must handle large amounts of information to extract markers and features, and thus are not suitable for high-speed processing.
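The two simplified lip features just mentioned, opening area and aspect ratio, can be sketched from 2D marker coordinates such as those of markers M0 to M9. This is an illustrative reconstruction, not code from any of the cited publications; the function name and the marker-coordinate input format are assumptions.

```python
# Hypothetical sketch: compute lip opening area and aspect ratio from
# ordered 2D marker positions around the lip contour.

def lip_features(points):
    """points: list of (x, y) marker positions tracing the lip contour,
    in order. Returns (opening_area, aspect_ratio)."""
    # Polygon area via the shoelace formula.
    area = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    area = abs(area) / 2.0

    # Aspect ratio: bounding-box height over width of the contour.
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    aspect = height / width if width else 0.0
    return area, aspect

# Example: a 4 x 2 rectangle approximating an open mouth.
print(lip_features([(0, 0), (4, 0), (4, 2), (0, 2)]))  # → (8.0, 0.5)
```

Tracking only these two scalars per frame is far cheaper than full two-dimensional image processing, which is exactly the simplification the cited publication aims at.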
Several methods that do not use a video camera have been proposed; among them are techniques of extracting utterance information from an electromyogram (EMG) of the surroundings of the lips. For example, in the Unexamined Japanese Patent Application Publication No. Hei 6-12483, an apparatus using binarization information of an EMG waveform is designed as an alternative to image processing. In Kurita et al., "A Physiological Model for the Synthesis of Lip Articulation" (The Journal of the Acoustical Society of Japan, Vol. 50, No. 6 (1994), pp. 465-473), a model for calculating the lip shape from an EMG signal is designed. However, utterance information extraction based on the EMG places a large load on the speaker, because electrodes with measurement cords must be put on the surroundings of the speaker's lips. Several techniques have also been invented in which an artificial palate is attached to obtain a palatographic signal, thereby detecting the tongue motion accompanying the voice production of a speaker for use in a voice input apparatus. For example, in the Unexamined Japanese Patent Application Publication No. Sho 55-121499, means for converting the presence or absence of contact between a transmission electrode attached to an artificial palate and the tongue into an electric signal is proposed. In the Unexamined Japanese Patent Application Publication No. Sho 57-160440, the number of electrodes attached to the artificial palate is decreased to improve tongue contact. In the Unexamined Japanese Patent Application Publication No. Hei 4-257900, a palatographic light reception signal is passed through a neural network, whereby unspecified speakers can be covered. In addition to the use of tongue motion, a device that brings the tip of a bush rod into contact with the soft palate, thereby observing vibration of the soft palate, is proposed in the Unexamined Japanese Patent Application Publication No. Sho 64-62123.
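The palatographic approach above converts the presence or absence of tongue contact at each electrode into an electric signal; one natural digital representation is a bit pattern per time frame. The following sketch is purely illustrative and is not taken from the cited patents; the function names and the frame encoding are assumptions.

```python
# Hypothetical sketch: represent one palatographic frame as a bit pattern.
# Each electrode on the artificial palate contributes one bit: 1 when the
# tongue touches it, 0 otherwise. A sequence of frames then becomes a
# sequence of integers that a recognizer could compare against stored
# reference patterns.

def encode_frame(contacts):
    """contacts: list of booleans, one per electrode, in a fixed order."""
    pattern = 0
    for i, touched in enumerate(contacts):
        if touched:
            pattern |= 1 << i
    return pattern

def contact_distance(a, b):
    """Number of electrodes whose contact state differs between two
    encoded frames (Hamming distance)."""
    return bin(a ^ b).count("1")

frame = encode_frame([True, False, True, True])  # electrodes 0, 2, 3 touched
print(frame)  # → 13
print(contact_distance(frame, encode_frame([True, False, False, True])))  # → 1
```

A distance of this kind could serve as a simple frame-matching score; real palatographic recognizers would of course use many more electrodes and a trained classifier, as in the neural-network approach of Hei 4-257900.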
However, such a device must be attached inside the human body, so there is a possibility that natural speech action may be disturbed, and the load on the speaker is also large. It is desirable that an utterance state detection apparatus or device avoid contact with the human body as much as possible.
A position detection method according to the prior marker-based technology is shown by taking the Unexamined Japanese Patent Application Publication No. Hei 6-43897 as an example (FIG. 10). In the prior technology, images of markers M0, M1, . . . , M9 are input from the front, where the features of the lips 905 and their periphery can best be grasped. Thus, the up-and-down movements 101, 102, 104 and the side-to-side movements 103, 105 of the marker positions accompanying utterance can be detected.
Fukui Motofumi
Harada Masaaki
Shimizu Tadashi
Takeuchi Shin
Chawan Vijay
Fuji Xerox Co., Ltd.
Oliff & Berridge, PLC
Speech detection apparatus using specularly reflected light