Reexamination Certificate
1999-01-20
2002-08-20
Banks-Harold, Marsha D. (Department: 2641)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S256000, C704S251000
active
06438520
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates in general to telecommunication systems, and more particularly, to cross-speaker speech recognition for telecommunication applications.
BACKGROUND OF THE INVENTION
With increasingly sophisticated telecommunication systems, speech recognition technology is increasingly important. For example, speech recognition technology is useful for various automated intelligent network functions, such as for a voice controlled intelligent personal agent that handles a wide variety of call and message functions. The voice controlled intelligent personal agent designed and implemented by Lucent Technologies, for example, includes natural language, voice controlled services such as automatic name (voice) dialing, name (voice) message retrieval and playback, voice messaging, and call screening.
Many current implementations of speech recognition technology are limited to same-speaker recognition. For example, current state-of-the-art voice name dialing requires a subscriber to “train” a set of names by repeatedly speaking them to form a name list. Subsequently, constrained by this set, the speech recognizer will recognize another spoken sample as one of the names in the set and dial the associated directory number. Such current systems do not provide for voice name dialing from untrained names or lists. In addition, such current systems do not provide for cross-speaker recognition, in which a name spoken by a subscriber may be recognized as the same name spoken by an incoming caller of the subscriber.
Many current types of speaker-trained speech recognition technologies are also whole-word or template-based, rather than sub-word (phoneme) based. Such whole-word or template-based speech recognition technologies attempt to match one acoustic signal to another acoustic signal, generating a distinct and separate statistical model for every word in the recognizer vocabulary set. Such speech recognition technology is highly user specific, and generally does not provide for recognition of the speech of a different speaker. In addition, such template-based speech recognition is impractical, expensive and difficult to implement, especially in telecommunication systems.
As a consequence, a need remains for an apparatus, method and system for speech recognition that is capable of recognizing the speech of more than one user, namely, having capability for cross-speaker speech recognition. In addition, such cross-speaker recognition should be sub-word or phoneme-based, rather than whole word or template-based. Such cross-speaker speech recognition should also have high discrimination capability, high noise immunity, and should be user friendly. Preferably, such cross-speaker speech recognition should also utilize a “hidden Markov model” for greater accuracy. In addition, such cross-speaker speech recognition technology should be capable of cost-effective implementation in advanced telecommunication applications and services, such as automatic name (voice) dialing, message management, call return management, and incoming call screening.
SUMMARY OF THE INVENTION
The apparatus, method and system of the present invention provide sub-word, phoneme-based, cross-speaker speech recognition, and are especially suited for telecommunication applications such as automatic name dialing, automatic message creation and management, incoming call screening, call return management, message playback management, and name list generation.
The various embodiments of the present invention provide such cross-speaker speech recognition using a methodology that offers both high discrimination and high noise immunity by matching, or colliding, two different speech models. A phoneme- or subword-based pattern matching process is implemented using a “hidden Markov model” (“HMM”). First, a phoneme-based pattern or transcription of incoming speech, such as a spoken name, is created using an HMM-based recognizer with speaker-independent phoneme models and an unconstrained grammar, in which any phoneme may follow any other phoneme. Second, using an HMM-based recognizer with a constrained grammar, the incoming speech is used to select or “recognize” the closest match, if any, of the incoming speech to an already existing phoneme pattern representing a name or word; that is, recognition is constrained by existing patterns, such as phoneme patterns representing names. The methodology of the invention then determines two likelihood-of-fit parameters: a likelihood of fit of the incoming speech to the unconstrained, speaker-independent model, and a likelihood of fit of the incoming speech to the selected or recognized existing pattern. Based upon a comparison of these likelihood-of-fit parameters, the various embodiments of the present invention determine whether the incoming speech matches or, as used equivalently herein, collides with a particular name or word. Such matches or “collisions” are then used for various telecommunication applications, such as automatic voice (name) dialing, call return management, message management, and incoming call screening.
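The difference between the unconstrained and constrained recognizers can be illustrated with a toy sketch. It is not the patented implementation: it assumes each speech frame has already been scored as a log-likelihood per phoneme, ignores HMM transition probabilities, and uses hypothetical names (unconstrained_score, constrained_score, recognize) chosen only for illustration. Under those assumptions, the unconstrained grammar lets every frame take its best phoneme, while the constrained grammar must follow the phoneme order of an existing pattern.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Assumption: a frame maps each phoneme label to a log-likelihood.
Frame = Dict[str, float]
NEG_INF = float("-inf")


def unconstrained_score(frames: Sequence[Frame]) -> float:
    """Best path when any phoneme may follow any other (no transition costs):
    each frame independently takes its best-scoring phoneme."""
    return sum(max(frame.values()) for frame in frames)


def constrained_score(frames: Sequence[Frame], pattern: Sequence[str]) -> float:
    """Best left-to-right alignment of the frames to one existing phoneme
    pattern; every phoneme of the pattern must cover at least one frame."""
    n = len(pattern)
    best: List[float] = [NEG_INF] * n
    for t, frame in enumerate(frames):
        new = [NEG_INF] * n
        for j, phoneme in enumerate(pattern):
            if t == 0:
                prev = 0.0 if j == 0 else NEG_INF
            else:
                # Stay in the same phoneme or advance from the previous one.
                prev = max(best[j], best[j - 1] if j > 0 else NEG_INF)
            if prev > NEG_INF:
                new[j] = prev + frame.get(phoneme, NEG_INF)
        best = new
    return best[-1] if n else NEG_INF


def recognize(
    frames: Sequence[Frame], patterns: Dict[str, Sequence[str]]
) -> Tuple[Optional[str], float]:
    """Select the existing pattern, if any, with the highest constrained score."""
    if not patterns:
        return None, NEG_INF
    scores = {key: constrained_score(frames, p) for key, p in patterns.items()}
    winner = max(scores, key=scores.get)
    return winner, scores[winner]
```

In this simplified setting the constrained score can never exceed the unconstrained score, which is why the gap between the two is informative about how well the incoming speech fits the selected pattern.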
A method for cross-speaker speech recognition for telecommunication systems, in accordance with the present invention, includes receiving incoming speech, such as a caller name, generating a phonetic transcription of the incoming speech with an HMM-based, speaker-independent model having an unconstrained phoneme grammar, and determining a transcription parameter as a likelihood of fit of the incoming speech to the speaker-independent, unconstrained grammatical model. The method also selects a first existing phoneme pattern, if any, from a plurality of existing phoneme patterns, as having a highest likelihood of fit to the incoming speech, and also determines a recognition parameter as a likelihood of fit of the incoming speech to the first existing phoneme pattern. The method then determines whether the input speech matches the first existing phoneme pattern based upon a correspondence of the transcription parameter with the recognition parameter in accordance with a predetermined criterion, such as whether a ratio of the two parameters is above or below a predetermined, empirical threshold.
In the various embodiments, the plurality of existing phoneme patterns are generated from a plurality of speakers, such as from subscribers and incoming callers. The incoming speech may also be from any speaker of a plurality of speakers. The plurality of phoneme patterns, in the preferred embodiment, form lists for use by a subscriber, such as a name list, a message list, or both. Any given name may be associated with a variety of phoneme patterns or samples, generated by different speakers, such as by the subscriber and by various incoming callers.
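One way to picture such a list is as a mapping from an entry to several phoneme patterns, each tagged with the kind of speaker who produced it. The sketch below is only an illustration under assumptions: the class names PhonemePattern and NameEntry, the entry keys, and the optional directory_number field are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class PhonemePattern:
    """One phonetic transcription of a name, as spoken by one speaker."""
    phonemes: Tuple[str, ...]
    speaker: str  # hypothetical label, e.g. "subscriber" or "caller"


@dataclass
class NameEntry:
    """A single name with every phoneme pattern collected for it so far."""
    patterns: List[PhonemePattern] = field(default_factory=list)
    directory_number: Optional[str] = None  # assumed field for name dialing


def add_pattern(
    name_list: Dict[str, NameEntry],
    entry_key: str,
    phonemes: Tuple[str, ...],
    speaker: str,
) -> NameEntry:
    """Attach a new sample to an entry, creating the entry if needed."""
    entry = name_list.setdefault(entry_key, NameEntry())
    entry.patterns.append(PhonemePattern(phonemes, speaker))
    return entry
```

Keeping several patterns per entry lets a later sample be compared against transcriptions from both the subscriber and incoming callers.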
Cross-speaker recognition is provided when a name, as a phoneme pattern spoken by one individual, matches (or collides with) a phoneme pattern spoken by another individual. For example, a name as spoken by an incoming caller (a person who calls a subscriber) may be recognized as the same name as spoken by the subscriber for automatic call returning.
In the preferred embodiment, the matching or collision determination is performed by comparing the transcription parameter to the recognition parameter to form a confidence ratio. When the confidence ratio is less than a predetermined threshold, the method determines that the input speech matches the first existing phoneme pattern; and when the confidence ratio is not less than the predetermined threshold, the method determines that the input speech does not match the first existing phoneme pattern.
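The decision rule can be sketched in a few lines. This excerpt does not give the exact form of the confidence ratio, so the sketch assumes it is the transcription likelihood divided by the recognition likelihood, computed from log-likelihoods; the function names and the threshold values in any usage are hypothetical.

```python
import math


def confidence_ratio(transcription_loglik: float, recognition_loglik: float) -> float:
    """Assumed form: likelihood of the unconstrained transcription divided by
    the likelihood of the recognized existing pattern, from log-likelihoods."""
    return math.exp(transcription_loglik - recognition_loglik)


def is_match(
    transcription_loglik: float, recognition_loglik: float, threshold: float
) -> bool:
    """Match (collision) when the ratio is less than the threshold;
    no match when the ratio is not less than the threshold."""
    return confidence_ratio(transcription_loglik, recognition_loglik) < threshold
```

With this assumed form, a ratio near 1 means the recognized pattern explains the speech almost as well as the unconstrained transcription does, so a small ratio indicates a match. The threshold itself is described in the text as predetermined and empirical, so any value used with this sketch would need tuning.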
The embodiments are also utilized to generate various lists, such as a name list for automatic name dialing. Generating the name list includes receiving, as incoming speech, a first sample of a name and performing collision or matching determination on the first sample. When the first sample does not match the first existing phoneme pattern, a transcription of the first sample is (initially
Curt Carol Lynn
Sukkar Rafid Antoon
Wisowaty John Joseph
Abebe Daniel
Banks-Harold Marsha D.
Gamburd Nancy R.
Lucent Technologies - Inc.