Reexamination Certificate
2000-09-29
2004-02-24
McFadden, Susan (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S250000
active
06697779
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to speech or voice recognition systems and more particularly to user authentication by speech or voice recognition.
BACKGROUND OF THE INVENTION
The field of user authentication has received increasing attention over the past decade. To enable around-the-clock availability of more and more personal services, many sophisticated transactions have been automated, and remote database access has become pervasive. This, in turn, heightened the need to automatically and reliably establish a user's identity. In addition to standard password-type information, it is now possible to include, in some advanced authentication systems, a variety of biometric data, such as voice characteristics, retina patterns, and fingerprints.
In the context of voice processing, two areas of focus can be distinguished. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity of a speaker based upon an utterance. Collectively, they refer to the automatic recognition of a speaker (i.e., speaker authentication) on the basis of individual information present in the speech waveform. Most applications in which a voice sample is used as a key to confirm the identity of a speaker are classified as speaker verification. Many of the underlying algorithms, however, can be applied to both speaker identification and verification.
Speaker authentication methods may be divided into text-dependent and text-independent methods. Text-dependent methods require the speaker to say key phrases having the same text for both training and recognition trials, whereas text-independent methods do not rely on a specific text to be spoken. Text-dependent systems offer the possibility of verifying the spoken key phrase (assuming it is kept secret) in addition to the speaker identity, thus resulting in an additional layer of security. This is referred to as the dual verification of speaker and verbal content, which is predicated on the user maintaining the confidentiality of his or her pass-phrase.
On the other hand, text-independent systems offer the possibility of prompting each speaker with a new key phrase every time the system is used. This provides essentially the same level of security as a secret pass-phrase without burdening the user with the responsibility of safeguarding and remembering the pass-phrase. This is because prospective impostors cannot know in advance what random sentence will be requested and therefore cannot (easily) play back illegally pre-recorded voice samples from a legitimate user. However, implicit verbal content verification must still be performed to be able to reject such potential impostors. Thus, in both cases, the additional layer of security may be traced to the use of dual verification.
In all of the above, the technology of choice to exploit the acoustic information is hidden Markov modeling (HMM) using phonemes as the basic acoustic units. Speaker verification relies on speaker-specific phoneme models while verbal content verification normally employs speaker-independent phoneme models. These models are represented by Gaussian mixture continuous HMMs, or tied-mixture HMMs, depending on the training data. Speaker-specific models are typically constructed by adapting speaker-independent phoneme models to each speaker's voice. During the verification stage, the system concatenates the phoneme models appropriately, according to the expected sentence (or broad phonetic categories, in the non-prompted text-independent case). The likelihood of the input speech matching the reference model is then calculated and used for the authentication decision. If the likelihood is high enough, the speaker/verbal content is accepted as claimed.
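The accept/reject step described above can be illustrated with a minimal sketch. It is not an HMM: as a simplifying assumption, the reference model is a single diagonal-covariance Gaussian, and the names `diag_gaussian_loglik`, `average_loglik`, `accept`, and the threshold value are hypothetical. Only the final step is shown, namely scoring the input frames against a reference model and comparing the likelihood with a threshold.

```python
import math

def diag_gaussian_loglik(frame, mean, var):
    """Log-density of one feature vector under a diagonal-covariance Gaussian.

    Assumption: this single Gaussian stands in for the concatenated
    phoneme HMMs that a real system would use as the reference model.
    """
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )

def average_loglik(frames, mean, var):
    """Per-frame average log-likelihood, so the score does not grow with utterance length."""
    return sum(diag_gaussian_loglik(f, mean, var) for f in frames) / len(frames)

def accept(frames, mean, var, threshold):
    """Accept the claimed identity when the average log-likelihood reaches the threshold."""
    return average_loglik(frames, mean, var) >= threshold

# Hypothetical two-dimensional model and threshold, for illustration only.
model_mean = [0.0, 0.0]
model_var = [1.0, 1.0]
close_frames = [[0.0, 0.0], [0.1, -0.1]]
far_frames = [[5.0, 5.0]]
print(accept(close_frames, model_mean, model_var, threshold=-5.0))
print(accept(far_frames, model_mean, model_var, threshold=-5.0))
```

In practice the threshold would be tuned on held-out data to balance false acceptances against false rejections; the value used here is arbitrary.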
The crux of speaker authentication is the comparison between features of the input utterance and some stored templates, so it is important to select appropriate features for the authentication. Speaker identity is correlated with the physiological and behavioral characteristics of the speaker. These characteristics exist both in the spectral envelope (vocal tract characteristics) and in the supra-segmental features (voice source characteristics and dynamic features spanning several segments). As a result, the input utterance is typically represented by a sequence of short-term spectral measurements and their regression coefficients (i.e., the derivatives of the time function of these spectral measurements).
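As an illustration of the regression coefficients mentioned above, the following sketch computes first-order "delta" features with the commonly used linear-regression formula over a window of neighboring frames. The function name `delta_features`, the window size, and the edge handling (repeating the first and last frames) are assumptions made for this example, not details taken from the text.

```python
def delta_features(frames, window=2):
    """Estimate the time derivative of each spectral coefficient.

    Uses the regression formula
        d[t] = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2)
    with N = window. Frames beyond either end are replaced by the
    first or last frame (an assumption for this sketch).
    """
    num_frames = len(frames)
    dim = len(frames[0])
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, num_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas

# A coefficient that rises by one unit per frame has a slope of 1 away from the edges.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
print(delta_features(ramp, window=2)[2])  # [1.0]
```

Each input frame would then typically be extended with its delta vector before being passed to the modeling stage.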
Since HMMs can efficiently model statistical variation in such spectral features, they have achieved significantly better performance than less sophisticated template-matching techniques, such as dynamic time-warping. However, HMMs require the a priori selection of a suitable acoustic unit, such as the phoneme. This selection entails the need to adjust the authentication implementation from one language to another, just as speech recognition systems must be re-implemented when moving from one language to another. In addition, depending on the number of context-dependent phonemes and other modeling parameters, the HMM framework can become computationally intensive.
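For comparison, the dynamic time-warping technique named above can be sketched in a few lines. This is the textbook formulation with a Euclidean local distance and no path constraints; the function names `euclidean` and `dtw_distance` are chosen for this example and are not drawn from any particular system.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b):
    """Cumulative cost of the best monotonic alignment between two sequences.

    cost[i][j] holds the cheapest alignment of the first i frames of
    seq_a with the first j frames of seq_b.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_a advances, seq_b frame is reused
                cost[i][j - 1],      # seq_b advances, seq_a frame is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# The same contour spoken at a slower rate aligns with zero cost.
template = [[0.0], [1.0], [2.0]]
slower = [[0.0], [1.0], [1.0], [2.0]]
print(dtw_distance(template, slower))  # 0.0
```

Unlike an HMM, this compares the input against a single stored template and does not model statistical variation across repetitions, which is the limitation the paragraph above points to.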
SUMMARY OF THE INVENTION
A method and system for training a user authentication by voice signal are described. In one embodiment, during training, a set of all spectral feature vectors for a given speaker is globally decomposed into speaker-specific decomposition units and a speaker-specific recognition unit. During recognition, spectral feature vectors are locally decomposed into speaker-specific characteristic units. The speaker-specific recognition unit is used together with selected speaker-specific characteristic units to compute a speaker-specific comparison unit. If the speaker-specific comparison unit is within a threshold limit, then the voice signal is authenticated. In addition, a speaker-specific content unit is time-aligned with selected speaker-specific characteristic units. If the alignment is within a threshold limit, then the voice signal is authenticated. In one embodiment, if both thresholds are satisfied, then the user is authenticated.
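The combined decision in the last embodiment can be sketched as follows. This covers only the final logic and says nothing about how the comparison unit or the alignment is computed. It assumes, hypothetically, that both quantities are expressed as distance-like scores where smaller is better, so "within a threshold limit" means "not exceeding the threshold". All names here are illustrative.

```python
def authenticate(spectral_score, alignment_score,
                 spectral_threshold, alignment_threshold):
    """Accept the user only if both checks pass.

    Assumption: both scores are distances (lower means a closer match).
    spectral_score  stands for the speaker-specific comparison unit.
    alignment_score stands for the cost of the time alignment with the
                    speaker-specific content unit.
    """
    spectral_ok = spectral_score <= spectral_threshold
    alignment_ok = alignment_score <= alignment_threshold
    return spectral_ok and alignment_ok

# Hypothetical scores and thresholds, for illustration only.
print(authenticate(0.2, 1.5, spectral_threshold=0.5, alignment_threshold=2.0))  # True
print(authenticate(0.2, 3.0, spectral_threshold=0.5, alignment_threshold=2.0))  # False
```

Requiring both conditions mirrors the dual verification of speaker and verbal content: a correct voice with the wrong phrase, or the right phrase in the wrong voice, is rejected.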
REFERENCES:
patent: 5125022 (1992-06-01), Hunt et al.
patent: 5127043 (1992-06-01), Hunt et al.
patent: 5167004 (1992-11-01), Netsch et al.
patent: 5297194 (1994-03-01), Hunt et al.
patent: 5301109 (1994-04-01), Landauer et al.
patent: 5317507 (1994-05-01), Gallant
patent: 5325298 (1994-06-01), Gallant
patent: 5621859 (1997-04-01), Schwartz et al.
patent: 5675819 (1997-10-01), Schuetze
patent: 5712957 (1998-01-01), Waibel et al.
patent: 5839106 (1998-11-01), Bellegarda
patent: 5842165 (1998-11-01), Raman et al.
patent: 5867799 (1999-02-01), Lang et al.
patent: 5895448 (1999-04-01), Vysotsky et al.
Bellegarda, Jerome; “A Latent Semantic Analysis Framework For Large-Span Language Modeling”; Proc. EuroSpeech '97; Rhodes, Greece; Sep. 1997; pp. 1451-1454.
Bellegarda, Jerome; “A Multispan Language Modeling Framework For Large Vocabulary Speech Recognition”; IEEE Transactions On Speech And Audio Processing; vol. 6, No. 5; Sep. 1998; pp. 456-467.
Bellegarda Jerome
Naik Devang
Neeracher Matthias
Silverman Kim
Apple Computer Inc.
Blakely, Sokoloff, Taylor & Zafman LLP
McFadden Susan