Multi-resolution system and method for speaker verification

Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition

Reexamination Certificate


Details

C704S243000, C704S273000, C704S253000

active

06272463

ABSTRACT:

TECHNICAL FIELD
The present invention relates to digital speech processing, and more particularly, to verification of the identity of a given speaker.
BACKGROUND ART
Speech possesses multiple acoustic characteristics which vary greatly between individuals according to such diverse factors as vocal tract size, gender, age, native dialect, education, and idiosyncratic articulator movements. These factors are so specifically correlated to individual speakers that listeners often can readily determine the identity of a recognized speaker within the first few syllables heard. Considerable effort has been expended to develop artificial systems which can similarly determine and verify the identity of a given speaker.
Speaker verification systems may be broadly divided into free-text passphrase systems and text-dependent systems. Each type of system has its difficulties. To accommodate free-text passphrases, the storage and match processes must accommodate virtually any utterance. This higher acoustic-phonetic variability imposes longer training sessions in order to reliably characterize, or model, a speaker. In addition, free-text systems are not able to model speaker-specific co-articulation effects caused by the limited movements of the speech articulators. Moreover, the ability to accommodate virtually any utterance exists in tension with the ability to discriminate among a wide range of speakers: the greater the vocabulary range, the more challenging it is to simultaneously provide reliable word storage and discriminate among speakers.
Text-dependent systems, on the other hand, permit easier discrimination between multiple speakers. In text-dependent passphrase systems, one or more preselected passphrases are modeled for each individual user. The models reflect both the individual-specific acoustic characteristics and the lexical and syntactic content of the passphrase. In contrast to free-text systems, fairly short utterances (typically, just a few seconds) are adequate for training in text-dependent systems. However, too narrow a scope of acceptable text may make a text-dependent system more vulnerable to replay attack. Text-dependent systems can be further sub-classified as either fixed passphrase systems, where the passphrase is defined at design time, or freely chosen passphrase systems equipped with an online training procedure. The specific techniques utilized correspond generally to the recognized techniques of automatic speech recognition: acoustic templates, hidden Markov models (HMM), artificial neural networks, etc.
Text-prompted approaches with multiple passphrases were introduced in order to enhance security against playback recordings. Each verification session requires a speaker seeking to be verified to speak a different pseudo-random sequence of words for which the system has speaker-dependent models. Thus, the required verification sentence cannot be predicted in advance, inhibiting an unauthorized speaker from pre-recording the speech of an authorized user. With the current state of the art in speech processing, however, it is realistic to imagine a computer system which is equipped with a speech recognition engine, and which has the fixed vocabulary of text segments defined. If a prerecording of all text fragments of a certain speaker is available to the computer, a speech recognition engine could be used to decode the random combination of text prompted for, and a computer program could assemble the corresponding pre-recorded speech segments. Text-prompted systems do, however, suffer from the same co-articulation problems as free-text systems.
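The prompting step described above can be illustrated with a minimal sketch. It is not taken from the patent or from the cited references: the vocabulary, the function name `make_prompt`, and the prompt length are hypothetical, and the sketch assumes only that the system holds speaker-dependent models for every word in a fixed vocabulary.

```python
import secrets

# Hypothetical fixed vocabulary; a real system would use the words for which
# it holds speaker-dependent models.
VOCABULARY = ["zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine"]


def make_prompt(vocabulary, length, rng=None):
    """Return a list of `length` words drawn at random from `vocabulary`."""
    if length < 1:
        raise ValueError("length must be at least 1")
    # SystemRandom draws from the operating system's entropy source, so the
    # sequence cannot be reproduced from a known seed.
    rng = rng or secrets.SystemRandom()
    return [rng.choice(vocabulary) for _ in range(length)]


prompt = make_prompt(VOCABULARY, 4)
print(" ".join(prompt))
```

Drawing the words unpredictably prevents an impostor from knowing the sentence in advance. It does not address the attack noted above, in which recordings of every vocabulary word are reassembled on demand.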
A method called cohort normalization partially overcomes some problems of text-prompted systems by using likelihood ratio scoring. Cohort normalization is described, for example, in U.S. Pat. No. 5,675,704 to Juang et al. and U.S. Pat. No. 5,687,287 to Gandhi et al., the disclosures of which are hereby incorporated herein by reference. Likelihood ratio scoring requires that the same contexts be represented in the models of the different authorized speakers. Normalizing scores are obtained from individual reference speakers, or by models generated by pooling reference speakers. Models of bona fide registered speakers that are acoustically close to the claimant speaker are used for score normalization.
It has been shown that the cohort normalization technique can be viewed as providing a dynamic threshold which partially compensates for trial-to-trial variations. In particular, the use of cohort normalization scores compensates to some extent for microphone mismatch between a training session and subsequent test sessions. Cohort normalization has been successfully introduced in free-text systems as well, where a full acoustic model should be generated from each concurrent speaker. Speaker verification systems using cohort normalization are intrinsically language dependent, however, and speaker independent models are not commonly used for normalization purposes, mainly due to the mismatch in model accuracy of the speaker independent model and the rather poorly trained speaker dependent models.
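The dynamic-threshold view of cohort normalization can be illustrated with a short sketch. It is an assumption-laden illustration, not the method of the cited patents: the function names are hypothetical, the cohort is pooled by averaging likelihoods (one of several possible statistics), and the log-likelihood values are invented for demonstration.

```python
import math


def log_mean_exp(log_values):
    """Log of the arithmetic mean of exp(v), computed stably."""
    peak = max(log_values)
    total = sum(math.exp(v - peak) for v in log_values)
    return peak + math.log(total / len(log_values))


def cohort_normalized_score(claimant_loglik, cohort_logliks):
    """Log-likelihood ratio of the claimed speaker's model to the cohort.

    Assumption: the cohort is summarized by its average likelihood.
    """
    return claimant_loglik - log_mean_exp(cohort_logliks)


def accept(claimant_loglik, cohort_logliks, threshold=0.0):
    """Accept the identity claim if the normalized score exceeds the threshold."""
    return cohort_normalized_score(claimant_loglik, cohort_logliks) > threshold


# Invented scores for one utterance under matched recording conditions.
clean = cohort_normalized_score(-120.0, [-130.0, -128.0, -135.0])

# The same utterance with every log-likelihood lowered by 20, a crude
# stand-in for a microphone mismatch that degrades all models equally.
shifted = cohort_normalized_score(-140.0, [-150.0, -148.0, -155.0])

print(clean, shifted)
```

Because the claimant and cohort scores fall together, the ratio is essentially unchanged, whereas a fixed threshold on the raw claimant score would have rejected the second trial. Real mismatches do not shift all models equally, which is why the compensation is only partial.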
Speaker verification systems have characterized input utterances by use of speaker-specific sub-word size (e.g., phoneme-size) hidden Markov models (HMMs). This approach changes the key text each time the system is used, thereby addressing the problem of replay attack. The speaker-specific sub-word models can be generated by speaker adaptation of speaker-independent models. Speaker-dependent sub-word models are created for each reference speaker. These systems again need extensive training sessions.
The following references are pertinent to the present invention:
Higgins et al., “Speaker Verification Using Randomized Phrase Prompting,” Digital Signal Processing, March 1991, pp. 89-106.
A. E. Rosenberg et al., “The Use of Cohort Normalized Scores for Speaker Verification,” Proc. 1992 ICSLP, October 1992, pp. 599-602.
F. K. Soong et al., “A Vector Quantisation Approach to Speaker Verification,” IEEE 1985, pp. 387-390.
A. E. Rosenberg et al., “Sub-word Unit Talker Verification Using Hidden Markov Models,” IEEE 1990, pp. 269-272.
T. Masui et al., “Concatenated Phoneme Models for Text-variable Speaker Recognition,” IEEE 1993, pp. 391-394.
J. Kuo et al., “Speaker Set Identification Through Speaker Group Modeling,” BAMFF '92.
Each of the foregoing references in its entirety is hereby incorporated herein by reference.
SUMMARY OF THE INVENTION
In accordance with a preferred embodiment of the invention, there is provided a method for generating a speaker-dependent model of an utterance that has at least one occurrence. In this embodiment, the method includes the steps of generating an initial model, having a first resolution, that encodes each of the occurrences of the utterance and also generating at least one additional speaker-specific model, having a different resolution from that of the initial model, of all occurrences of the utterance.
In accordance with a further embodiment, the initial model is speaker independent. Also in further embodiments, the at least one additional model has higher resolution than the initial model and is bootstrapped from the initial model. In an alternative embodiment, the at least one additional model has lower resolution than the initial model and is derived from the initial model. In yet another embodiment, at least one of the models has a resolution on the subphoneme level. In accordance with a further embodiment, there is provided the additional step of determining the difference in the degree of match of the initial model and the at least one additional model to a new utterance, so as to permit discrimination on the basis of at least one of (i) speaker of the new utterance and (ii) content of the new utterance. An embodiment may additionally utilize the match to the initial model to enhance robustness against mismatches in training and use sessions. Another embodiment of the present invention includes a method of a speaker verification system for generating multi-resolution m
