Data processing: speech signal processing, linguistics, language – Speech signal processing – Recognition
Reexamination Certificate
Filed: 1999-11-22
Issued: 2003-02-11
Examiner: Banks-Harold, Marsha D. (Department: 2654)
US Classes: C704S250000, C704S256000
Status: active
Patent number: 06519563
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to the field of speaker verification systems and more particularly to a method for creating background models for use therewith.
BACKGROUND OF THE INVENTION
Speaker verification is the process of verifying the identity of a speaker based upon an analysis of a sample of his or her speech using previously saved information. More particularly, speaker verification consists of making a determination as to whether the identity of a speaker is, in fact, the same as an identity being claimed therefor (usually by the speaker himself or herself). Some applications of speaker verification include, for example, access control for a variety of purposes, such as for telephones, computer networks, databases, bank accounts, credit-card funds, automatic teller machines, building or office entry, etc. Automatic verification of a person's identity based upon his or her voice is quite convenient for users, and, moreover, it typically can be implemented in a less costly manner than many other biometric methods such as, for example, fingerprint analysis. In addition, speaker verification is fully non-intrusive, unlike such other biometric methods. For these reasons, speaker verification has recently become of particular importance in mobile and wireless applications.
Typically, speaker verification is performed based upon previously saved information which, at least in part, represents particular vocal characteristics of the speaker whose identity is to be verified. Specifically, the speech signal which results from a speaker's “test” utterance (i.e., an utterance offered for the purpose of verifying the speaker's identity) is analyzed to extract certain acoustic “features” of the speech signal. Then, these features are compared with corresponding features which have been extracted from previously uttered speech spoken by the same individual.
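The framing-and-feature step described above can be sketched as follows. This is a deliberately simplified, hypothetical illustration: real verification systems extract richer spectral features (e.g., cepstral coefficients), whereas this toy extractor produces just one log-energy value per analysis frame.

```python
import numpy as np

def extract_features(signal, frame_len=256, hop=128):
    """Toy acoustic feature extractor: one log-energy value per frame.

    Real systems compute spectral/cepstral features per frame; the
    log-energy here only illustrates slicing the speech signal into
    overlapping frames and reducing each to a feature value.
    """
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Log of the frame's energy, floored to avoid log(0)
        feats.append(np.log(np.sum(frame ** 2) + 1e-10))
    return np.array(feats)

# Example: a synthetic 2048-sample "utterance" yields 15 frames
rng = np.random.default_rng(0)
feats = extract_features(rng.standard_normal(2048))
```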
The previously uttered speech which is used for comparison purposes most commonly, but not necessarily, consists of a number of repetitions of the same word or phrase as the one which is to be spoken as the “test” utterance. In any case, the previously uttered speech is referred to as “training” speech, and it is provided to the system as part of an “enrollment” session. If the same word or phrase is used for both the training utterances and the test utterance, the process is referred to as “text dependent” or “fixed phrase” speaker verification. If, on the other hand, the speaker is permitted to use any speech as a test utterance, the process is referred to as “text independent” speaker verification, and operates based solely on the general vocal characteristics of the speaker. The latter approach clearly provides more flexibility, but it is not nearly as robust in terms of verification accuracy as a fixed phrase approach.
Specifically, the speaker's claimed identity is verified (or not), based on the results of a comparison between the features of the speaker's test utterance and those of the training speech. In particular, the previously uttered speech samples are used to produce speech “models” which may, for example, comprise stochastic models such as hidden Markov models (HMMs), well known to those of ordinary skill in the art. (Note that in the case of text independent speaker verification, these models are typically atemporal models, such as, for example, one state HMMs, thereby capturing the general vocal characteristics of the speaker but not the particular selection and ordering of the uttered phonemes.)
The model which is used for comparison with the features extracted from the speech utterance is known as a “speaker dependent” model, since it is generated from training speech of a particular, single speaker. Models which are derived from training speech of a plurality of different speakers are known as “speaker independent” models, and are commonly used, for example, in speech recognition tasks. In its simplest form, speaker verification may be performed by merely comparing the test utterance features against those of the speaker dependent model, determining a “score” representing the quality of the match therebetween, and then making the decision to verify (or not) the claimed identity of the speaker based on a comparison of the score to a predetermined threshold. One common difficulty with this approach is that it is particularly difficult to set the threshold in a manner which results in a reasonably high quality of verification accuracy (i.e., the infrequency with which misverification—either false positive or false negative results—occurs). In particular, the predetermined threshold must be set in a speaker dependent manner—the same threshold that works well for one speaker is not likely to work well for another.
Addressing this problem, it has long since been determined that a substantial increase in verification accuracy can be obtained if a speaker independent “background model” is also compared to and scored against the test utterance, and if the ratio of the scores (i.e., the score from the comparison with the speaker dependent model divided by the score from the comparison with the background model) is compared to a predetermined threshold instead. Moreover, in this case, it is usually possible to choose a single predetermined value for the threshold, used for all speakers to be verified (hereinafter referred to as “customers”), and to obtain a high quality level of verification accuracy therewith. Both of these advantages of using a background model for comparison purposes result from the effect of doing so on the probability distributions of the resultant scores. In particular, using such a background model increases the separation between the probability distribution of the actual customer scores (i.e., the scores achieved when the person who actually trained the speaker dependent model provides the test utterance) and the probability distribution of impostor scores (i.e., the scores achieved when some other person provides the test utterance). Thus, it is easier to set an appropriate threshold value, and the accuracy of the verification results improves.
Some studies of speaker verification systems using speaker independent background models advocate that the background model should be derived from speakers which have been randomly selected from a speaker independent database. (See, e.g., D. Reynolds, “Speaker Identification and Verification Using Gaussian Mixture Speaker Models,” Speech Communication, vol. 17: 1-2, 1995.) Other studies suggest that speakers which are acoustically “close” to the person having the claimed identity (i.e., “cohort” speakers) should be selected for use in generating the background model, since these speakers are representative of the population near the claimed speaker. (See, e.g., A. E. Rosenberg et al., “The Use of Cohort Normalized Scores for Speaker Verification,” Proc. Int. Conf. on Spoken Language Processing, Banff, Alberta, Canada, 1992.) By using such a selection of speakers, this latter approach claims to improve the selectivity of the system as against voices which are similar to that of the customer, thereby reducing the false acceptance rate of the system.
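The cohort idea can be sketched as a nearest-neighbor selection over speaker models. The version below is a deliberate simplification: it summarizes each background speaker by a single mean feature vector and ranks speakers by Euclidean distance to the customer's mean, whereas real cohort selection typically uses likelihood-based distances between full models.

```python
import numpy as np

def select_cohort(customer_mean, background_means, k=3):
    """Pick the k background speakers acoustically 'closest' to the
    customer, here approximated by Euclidean distance between the
    speakers' mean feature vectors (a toy closeness measure)."""
    dists = np.linalg.norm(background_means - customer_mean, axis=1)
    # Indices of the k nearest background speakers
    return np.argsort(dists)[:k].tolist()
```

A background model built from such a cohort is, by construction, representative of the population near the claimed speaker, which is the source of the improved selectivity against similar voices noted above.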
Specifically, most state-of-the-art fixed phrase (i.e., text dependent) speaker verification systems verify the identity of the speaker through what is known in the art as a Neyman-Pearson test, based on a normalized likelihood score of a spoken password phrase. (See, e.g., A. L. Higgins et al., “Speaker Verification Using Randomized Phrase Prompting,” Digital Signal Processing, 1:89-106, 1991.) If λc is the customer model (i.e., the speaker dependent model generated from the enrollment session performed by the particular customer) and λb is the background model, then given some set of acoustic observations X (i.e., features derived from the test utterance), the normalized score s_norm(X, λc) is typically computed as the ratio of the “likelihoods” as follows:

s_norm(X, λc) = p(X | λc) / p(X | λb)
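In practice the likelihood ratio is usually evaluated in the log domain, where it becomes a difference of log-likelihoods; this avoids numerical underflow when the per-frame likelihoods of a long utterance are multiplied together. A minimal sketch follows; the default threshold of 0.0 is an arbitrary placeholder for illustration, not a value taken from the patent.

```python
def normalized_log_score(ll_customer, ll_background):
    # log s_norm = log p(X | λc) - log p(X | λb)
    return ll_customer - ll_background

def verify(ll_customer, ll_background, threshold=0.0):
    """Neyman-Pearson-style decision: accept the claimed identity
    iff the normalized score clears the single, speaker-independent
    threshold (placeholder value)."""
    return normalized_log_score(ll_customer, ll_background) >= threshold
```

Because the background-model likelihood appears in the denominator of the ratio (a subtraction in the log domain), conditions that depress both likelihoods equally, such as channel mismatch, tend to cancel, which is part of why a single threshold can serve all customers.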
Inventors: Lee Chin-Hui, Li Qi P., Siohan Olivier, Surendran Arun Chandrasekaran
Abebe Daniel, Banks-Harold Marsha D., Brown Kenneth M.
Assignee: Lucent Technologies Inc.