Reexamination Certificate
2002-01-08
2004-07-06
Šmits, Tālivaldis Ivars (Department: 2655)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S234000, C704S246000
active
06760701
ABSTRACT:
BACKGROUND OF THE INVENTION
The invention is directed to an automatic speaker verification (ASV) system and method useful for storing and processing voice signals to automatically ascertain the identity of an individual.
1. Field of the Invention
The invention relates to the fields of digital speech processing and speaker recognition.
2. Description of Related Art
In many situations it is desirable to verify the identity of a person, such as a consumer. For example, in credit card transactions, it is important to confirm that a consumer presenting a credit card (or credit card number) to a merchant is authorized to use the credit card. Currently, the identity of the consumer is manually verified by the merchant. The back of the credit card contains a signature strip, which the consumer signs upon credit card issuance. The merchant compares the consumer's actual signature at the time of sale to the signature on the back of the credit card. If, in the merchant's judgment, the signatures match, the transaction is allowed to proceed.
Another prior art system places a photograph of an authorized user on the credit card. At the time of the transaction, the merchant compares the photograph on the card with the face of the person presenting the card. If there appears to be a match, the transaction is allowed to proceed.
However, these prior art methods have serious drawbacks. These systems are manual and consequently prone to human error. Signatures are relatively easy to forge, and differences between signatures and photographs may go unnoticed by inattentive merchants. Further, these systems cannot be used with credit card transactions that do not occur in person, for example, transactions that occur via telephone.
Voice verification systems, sometimes known as automatic speaker verification (ASV) systems, attempt to cure the deficiencies of these prior art methods. These systems attempt to match the voice of the person whose identity is undergoing verification with a known voice.
One type of voice recognition system is a text-dependent automatic speaker verification system. The text-dependent ASV system requires that the user speak a specific password or phrase (the “password”). This password is determined by the system or by the user during enrollment. However, in most text-dependent ASV systems, the password is constrained to be within a fixed vocabulary, such as a limited number of numerical digits. The limited number of password phrases gives an imposter a higher probability of discovering a person's password, reducing the reliability of the system.
Other text-independent ASV systems of the prior art utilize a user-selectable password. In such systems, the user enjoys the freedom to make up his/her own password with no constraints on vocabulary words or language. The disadvantage of these types of systems is that they increase the processing requirement of the system, because it is much more technically challenging to model and verify a voice pattern of an unknown transcript (i.e., a highly variable context).
Modeling of speech has been done at the phrase, word, and subword level. In recent years, several subword-based speaker verification systems have been proposed using either Hidden Markov Models (“HMM”) or Artificial Neural Network (“ANN”) references. Modeling at the subword level expands the versatility of the system. Moreover, it is also conjectured that the variations in speaking styles among different speakers can be better captured by modeling at the subword level.
Another challenge posed under real-life operating environments is that noise and background speech/music may be detected and considered as part of the password. Another problem with transmission or communications systems is that channel-specific distortion occurs over channels, such as transducers, telephone lines and telephone equipment, which connect users to the system. Further, ASV systems using modeling need to adapt to changes in the user and to prior successful and unsuccessful attempts at verification.
What is needed are reliable systems and methods for automatic speaker verification of user selectable phrases.
What is needed is a user-selectable ASV system in which accuracy is improved over prior ASV systems.
What is needed is a word or phrase detector which can identify key portions of spoken password phrases over background noise.
What is needed is channel adaptation to adapt a system in response to signals received over different channels.
What is needed is fusion adaptation to adapt a system in response to previous errors and successes.
What is needed is threshold adaptation to adapt a system in response to previous errors and successes.
What is needed is model adaptation to adapt a system's underlying model components in response to previous successes.
SUMMARY OF THE INVENTION
The voice print system of the present invention builds and improves upon existing ASV systems. The voice print system of the present invention is a subword-based, text-dependent automatic speaker verification system that embodies the capability of user-selectable passwords with no constraints on the choice of vocabulary words or the language. Automatic blind speech segmentation allows speech to be segmented into subword units without any linguistic knowledge of the password. Subword modeling is performed using a discriminant training-based classifier, namely a Neural Tree Network (NTN). The present NTN is a hierarchical classifier that combines the properties of decision trees and feed-forward neural networks. The system also takes advantage of such concepts as multiple classifier fusion and data resampling to successfully boost performance.
Key word/key phrase spotting is used to optimally locate the password. Channel adaptation removes the nonuniform effects of different environments which lead to varying channel characteristics, such as distortion. Channel adaptation is able to remove the characteristics of the test channel and/or enrollment channel to increase accuracy.
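As an illustration of the channel adaptation idea described above, the following is a minimal sketch using cepstral mean subtraction, a common channel-normalization technique that is assumed here for illustration and is not stated by the source to be the method of the invention. It relies on the fact that a stationary linear channel appears as an approximately constant additive offset on cepstral feature vectors. The function name `subtract_channel_mean` and the toy feature values are hypothetical.

```python
from typing import List


def subtract_channel_mean(frames: List[List[float]]) -> List[List[float]]:
    """Remove the per-coefficient mean taken over all frames of one utterance.

    Assumption: each frame is a cepstral feature vector, so a fixed channel
    contributes the same additive offset to every frame.
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Estimate the channel offset as the average of each coefficient.
    means = [sum(f[i] for f in frames) / n_frames for i in range(n_coeffs)]
    # Subtract that estimate from every frame.
    return [[f[i] - means[i] for i in range(n_coeffs)] for f in frames]


# Hypothetical data: the same utterance seen through two different channels.
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
channel_offset = [0.5, -1.5]
distorted = [[c + o for c, o in zip(f, channel_offset)] for f in clean]

normalized_clean = subtract_channel_mean(clean)
normalized_distorted = subtract_channel_mean(distorted)
```

After normalization the two versions of the utterance yield the same feature vectors, which is the sense in which the channel's contribution has been removed in this sketch.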
Fusion adaptation is used to dynamically change the weight accorded to the individual classifier models, which increases the flexibility of the system. Threshold adaptation dynamically alters the threshold necessary to achieve successful verification. Threshold adaptation is useful to incrementally change false-negative results. Model adaptation gives the system the capability to retrain the classifier models upon the occurrence of subsequent successful verifications.
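The interplay of classifier weights and the decision threshold can be sketched as follows. This is only an illustrative sketch under stated assumptions: the weighted-average fusion rule, the threshold update rule, and the names `fuse_scores`, `verify`, `adapt_threshold`, and `rate` are all hypothetical and are not taken from the source.

```python
from typing import List


def fuse_scores(scores: List[float], weights: List[float]) -> float:
    """Combine per-classifier scores as a weighted average (assumed rule)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


def verify(scores: List[float], weights: List[float], threshold: float) -> bool:
    """Accept the speaker when the fused score reaches the threshold."""
    return fuse_scores(scores, weights) >= threshold


def adapt_threshold(threshold: float, fused_score: float, rate: float = 0.1) -> float:
    """Hypothetical update after a confirmed false rejection.

    Moves the threshold a fraction `rate` of the way toward the fused score
    of the rejected genuine attempt; leaves it unchanged otherwise.
    """
    if fused_score >= threshold:
        return threshold
    return threshold - rate * (threshold - fused_score)


# Hypothetical scores from two classifier models for one verification attempt.
scores = [0.8, 0.6]
weights = [1.0, 1.0]
fused = fuse_scores(scores, weights)
lowered = adapt_threshold(0.75, fused, rate=0.5)
```

Changing `weights` shifts how much each classifier model contributes to the decision, while repeated calls to `adapt_threshold` move the acceptance threshold in small increments rather than in one jump.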
The voice print system can be employed for user validation for telephone services such as cellular phone services and bill-to-third-party phone services. It can also be used for account validation for information system access.
All ASV systems include at least two components, an enrollment component and a testing component. The enrollment component is used to store information concerning a user's voice. This information is then compared to the voice undergoing verification (testing) by the test component. The system of the present invention includes inventive enrollment and testing components, as well as a third, “bootstrap” component. The bootstrap component is used to generate data which assists the enrollment component to model the user's voice.
1. Enrollment Summary
An enrollment component is used to characterize a known user's voice and store the characteristics in a database, so that this information is available for future comparisons. The system of the present invention utilizes an improved enrollment process. During enrollment, the user speaks the password, which is sampled by the system. Analog-to-digital conversion (if necessary) is conducted to obtain digital speech samples. Preprocessing is performed to remove unwanted silence and noise from the voice sample, and to indicate portions of the voice sample which correspond to the user's voice.
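The silence-removal part of the preprocessing step can be illustrated with a simple frame-energy detector. This is a minimal sketch, assuming a fixed frame length and a threshold relative to the loudest frame; the source does not specify how silence is detected, and `remove_silence`, `frame_len`, and `rel_threshold` are hypothetical names chosen for this example.

```python
from typing import List


def frame_energy(frame: List[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def remove_silence(samples: List[float], frame_len: int = 4,
                   rel_threshold: float = 0.1) -> List[float]:
    """Keep only frames whose energy is at least rel_threshold times the peak.

    Samples are split into non-overlapping frames; a trailing partial frame
    is dropped for simplicity.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if not frames:
        return []
    energies = [frame_energy(f) for f in frames]
    peak = max(energies)
    if peak == 0.0:
        return []  # the whole signal is digital silence
    kept: List[float] = []
    for frame, energy in zip(frames, energies):
        if energy >= rel_threshold * peak:
            kept.extend(frame)
    return kept


# Hypothetical signal: near-silence, a burst of speech, then near-silence.
signal = [0.0, 0.01, -0.01, 0.0,
          0.5, -0.6, 0.7, -0.4,
          0.0, 0.0, 0.01, 0.0]
speech_only = remove_silence(signal)
```

In this toy signal only the middle frame survives, which also marks the portion of the sample attributed to the user's voice.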
Next, the transmission channel carrying the user's enrollment voice signal is examined. The characteristics of the enrollment channel are estimated and stored in a database. The databa
Mammone Richard J.
Sharma Manish
Zhang Xiaoyu
Merchant & Gould P.C.
T-NETIX, Inc.
Šmits Tālivaldis Ivars