Model adaptation of neural tree networks and other fused...

Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition

Reexamination Certificate


Classification: C704S250000

Status: active

Patent number: 06519561

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a system and method for adapting speaker verification models to achieve enhanced performance during verification and, in particular, to a subword-based speaker verification system capable of adapting a neural tree network (NTN), a Gaussian mixture model (GMM), a dynamic time warping (DTW) template, or combinations of the above, without requiring additional time-consuming retraining of the models.
The invention relates to the fields of digital speech processing and speaker verification.
2. Description of the Related Art
Speaker verification is a speech technology in which a person's identity is verified using a sample of his or her voice. In particular, speaker verification systems attempt to match the voice of the person whose identity is undergoing verification with a known voice. It provides an advantage over other security measures such as personal identification numbers (PINs) and personal information, because a person's voice is uniquely tied to his or her identity. Speaker verification provides a robust method for security enhancement that can be applied in many different application areas including computer telephony.
Within speaker recognition, the two main areas are speaker identification and speaker verification. A speaker identification system attempts to determine the identity of a person within a known group of people using a sample of his or her voice. In contrast, a speaker verification system attempts to determine whether a person's claimed identity (whom the person claims to be) is valid using a sample of his or her voice.
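The distinction can be made concrete with a toy sketch: identification picks the best-scoring speaker in a known group, while verification thresholds the score of the claimed speaker's model. The SpeakerModel class, its score() method, and the threshold value below are illustrative assumptions, not details from the patent.

```python
# Hypothetical sketch contrasting speaker identification with verification,
# assuming each enrolled speaker's model exposes a score() method returning a
# similarity score for an utterance's feature vectors.
import numpy as np


class SpeakerModel:
    """Toy stand-in for an enrolled speaker model (e.g., GMM, NTN, or DTW)."""

    def __init__(self, mean):
        self.mean = np.asarray(mean, dtype=float)

    def score(self, features):
        # Higher score = features look more like this speaker.
        return -np.mean(np.linalg.norm(features - self.mean, axis=1))


def identify(features, enrolled):
    """Identification: pick the best-matching speaker in a known group."""
    return max(enrolled, key=lambda name: enrolled[name].score(features))


def verify(features, claimed, enrolled, threshold=-1.0):
    """Verification: accept or reject a claimed identity via a threshold."""
    return enrolled[claimed].score(features) >= threshold


enrolled = {"alice": SpeakerModel([0.0, 0.0]), "bob": SpeakerModel([3.0, 3.0])}
test = np.random.default_rng(0).normal(loc=0.2, scale=0.3, size=(50, 2))
print(identify(test, enrolled))          # -> "alice"
print(verify(test, "alice", enrolled))   # -> True (score above threshold)
print(verify(test, "bob", enrolled))     # -> False
```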
Speaker verification consists of determining whether or not a speech sample provides a sufficient match to a claimed identity. The speech sample can be text dependent or text independent. Text dependent speaker verification systems verify the speaker after the utterance of a specific password phrase. The password phrase is determined by the system or by the user during enrollment and the same password is used in subsequent verification. Typically, the password phrase is constrained within a fixed vocabulary, such as a limited number of numerical digits. The limited number of password phrases gives the imposter a higher probability of discovering a person's password, reducing the reliability of the system.
A text independent speaker verification system does not require that the same text be used for enrollment and testing as in a text dependent speaker verification system. Hence, there is no concept of a password and a user will be recognized regardless of what he or she speaks.
Speech recognition and speaker verification tasks may involve large vocabularies in which the phonetic content of different vocabulary words overlaps substantially. Thus, storing and comparing whole-word patterns can be unduly redundant, since the constituent sounds of individual words are treated independently regardless of their identifiable similarities. For these reasons, conventional large-vocabulary speech recognition and text-dependent speaker verification systems build models based on phonetic subword units.
Conventional approaches to performing text-dependent speaker verification include statistical modeling, such as hidden Markov models (HMMs), or template-based modeling, such as dynamic time warping (DTW), for modeling speech. For example, subword models, as described in A. E. Rosenberg, C. H. Lee and F. K. Soong, “Subword Unit Talker Verification Using Hidden Markov Models”, Proceedings ICASSP, pages 269-272 (1990), and whole-word models, as described in A. E. Rosenberg, C. H. Lee and S. Gokeen, “Connected Word Talker Recognition Using Whole Word Hidden Markov Models”, Proceedings ICASSP, pages 381-384 (1991), have been considered for speaker verification and speech recognition systems. HMM techniques have the limitation of generally requiring a large amount of data to sufficiently estimate the model parameters.
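As a rough illustration of the template-based alternative mentioned above, the following sketch computes a dynamic time warping distance between a stored template and a test utterance; the feature dimensions, step pattern, and length normalization are assumptions chosen for brevity rather than the patent's specification.

```python
# Minimal dynamic time warping (DTW) sketch: the distance between a test
# utterance and an enrolled template is the cost of the best time alignment.
# Feature extraction and decision thresholds are omitted.
import numpy as np


def dtw_distance(template, test):
    """Return the DTW alignment cost between two (frames x dims) sequences."""
    n, m = len(template), len(test)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - test[j - 1])
            # Allow diagonal (match), vertical, and horizontal steps.
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return cost[n, m] / (n + m)  # length-normalized alignment cost


rng = np.random.default_rng(1)
template = rng.normal(size=(40, 12))                                 # 40 frames of features
stretched = np.repeat(template, 2, axis=0) + rng.normal(scale=0.1, size=(80, 12))
other = rng.normal(size=(60, 12))                                    # unrelated utterance
print(dtw_distance(template, stretched) < dtw_distance(template, other))  # True
```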
Other approaches include the use of neural tree networks (NTNs). The NTN is a hierarchical classifier that combines the properties of decision trees and neural networks, as described in A. Sankar and R. J. Mammone, “Growing and Pruning Neural Tree Networks”, IEEE Transactions on Computers, C-42:221-229, Mar. 1993. For speaker recognition, training data for the NTN consists of data for the desired speaker and data from other speakers. The NTN partitions feature space into regions, each assigned a probability reflecting how likely the desired speaker is to have generated a feature vector that falls within that region.
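The region-plus-probability idea behind the NTN can be sketched with an ordinary decision tree standing in for the neural tree network (the real NTN places simple neural discriminants at the internal nodes, which this stand-in does not reproduce): the tree is trained on target-speaker versus other-speaker features, and an utterance is scored by the average leaf probability of the target class.

```python
# Hedged sketch of NTN-style scoring using a plain decision tree as a stand-in.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
target = rng.normal(loc=0.0, size=(500, 12))      # target-speaker features
background = rng.normal(loc=3.0, size=(500, 12))  # "other speakers" features

X = np.vstack([target, background])
y = np.concatenate([np.ones(500), np.zeros(500)])  # 1 = target speaker

# Each leaf of the fitted tree corresponds to a region of feature space with
# an associated probability of belonging to the target speaker.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)


def speaker_score(features):
    """Average leaf probability that each frame came from the target speaker."""
    return tree.predict_proba(features)[:, 1].mean()


print(speaker_score(rng.normal(loc=0.0, size=(100, 12))))  # high for the target
print(speaker_score(rng.normal(loc=3.0, size=(100, 12))))  # low for an impostor
```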
The modeling techniques described above rely on speech being segmented into subwords. Modeling at the subword level expands the versatility of the system. Moreover, it is also conjectured that variations in speaking style among different speakers can be better captured by modeling at the subword level. Traditionally, segmentation and labeling of speech data were performed manually by a trained phonetician using listening and visual cues. However, this approach has several disadvantages, including the time-consuming nature of the task and the highly subjective decision-making that such manual processing requires.
One solution to the problem of manual speech segmentation is to use automatic speech segmentation procedures. Conventional automatic speech segmentation has used hierarchical and non-hierarchical approaches.
Hierarchical speech segmentation involves a multi-level, fine-to-coarse segmentation that can be displayed in a tree-like structure called a dendrogram. The initial segmentation is at a fine level, with the limiting case being one segment per feature vector. Thereafter, a segment is chosen to be merged with either its left or right neighbor according to a similarity measure. This process is repeated until the entire utterance is described by a single segment.
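A minimal sketch of this fine-to-coarse merging, assuming Euclidean distance between segment means as the similarity measure (the measure itself is not specified above):

```python
# Illustrative fine-to-coarse hierarchical segmentation: start with one
# segment per feature vector and repeatedly merge the most similar adjacent
# pair, recording each merge level (the record is what a dendrogram displays).
import numpy as np


def hierarchical_segmentation(frames):
    """Merge adjacent segments until one remains; return the merge history."""
    segments = [[i] for i in range(len(frames))]  # fine level: 1 frame/segment
    history = []
    while len(segments) > 1:
        means = [frames[s].mean(axis=0) for s in segments]
        # Find the most similar pair of neighboring segments.
        dists = [np.linalg.norm(means[i] - means[i + 1]) for i in range(len(means) - 1)]
        k = int(np.argmin(dists))
        segments[k] = segments[k] + segments[k + 1]
        del segments[k + 1]
        history.append([list(s) for s in segments])  # one dendrogram level
    return history


# Toy "utterance": three steady regions, so the 3-segment level should
# roughly recover the boundaries at frames 10 and 20.
frames = np.concatenate([np.full((10, 2), 0.0), np.full((10, 2), 5.0), np.full((10, 2), 9.0)])
frames += np.random.default_rng(3).normal(scale=0.2, size=frames.shape)
print(hierarchical_segmentation(frames)[-3])  # the level with 3 segments
```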
Non-hierarchical speech segmentation attempts to locate the optimal segment boundaries by using a knowledge-engineering-based rule set or by extremizing a distortion or score metric. Both hierarchical and non-hierarchical speech segmentation techniques have the limitation of requiring prior knowledge of the number of speech segments and corresponding segment models.
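The following sketch illustrates the distortion-extremizing variant under the stated limitation that the number of segments K must be supplied in advance; the within-segment squared-deviation distortion and the dynamic program are illustrative choices, not the patent's method.

```python
# Non-hierarchical segmentation sketch: given K, a dynamic program picks the
# boundaries that minimize total within-segment distortion.
import numpy as np


def segment_cost(frames, i, j):
    """Distortion of frames[i:j]: total squared deviation from their mean."""
    seg = frames[i:j]
    return float(((seg - seg.mean(axis=0)) ** 2).sum())


def optimal_segmentation(frames, K):
    """Return boundary indices [0, b1, ..., len(frames)] minimizing distortion."""
    n = len(frames)
    best = np.full((K + 1, n + 1), np.inf)
    back = np.zeros((K + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1, i] + segment_cost(frames, i, j)
                if c < best[k, j]:
                    best[k, j], back[k, j] = c, i
    # Trace boundaries back from the full utterance.
    bounds, j = [n], n
    for k in range(K, 0, -1):
        j = back[k, j]
        bounds.append(j)
    return bounds[::-1]


frames = np.concatenate([np.full(10, 0.0), np.full(12, 4.0), np.full(8, -3.0)]).reshape(-1, 1)
print(optimal_segmentation(frames, K=3))  # -> [0, 10, 22, 30]
```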
A technique that does not require prior knowledge of the number of clusters is known as “blind” clustering. This method is disclosed in U.S. patent application Ser. No. 08/827,562, entitled “Blind Clustering of Data With Application to Speech Processing Systems”, filed on Apr. 1, 1997, and its corresponding U.S. provisional application No. 60/014,537, entitled “Blind Speech Segmentation”, filed on Apr. 2, 1996, both of which are herein incorporated by reference. In blind clustering, the number of clusters is unknown when clustering is initiated. In the aforementioned application, a range from the minimum to the maximum number of clusters in a data sample is estimated. A clustering data sample includes objects sharing a common homogeneity property. An optimality criterion is defined that measures how well each candidate number of clusters fits the given clustering data sample, and the optimal number of clusters is determined from this criterion. The speech sample is then segmented according to the optimal number of segments and the optimal boundary locations between them.
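A hedged sketch of the overall blind-clustering loop: candidate cluster counts in an estimated range are each scored with an optimality criterion and the best one is kept. The silhouette score and k-means used below are stand-ins for the optimality criterion and clustering step of the referenced application, which are not reproduced here.

```python
# Blind-clustering loop sketch: the number of clusters is not known in
# advance, so each count in [k_min, k_max] is tried and scored.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def blind_cluster(frames, k_min=2, k_max=8):
    """Pick the cluster count in [k_min, k_max] that scores best, then cluster."""
    best_k, best_score, best_labels = None, -np.inf, None
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(frames)
        score = silhouette_score(frames, labels)  # stand-in optimality criterion
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels


rng = np.random.default_rng(4)
frames = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 4.0, 8.0)])
k, labels = blind_cluster(frames)
print(k)  # expected: 3 for three well-separated groups
```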
The blind segmentation method can be used in text-dependent speaker verification systems to segment an unknown password phrase into subword units. During enrollment in the speaker verification system, repetitions of the speaker's password are used by the blind segmentation module to estimate the number of subwords in the password and to locate the optimal subword boundaries. For each subword segment, a model such as a neural tree network or a Gaussian mixture model can then be used to model that subword's data.
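A rough sketch of that enrollment-and-scoring flow, assuming the blind segmentation step has already produced fixed subword boundaries (simulated below) and using one small Gaussian mixture model per subword; the function names and model sizes are illustrative.

```python
# Enrollment sketch: fit one GMM per subword segment, pooled over all
# password repetitions, then score a new repetition subword by subword.
import numpy as np
from sklearn.mixture import GaussianMixture


def enroll(repetitions, boundaries):
    """Fit one GMM per subword segment, pooled over all password repetitions."""
    models = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        pooled = np.vstack([rep[start:end] for rep in repetitions])
        models.append(GaussianMixture(n_components=2, random_state=0).fit(pooled))
    return models


def verify_score(utterance, boundaries, models):
    """Average per-frame log-likelihood of the utterance under the subword GMMs."""
    scores = [m.score(utterance[s:e])
              for m, s, e in zip(models, boundaries[:-1], boundaries[1:])]
    return float(np.mean(scores))


rng = np.random.default_rng(5)
boundaries = [0, 20, 40, 60]  # three fixed subword segments (stand-in for blind segmentation)


def make_repetition(shift):
    """Toy 60-frame utterance: three 20-frame 'subwords' with distinct means."""
    return np.vstack([rng.normal(loc=shift + s, size=(20, 12)) for s in (0.0, 2.0, 4.0)])


repetitions = [make_repetition(0.0) for _ in range(4)]   # enrollment passwords
models = enroll(repetitions, boundaries)
print(verify_score(make_repetition(0.0), boundaries, models))  # higher (matching speaker)
print(verify_score(make_repetition(3.0), boundaries, models))  # lower (mismatched)
```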
Further,
