Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition
Reexamination Certificate
2000-05-05
2003-07-08
McFadden, Susan (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S244000
Reexamination Certificate
active
06591235
ABSTRACT:
BACKGROUND
1. Technical Field
The present invention generally relates to high dimensional data and, in particular, to methods for mining and visualizing high dimensional data through Gaussianization.
2. Background Description
Density Estimation in high dimensions is very challenging due to the so-called “curse of dimensionality”. That is, in high dimensions, data samples are often sparsely distributed. Thus, density estimation requires very large neighborhoods to achieve sufficient counts. However, such large neighborhoods could cause neighborhood-based techniques, such as kernel methods and nearest neighbor methods, to be highly biased.
The exploratory projection pursuit density estimation algorithm (hereinafter also referred to as the “exploratory projection pursuit”) attempts to overcome the curse of dimensionality by constructing high dimensional densities via a sequence of univariate density estimates. At each iteration, one finds the most non-Gaussian projection of the Xcurrent data, and transforms that direction to univariate Gaussian. The exploratory projection pursuit is described by J. H. Friedman, in “Exploratory Projection Pursuit”, J. American Statistical Association, Vol. 82, No. 397, pp. 249-66, 1987.
Recently, independent component analysis has attracted a considerable amount of attention. Independent component analysis attempts to recover the unobserved independent sources from linearly mixed observations. This seemingly difficult problem can be solved by an information maximization approach that utilizes only the independence assumption on the sources. Independent component analysis can be applied for source recovery in digital communication and in the “cocktail party” problem. A review of the current status of independent component analysis is described by Bell et al., in “A Unifying Information-Theoretic Framework for Independent Component Analysis”, International Journal on Mathematics and Computer Modeling, 1999. Independent component analysis has been posed as a parametric probabilistic model, and a maximum likelihood EM algorithm has been derived, by H. Attias, in “Independent Factor Analysis”, Neural Computation, Vol. 11, pp. 803-51, May 1999.
Parametric density models, in particular Gaussian mixture density models, are the most widely applied models in large scale high dimensional density estimation because they offer decent performance with a relatively small number of parameters. In fact, to limit the number of parameters in large tasks such as automatic speech recognition, one assumes only mixtures of Gaussians with diagonal covariances. There are standard EM algorithms to estimate the mixture coefficients and the Gaussian means and covariance. However, in real applications, these parametric assumptions are often violated, and the resulting parametric density estimates can be highly biased. For example, mixtures of diagonal Gaussians
p
⁡
(
χ
)
=
∑
i
=
1
1
⁢
⁢
π
i
⁢
G
⁡
(
χ
,
μ
i
,
Σ
i
)
roughly assume that the data is clustered, and within each cluster the dimensions are independent and Gaussian distributed. However, in practice, the dimensions are often correlated within each cluster. This leads to the need for modeling the covariance of each mixture component. The following “semi-tied” covariance has been proposed:
&Sgr;
i=
A&Lgr;
i
A
T
where A is shared and for each component, &Lgr;
i
is diagonal. This semi-tied co-variance is described by: M. J. F. Gales, in “Semi-tied Covariance Matrices for Hidden Markov Models”, IEEE Transactions Speech and Audio Processing, Vol. 7, pp. 272-81, May 1999; and R. A. Gopinath, in “Constrained Maximum Likelihood Modeling with Gaussian Distributions”, Proc. of DARPA Speech Recognition Workshop, Feb. 8-11, Lansdowne, Va., 1998. Semi-tied covariance has been reported in the immediately preceding two articles to significantly improve the-performance of large vocabulary continuous speech recognition systems.
Accordingly, there is a need for a method that transforms high dimensional data into a standard Gaussian distribution which is computationally efficient.
SUMMARY OF THE INVENTION
The present invention is directed to methods for high dimensional data mining and visualization via Gaussianization. These methods are premised upon a method referred to herein as “Gaussianization”, which transforms high dimensional data into a standard Gaussian distribution. The Gaussianization method is an iterative method that converges. As is shown herein, the Gaussianization method is computationally more efficient than prior art methods that transform high dimensional data into Gaussian variables.
An iterative expectation maximization (EM) method is also provided herein, which increases the auxiliary function Q of the EM method with each iteration. The EM method of the invention may be advantageously employed in the Gaussianization method of the invention to obtain convergence.
According to a first aspect of the invention, a method is provided formining high dimensional data. The method includes the step of linearly transforming the high dimensional data into less dependent coordinates, by applying a linear transform of n rows by n columns to the high dimensional data. Each of the coordinates are marginally Gaussianized, the Gaussianization being characterized by univariate Gaussian means, priors, and variances. The transforming and Gaussianizing steps are iteratively repeated until the coordinates converge to a standard Gaussian distribution. The coordinates of all iterations are arranged hierarchically to facilitate data mining. The arranged coordinates are then mined.
According to a second aspect of the invention, the transforming step further comprises the step of applying an iterative maximum likelihood expectation maximization (EM) method to the high dimensional data.
According to a third aspect of the invention, the method further comprises the step of computing a log likelihood of the high dimensional data, prior to the transforming step.
According to a fourth aspect of the invention, the EM method comprises the step of computing an auxiliary function Q of the EM method based upon the log likelihood of the high dimensional data. The univariate Gaussian priors are updated, to maximize the auxiliary function Q. The univariate Gaussian variances, the linear transform, and the univariate Gaussian means, are respectively updated to maximize the auxiliary function Q. The transform is updated in the preceding step row by row. The second updating step is repeated, until the auxiliary function Q converges to a local maximum. The computing step and the second updating step are repeated, until the log likelihood of the high dimensional data converges to a local maximum.
According to a fifth aspect of the invention, the linear transform is fixed, when the univariate Gaussian variances are updated.
According to a sixth aspect of the invention, the univariate Gaussian variances are fixed, when the linear transform is updated.
According to a seventh aspect of the invention, the linear transform is fixed, when the univariate Gaussian means are updated.
According to an eighth aspect of the invention, the arranging step hierarchically arranges the coordinates of all the iterations in a tree structure.
According to a ninth aspect of the invention, a method is provided for visualizing high dimensional data. The method includes the step of linearly transforming the high dimensional data into less dependent coordinates, by applying a linear transform of n rows by n columns to the high dimensional data. Each of the coordinates are marginally Gaussianized, the Gaussianization being characterized by univariate Gaussian means, priors, and variances. The transforming and Gaussianizing steps are iteratively repeated until the coordinates converge to a standard Gaussian distribution. The coordinates of all iterations are arranged hierarchically into high dimensional data sets to facilitate data visualization. The high dimensional data sets are then visualized.
These and other aspects, features an
Chen Scott Shaobing
Gopinath Ramesh Ambat
F. Chau & Associates LLP
McFadden Susan
LandOfFree
High dimensional data mining and visualization via... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with High dimensional data mining and visualization via..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and High dimensional data mining and visualization via... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3075609