Unsupervised identification of nonlinear data cluster in...

Image analysis – Pattern recognition – Classification

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C382S156000, C382S197000, C382S224000, C382S253000, C706S015000, C706S020000, C706S025000, C704S259000

Reexamination Certificate

active

06226408

ABSTRACT:

BACKGROUND
1. Field of Invention
The present invention relates generally to data processing systems, and more particularly to data processing systems for data mining, pattern recognition, and data classification using unsupervised identification of complex data patterns.
2. Background of the Invention
Statistical classification (or “learning”) methods attempt to segregate bodies of data into classes or categories based on objective parameters present in the data itself. Generally, classification methods may be either supervised or unsupervised. In supervised classification, training data containing examples of known categories are presented to a learning mechanism, which learns one more sets of relationships that define each of the known classes. New data may then be applied to the learning mechanism, which then classifies the new data using the learned relationships. Examples of supervised classification systems include linear regression, and certain types of artificial neural networks, such as backpropagation and learning vector quantization networks.
Because supervised learning relies on the identification of known classes in the training data, supervised learning is not useful in exploratory data analysis, such as database mining, where the classes of the data are to be discovered. Unsupervised learning is used in these instances to discover these classes and the parameters or relationships that characterize them.
Unsupervised classification attempts to learn the classification based on similarities between the data items themselves, and without external specification or reinforcement of the classes. Unsupervised learning includes methods such as cluster analysis or vector quantization. Cluster analysis attempts to divide the data into “clusters” or groups that ideally should have members that are very similar to each other, and very dissimilar to members of other clusters. Similarity is then measured using some distance metric which measures the distance between data items, and clusters together data items that are closer to each other. Well-known clustering techniques include MacQueen's K-means algorithm, and Kohonen's Self-Organizing Map algorithm.
One of the as of yet unattained goals of unsupervised learning has been a general learning method that could be applied in multiple stages to discover increasingly complex structures in the input. A specific case of this problem is that of identifying, or separating, highly nonlinear subspaces of data, perhaps surrounded by noise.
The unsupervised identification of nonlinear subspaces of data is an important problem in pattern recognition and data mining. Many data analysis problems reduce to this identification problem; examples include feature extraction, hierarchical cluster analysis and transformation-invariant identification of patterns. Although work in this field dates back to the 1930s, good solutions exist only for cases where the data occur in approximately linear subspaces, or in cases where the data points group into well-separated, disjoint classes. Instead, if the data points lie along very nonlinear subspaces and is clouded by noise, conventional unsupervised methods fail. In particular, such nonlinear subspaces are more likely in multidimensional real-world data.
Accordingly, it is desirable to provide a system and method of unsupervised learning that is particularly suited to identifying non-linear clusters of data in highly dimensional data, and thus suited for data exploration, such as database mining, or feature extraction, and similar applications.
SUMMARY OF THE INVENTION
The present invention overcomes the limitations of conventional unsupervised and supervised learning methods and systems by providing for unsupervised learning of non-linear subspaces in complex, high dimensional data. The present invention uses a hierarchical combination of clustering/vector quantization and data encoding based on proximity and connectedness of the data distribution. Vector quantization partitions the input data distribution into regions; data encoding expresses relationships between connected regions, and enables these relationships to influence further vector quantization until the number of connected regions in the data reaches a desired number.
The present invention accepts data that is organized as an unordered collection of vectors that is unlabeled with respect to some variable of interest, and outputs a clustering of such data into relatively disjoint clusters, even when there is considerable noise present in the data. The present invention constructs a hierarchical layering of clusters, with each layer successively building increasingly complex approximations to the data space. Each layer in the hierarchy takes its input from the previous layer (with the first receiving the data), learns a graph-based representation of its input, and re-encodes the input data using distance and proximity features of the graph into an encoded form, which is input into the next layer for processing. The repeated processing of the data results in the forming of increasingly larger (more data items) clusters which are not necessarily linear in the data space.
In a preferred implementation, the present invention uses three separate processes. First, a robust vector quantization process is used to distribute cluster centers throughout the input space. The vector quantization process provides a globally optimal distribution by positioning each cluster center as a function of the proximity of all of the data items being input, not merely those that are closest according to some distance metric.
Second, a topological graph construction process links up the cluster centers into a graph that approximates the topology of the input space. The graph models both the distribution of the cluster centers, and the density of the input vectors between the cluster centers.
Third, an encoding process encodes the input vectors using the topological graph, using path lengths along the graph and the rank-ordered distances from cluster centers to each input vector to generate an output code for each cluster. This encoded vector set is used as the input set on a subsequent pass through the data.
The present invention is particularly suited to large data mining operations in multidimensional real-world data. These include pattern recognition, data compression, and invariant feature extraction for representing natural data in symbolic form.


REFERENCES:
patent: 5590218 (1996-12-01), Ornstein
patent: 5649070 (1997-07-01), Connell et al.
patent: 5671293 (1997-09-01), Niki
patent: 5675710 (1997-10-01), Lewis
patent: 5796631 (1998-08-01), Iancu et al.
patent: 5802507 (1998-09-01), Gentric et al.
patent: 5809490 (1998-09-01), Guiver et al.
patent: 6047277 (2000-04-01), Parry et al.
patent: 6049793 (2000-04-01), Tomita
patent: 6100901 (2000-08-01), Mohda et al.
patent: 6104835 (2000-08-01), Han
M.A. Upal and E. Neufeld, “Comparison of Unsupervised Classifiers,” Proceedings of the First International Conference on Information, Statistics and Induction in Science, World Scientific, Melbourne, Australia, 1996, pp. 342-353.
Upal, Afzal, “Monte Carlo Comparison of Non-Hierarchical Unsupervised Classifiers”, MSc Thesis, Department of Computer Science, University of Saskatchewan, 1995.
Martinez, T. and Schulten, K., “Topology Representing Networks”,Neural Networks, vol. 7, No. 3, pp. 507-522, 1994.
Martinez, T.M., Berkovich, S.G., and Schulten, K.J., “Neural-Gas Network for Vector Quantization and its Application to Time-Series Prediction,”IEEE Transactions on Neural Networks, vol. 4, No. 5, pp. 558-569, Jul., 1993.
Sirosh, J., “The Batch Neural Gas Clustering Algorithm,”HNC Software, Inc., Feb. 7, 1997.
“DataBase Mining® Marksman User Guide,”HNC Software, Inc., pp. 6-14, Jun., 1998.

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Unsupervised identification of nonlinear data cluster in... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Unsupervised identification of nonlinear data cluster in..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Unsupervised identification of nonlinear data cluster in... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2531477

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.