Data processing: measuring, calibrating, or testing – Measurement system – Measured signal processing
Reexamination Certificate
2000-10-04
2003-06-24
Hoff, Marc S. (Department: 2857)
C706S050000, C382S240000
active
06584433
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to data clustering and more specifically to a method and system for clustering data by employing a K-Harmonic Means (KHM) center-based iterative algorithm.
BACKGROUND OF THE INVENTION
Data Clustering
Data clustering operates to group or partition a plurality of data points into a predetermined number of clusters or categories based on one or more attributes or features.
The efficiency of a clustering algorithm depends on several factors. First, the computation resources required to implement the clustering algorithm are an important consideration. It is generally desirable to reduce the time needed to generate results (often referred to as the convergence rate) and also to reduce the amount of computer resources needed to implement the clustering algorithm. For example, those clustering algorithms (e.g., partition clustering and agglomerative clustering) that are computationally intensive and yet provide only tolerable results have been generally abandoned in favor of those clustering algorithms that are less computationally intensive (e.g., center-based clustering algorithms, also known as density estimation clustering).
Second, the quality of the generated clusters or categories (often referred to as the convergence quality) is another important consideration. Ideally, there is one center point for each category or cluster. Unfortunately, the prior art methods often generate clusters or categories with more than one center. These centers are referred to as “trapped centers” (i.e., these centers are trapped by the local data, but actually belong to another cluster or category). Consequently, it would be desirable for there to be a mechanism to allow an easier escape of trapped centers.
Third, the sensitivity to center initialization is another important factor. Unfortunately, the prior art clustering methods are very dependent on the initialization information (i.e., the quality of the results varies widely for different initialization information). The initialization information is heavily dependent on the amount and quality of available prior information. As can be appreciated, in many instances, there is minimal or no prior information available. In fact, for many applications the clustering is performed specifically for the sake of obtaining this “prior information.”
As described herein below, poor initializations, and even initializations considered “good,” often result in trapped centers, thereby leading to poor or minimally tolerable results. Consequently, it would be desirable for there to be a mechanism for reducing the dependence of clustering results on the quality and quantity of prior knowledge.
There are many practical and useful applications that can utilize data clustering to improve results. Consequently, there is much interest in developing clustering algorithms or methods that efficiently and effectively cluster data.
Prior Art Data Clustering Methods
K-Means and Expectation Maximization (EM) are two prior art methods for data clustering. Unfortunately, both of these approaches are very sensitive to the initialization of the centers. The dependency of the K-Means performance on the initialization of the centers has been a major problem. A similar problem exists for EM.
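The sensitivity described above can be illustrated with a small, self-contained sketch (not taken from the patent; the data set, variable names, and parameter values are hypothetical). A plain Lloyd's-style K-Means is run twice on the same three-cluster data: once with one center seeded near each cluster, and once with two centers seeded inside the same cluster. In the second run, the two nearby centers remain trapped in one cluster while a single center is left to cover the other two.

```python
import numpy as np

def kmeans(X, centers, iters=50):
    """Plain Lloyd's K-Means: alternate hard nearest-center
    assignment with mean updates of each center."""
    centers = centers.copy()
    for _ in range(iters):
        # Squared distance from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(len(centers)):
            pts = X[labels == k]
            if len(pts):
                centers[k] = pts.mean(axis=0)
    return centers

# Three tight, well-separated 2-D clusters at (0,0), (5,5), (10,10).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.1, (50, 2)) for m in (0.0, 5.0, 10.0)])

# Good start: one center per cluster -- each stays in its cluster.
good = kmeans(X, np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]]))
# Poor start: two centers inside the first cluster -- they remain
# trapped there, and one center is shared by the other two clusters.
bad = kmeans(X, np.array([[0.0, 0.0], [0.2, 0.2], [7.5, 7.5]]))
```

Because the membership is hard (each point belongs entirely to its nearest center), neither trapped center ever receives a pull from the distant, under-served clusters.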
There have been numerous attempts to generate “good” initializations in order to address the sensitivity problem. Unfortunately, as illustrated in FIGS. 1A and 1B, “good” initializations often do not generate good clustering results.
In the following example, a K-Means clustering method is used to find 100 clusters of the BIRCH data set (from UC Irvine). The BIRCH data set is composed of 100 normally distributed local clusters, in a 10×10 grid, each having 1000 points. Two experiments are then conducted. The first experiment uses a random initialization, and the second experiment uses an initialization generated by the Furthest Point algorithm, which by itself is considered a clustering algorithm.
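For reference, the greedy max-min variant commonly meant by “Furthest Point” can be sketched as follows; the function name and details are illustrative and are not guaranteed to match the exact algorithm used in the experiment above.

```python
import numpy as np

def furthest_point_init(X, k, seed=0):
    """Greedy furthest-point initialization: start from an arbitrary
    point, then repeatedly add the point whose distance to its
    nearest already-chosen center is largest."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center.
        d = np.min(((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        centers.append(X[d.argmax()])
    return np.asarray(centers)
```

Because it maximizes spread, this initialization looks “good” at a glance, yet as the experiment shows, spread alone does not prevent trapped centers once K-Means begins iterating.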
In FIGS. 1A and 1B, the initial locations of the centers are denoted with “x”s, and the converged locations of the centers are denoted with dots. Both experiments are provided with 100 initial center positions. At first glance, the second initialization appears to provide a better result than the first one. However, upon closer inspection, there are exactly seven pairs of centers trapped by local densities in the local optima K-Means converged to under both initializations. These trapped center pairs are circled as shown. As is well known, the best convergence (or global optimum) should have exactly one center in each local cluster of the data. Consequently, from the point of view of the number of trapped centers, both of these approaches have similarly poor results even though different initializations are utilized.
Referring to FIGS. 1A and 1B, it is noted that these approaches to providing a “good” initialization often generate poor results. In FIG. 1A, a random initialization of the center points is utilized. In FIG. 1B, the center points are initialized by utilizing a Furthest Point Algorithm (FPA). The result is that there are seven pairs of centers trapped by local densities for both the random initialization and the FPA initialization. Consequently, at least for this set of data points, there is essentially no improvement in the clustering results even though a “good” initialization method is employed in FIG. 1B. This example illustrates that what constitutes a “good” initialization may not be well understood by those attempting to improve the clustering results of K-Means or EM by generating “good” initializations.
Accordingly, there remains a need for a method and system for data clustering that improves the convergence rate, that improves the convergence quality, that allows for an easier escape of trapped centers, that reduces the dependence of clustering results on center initialization, and that overcomes the disadvantages set forth previously.
SUMMARY OF THE INVENTION
It is an object of the present invention to provide a clustering method and system for reducing the dependency of clustering results on the initialization of centers.
It is yet another object of the present invention to provide a clustering method and system for improving the quality of the clustering results (i.e., the convergence quality) of the clustering.
It is a further object of the present invention to provide a clustering method and system for improving the convergence rate of the clustering.
It is another object of the present invention to provide a clustering method and system for distributing the association of the data points with centers to allow a continuous transition (instead of an abrupt transition) of a center from a first set of data points to a second set of data points.
It is yet another object of the present invention to provide a clustering method and system for reducing the strength of association of the data points in a cluster to allow an easier escape of trapped centers.
A harmonic average data clustering method and system are provided. First, a plurality of data points for clustering is received. Next, a number K of clusters is also received. Then, K center points are initialized. For each center point, a new center position is then determined by utilizing a K-Harmonic Means performance function.
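The update step can be sketched as follows for the p = 2 form of the performance function (the harmonic average of squared point-to-center distances). The weighting formula follows the published K-Harmonic Means literature and is illustrative rather than a verbatim statement of the claims; function and variable names are hypothetical.

```python
import numpy as np

def khm_step(X, C, eps=1e-8):
    """One K-Harmonic Means center update (p = 2). Every point
    contributes a soft weight to every center, derived from the
    harmonic average of its squared distances to all centers."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2) + eps
    inv = 1.0 / d2                          # 1 / d_ik^2
    denom = inv.sum(axis=1, keepdims=True)  # sum_j 1 / d_ij^2
    q = (inv ** 2) / (denom ** 2)           # weight of point i on center k
    # New centers: weighted means of all points.
    return (q.T @ X) / q.sum(axis=0)[:, None]

def khm_performance(X, C, eps=1e-8):
    """K-Harmonic Means performance function: the sum over points of
    K divided by the sum of inverse squared distances to the centers."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2) + eps
    return (C.shape[0] / (1.0 / d2).sum(axis=1)).sum()
```

Iterating `khm_step` until `khm_performance` stops decreasing yields the converged centers. Because every point exerts a (small but nonzero) pull on every center, a trapped center is drawn out of a local density gradually, rather than requiring the abrupt wholesale reassignment that hard-membership K-Means depends on.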
REFERENCES:
patent: 4558350 (1985-12-01), Murakami
patent: 5208763 (1993-05-01), Hong et al.
patent: 5327521 (1994-07-01), Savic et al.
patent: 5832182 (1998-11-01), Zhang et al.
patent: 5943661 (1999-08-01), Katz
patent: 6100901 (2000-08-01), Mohda et al.
patent: 6115708 (2000-09-01), Fayyad et al.
patent: 6285780 (2001-09-01), Yamakita et al.
patent: 6336082 (2002-01-01), Nguyen et al.
patent: 6374251 (2002-04-01), Fayyad et al.
patent: 6421467 (2002-07-01), Mitra
Dayal Umeshwar
Hsu Meichun
Zhang Bin
Charioui Mohamed
Hewlett-Packard Development Company LP
Hoff Marc S.