Computer implemented scalable, incremental and parallel...

Data processing: measuring – calibrating – or testing – Measurement system – Statistical measurement


Details

C707S793000

Reexamination Certificate

active

06684177

ABSTRACT:

FIELD OF THE INVENTION
The present invention is directed toward the field of computer implemented clustering techniques, and more particularly toward methods and apparatus for divide and conquer clustering.
BACKGROUND
In general, clustering is the problem of grouping data objects into categories such that members of a category are similar in some interesting way. The field of clustering spans numerous application areas, including data mining, data compression, pattern recognition, and machine learning. More recently, with the explosion of the Internet and of information technology, “data stream” processing has also required the application of clustering. A “data stream” is an ordered sequence of data points that can only be read once or a small number of times. Applications producing data streams include customer clicks on a web site, telephone records, multimedia data, and web page retrievals; the resulting data sets are too large to fit in a computer's main memory and must be stored externally before clustering can be applied.
The computational complexity of the clustering problem is well understood. Clustering is “NP-hard,” meaning an efficient optimal clustering algorithm is unlikely to exist. Conventional clustering methods thus seek approximate solutions.
In general, conventional clustering techniques are not designed to work with massively large and dynamic datasets and thus do not operate well in contexts such as data mining and data stream processing. Most computer implemented clustering methods focus on reducing computational complexity and often require multiple passes through the entire dataset. Thus, if the dataset is too large to fit in a computer's main memory, the computer must repeatedly swap the dataset in and out of main memory (i.e., the computer must repeatedly access an external data source, such as a hard disk drive). Furthermore, for data stream applications, since the data exceeds the amount of main memory available, clustering techniques should not have to track or remember the data that has already been scanned. The analysis of the clustering problem in the prior art has largely focused on computational complexity, not on reducing the level of requisite input/output (I/O) activity. When such methods are implemented in a computer, there is a significant difference (often by a factor of 10^6) in access time between internal main memory and external memory, such as a hard disk drive. As a result, the performance bottleneck of clustering techniques that operate on massively large datasets is often the I/O latency rather than the processing time (i.e., the CPU time).
The I/O efficiency of clustering techniques under different definitions of clustering has also been studied. Some techniques represent the dataset in a compressed fashion based on how important a point is from a clustering perspective. For example, one conventional technique stores the most important points in main memory, compresses those that are less important, and discards the remaining points. Another common conventional technique for handling large datasets is sampling. For example, one technique illustrates how large a sample is needed to ensure that, with high probability, the sample contains at least a certain fraction of points from each cluster. Sampling-based techniques apply a clustering technique to the sample points only. Other techniques compress the dataset in unique ways. One technique, popularly known as Birch, constructs a tree that summarizes the dataset: the dataset is broken into subclusters, and each subcluster is represented by the mean (or average value) of its data. The union of these means is the compressed dataset. However, Birch requires many parameters regarding the data that must be provided by a knowledgeable user, and it is sensitive to the order of the data. Generally speaking, these approaches make no guarantees regarding the quality of the clustering.
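For concreteness, a minimal sketch of the mean-based summarization just described (the function name and input layout are illustrative assumptions, not Birch's actual interface, and the tree construction Birch uses to form subclusters is omitted):

```python
def compress_by_means(subclusters):
    """Summarize a dataset by replacing each subcluster with the mean of
    its points; the union of these means is the compressed dataset."""
    compressed = []
    for sub in subclusters:
        dim = len(sub[0])
        mean = tuple(sum(p[d] for p in sub) / len(sub) for d in range(dim))
        compressed.append(mean)
    return compressed

# Example: two subclusters of 2-D points collapse to two mean points.
print(compress_by_means([[(0, 0), (2, 0)], [(10, 10), (10, 12)]]))
# -> [(1.0, 0.0), (10.0, 11.0)]
```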
Clustering has many different definitions of quality, and for each definition a myriad of techniques exist to solve or approximately solve it. One definition of clustering quality is the so-called “k-median” definition: find k centers in a set of n points so as to minimize the sum of distances from data points to their closest cluster centers. A popular variant of k-median finds centers that minimize the sum of the squared distances from each point to its nearest center. “k-center” clustering is defined as minimizing the maximum diameter of any cluster, where the diameter is the distance between the two farthest points within a cluster. Most techniques for implementing k-median and similar clustering have large space requirements and involve random access to the input data.
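These definitions translate directly into code. The following is a minimal sketch of the three objectives just described (function names are illustrative):

```python
import math

def kmedian_cost(points, centers):
    """k-median objective: the sum over all points of the distance to
    the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)

def kmedian_squared_cost(points, centers):
    """Popular variant: the sum of squared distances from each point to
    its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

def kcenter_diameter(clusters):
    """k-center objective: the maximum diameter of any cluster, where a
    cluster's diameter is the distance between its two farthest points."""
    return max(max(math.dist(a, b) for a in c for b in c) for c in clusters)
```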
Accordingly, it is desirable to develop a clustering technique that offers quality-of-clustering guarantees and operates on massively large datasets efficiently when implemented in a computer.
SUMMARY
What is disclosed is a technique that uses a weighted divide and conquer approach for clustering a set S of n data points to find k final centers. The technique comprises: 1) partitioning the set S into P disjoint pieces S_1, . . . , S_P; 2) for each piece S_i, determining a set D_i of k intermediate centers; 3) assigning each data point in each piece S_i to the nearest one of the k intermediate centers; 4) weighting each of the k intermediate centers in each set D_i by w_i, the number of points in the corresponding piece S_i assigned to that center; and 5) clustering the weighted intermediate centers together to find said k final centers, the clustering performed using a specific quality metric and a clustering method A.
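A compact sketch of these five steps follows. This is an illustration of the disclosed flow, not the patented implementation; `cluster` is a hypothetical stand-in for the clustering method A (for example, any routine that returns k centers for a weighted point set under the chosen quality metric), and its signature is an assumption.

```python
import math

def divide_and_conquer_cluster(S, k, P, cluster):
    """Weighted divide-and-conquer clustering of a list of points S.

    `cluster(points, k, weights)` stands in for the clustering method A
    (hypothetical signature): it returns k centers for a weighted point set.
    """
    n = len(S)
    # Step 1: partition S into P disjoint pieces S_1, ..., S_P.
    pieces = [S[i * n // P:(i + 1) * n // P] for i in range(P)]

    weighted = []  # (center, weight) pairs pooled across all pieces
    for S_i in pieces:
        # Step 2: determine a set D_i of k intermediate centers.
        D_i = cluster(S_i, k, [1] * len(S_i))
        # Steps 3-4: assign each point in S_i to its nearest intermediate
        # center and weight each center by the number of points assigned.
        counts = [0] * len(D_i)
        for p in S_i:
            nearest = min(range(len(D_i)), key=lambda j: math.dist(p, D_i[j]))
            counts[nearest] += 1
        weighted.extend(zip(D_i, counts))

    # Step 5: cluster the weighted intermediate centers together to obtain
    # the k final centers.
    centers = [c for c, _ in weighted]
    weights = [w for _, w in weighted]
    return cluster(centers, k, weights)
```

Note that because each piece is summarized by only k weighted centers, step 5 clusters at most P·k weighted points regardless of n.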
In some embodiments of the invention, P is chosen such that each piece S_i obeys a constraint |S_i| < M, where M is the size of a memory or portion thereof. In other embodiments, weighted intermediate centers are found for each piece S_i in a parallel fashion. In some embodiments, sets of data points can be incrementally added to already clustered data sets by finding weighted intermediate centers for the incremental set, and then clustering those weighted intermediate centers with the weighted intermediate centers found from the previously clustered data sets.
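The incremental embodiment can be sketched in the same hedged style: summarize the new batch into weighted intermediate centers, pool them with the weighted centers retained from earlier data, and recluster. As before, `cluster` is a hypothetical stand-in for method A and all names are illustrative.

```python
import math

def summarize(points, k, cluster):
    """Reduce a batch of points to k (center, weight) pairs, where each
    weight is the number of points assigned to that center."""
    D = cluster(points, k, [1] * len(points))
    counts = [0] * len(D)
    for p in points:
        counts[min(range(len(D)), key=lambda j: math.dist(p, D[j]))] += 1
    return list(zip(D, counts))

def incremental_cluster(saved, new_batch, k, cluster):
    """Fold a new batch into an already clustered data set by reclustering
    the pooled weighted summaries; returns (updated summaries, k centers)."""
    saved = saved + summarize(new_batch, k, cluster)
    centers = [c for c, _ in saved]
    weights = [w for _, w in saved]
    return saved, cluster(centers, k, weights)
```

Since only the (center, weight) summaries are retained between batches, the technique never needs to remember the raw data it has already scanned, which is what makes it suitable for data streams.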


REFERENCES:
patent: 5187751 (1993-02-01), Tanaka
patent: 5329596 (1994-07-01), Sakou et al.
patent: 6003029 (1999-12-01), Agrawal et al.
patent: 6003036 (1999-12-01), Martin
patent: 6021383 (2000-02-01), Domany et al.
patent: 6115708 (2000-09-01), Fayyad et al.
patent: 6236978 (2001-05-01), Tuzhilin
patent: 6397166 (2002-05-01), Leung et al.
patent: 2002/0052692 (2002-05-01), Fahy
patent: 2002/0099702 (2002-07-01), Oddo
patent: 1191459 (2002-03-01), None
