Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-10-26
2001-07-31
Black, Thomas (Department: 2171)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000
Reexamination Certificate
active
06269376
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to systems and methods of data storage and retrieval, and in particular to a method and system for clustering data in a multiprocessor system.
2. Description of the Related Art
The ability to manage massive amounts of information in large scale databases has become of increasing importance in recent years. Increasingly, data analysts are faced with ever larger data sets, some of which measure in gigabytes or even terabytes. One way to increase the efficiency of the use of such databases is through the use of data mining. Data mining involves the process or processing masses of data to uncover patterns and relationships between data entries in the database. Data mining may be accomplished manually by slicing and dicing the data until a data pattern emerges, or it can be accomplished by data mining programs.
Clustering is a commonly used procedure in data mining algorithms. Practical applications of clustering include unsupervised classification and taxonomy generation, nearest neighbor searching, scientific discovery, vector quantization, text analysis, and navigation.
The k-means algorithm is a popular procedure for clustering data sets. This procedure assumes that the data “objects” to be clustered are available as points (or vectors) in a d-dimensional Euclidean space. The k-means algorithm seeks a minimum variance grouping of data that minimizes the sum of squared Euclidean distances from certain cluster centroids. The popularity of the k-means algorithm can be attributed to its relative ease of interpretation, implementation simplicity, scalability, convergence speed, adaptability to sparse data, and ease of out-of-core (out of the local memory of a single processor) implementation.
While the k-means algorithm is effective, it is no panacea for large databases like those of text documents and customer market data, which often include millions of data points. Applying the k-means algorithm in such cases can result in unacceptably long processing times and can exhaust the memory capacity of the processor implementing the algorithm. The use of non-volatile memory devices such as hard disks for virtual memory solves the memory problem, but at very high throughput cost. What is needed is a clustering algorithm and an apparatus for implementing that algorithm that allows for the rapid processing of large databases. The present invention satisfies that need.
SUMMARY OF THE INVENTION
To address the requirements described above, the present invention discloses a method, apparatus, article of manufacture, and a memory structure for detecting relationships in a database in parallel using a distributed-memory multi-processor system, thus allowing rapid and flexible k-means computation for data mining.
The method comprises the steps of dividing a set of data points into a plurality of data blocks, initializing a set of k global centroid values in each of the data blocks k initial global centroid values, performing a plurality of asynchronous processes on the data blocks, each asynchronous process assigning each data point in each data block to the closest global centroid value within each data block, computing a set of k block accumulation values from the data points assigned to the k global centroid values, and recomputing the k global centroid values from the k block accumulation values. The article of manufacture comprises a data storage device tangibly embodying instructions to perform the method steps described above.
The apparatus comprises a plurality of asynchronous processors, each associated with one of the plurality of data blocks and operating on the data points within the associated data blocks, each processor implementing a plurality of modules, including first module for initializing a set of k global centroid values to k initial global centroid values, a second module for assigning each data point in each data block to the closest global centroid value, a third module for computing a set of k block accumulation values from the data points assigned to the k global centroid values, and a fourth procedure for recomputing the global centroid values from the k block accumulation values from the plurality of data blocks.
The present invention also describes a memory for storing data for clustering a set of data points about k global centroid values. The memory is comprised of a plurality of local memories, each directly accessible by one of a plurality of processors intercoupled by a communication network. Each of the local memories comprises a data structure including a data block comprising a unique subset of the set of data points, block accumulation values and global centroid values.
REFERENCES:
patent: 5448727 (1995-09-01), Annevelink
patent: 5506801 (1996-04-01), Tawel
patent: 5519789 (1996-05-01), Etoh
patent: 6024018 (2000-02-01), Darel et al.
patent: 6047282 (2000-04-01), Wilson et al.
patent: 6049777 (2000-04-01), Sheena et al.
patent: 6049797 (2000-04-01), Guha et al.
patent: 6070159 (2000-05-01), Wilson et al.
patent: 6092072 (2000-07-01), Guha et al.
Ding et al., “Necessary conditions on minimal system configuration for general MISO Mamdani Fuzzy Systems as Universal Approximators”, Systems, Man and Cybernetics, Part B, IEEE Transactions on, vol. 30, Issue 6, pp. 857-864, Dec. 2000.*
Paliwal et al., “Comments on ‘modified K-means algorithm for vector quantizer design’”, Image Processing, IEEE Transactions on, vol. 9, Issue 11, pp. 1964-1967, Nov. 2000.*
Lee et al., “Morphology-based three-dimensional interpolation”, Medical Imaging, IEEE Transactions on, vol. 19, Issue 7, pp. 711-721, Jul. 2000.*
Berger, Marsha et al., “An Algorithm for Point Clustering and Grid Generation,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 21, No. 5, Sep./Oct. 1991, pp. 1278-1286.
Duda, Richard O. et al., “Pattern Classification and Scene Analysis,” John Wiley & Sons, New York, 1973, pp. 210-257.
Ester, Martin et al., “A Database Interface for Clustering in Large Spatial Databases,” Conference on Knowledge Discovery and Data Mining, KDD-95, AAAI Press, 1996, pp. 94-99.
Fisher, Douglas H., “Knowledge Acquisition Via Incremental Conceptual Clustering,” pp. 267-283 (originally published in Machine Learning, No. 2:139-172); 1987.
Fukunaga, K. et al., “A Branch and Bound Algorithm for Computing &kgr;-Nearest Neighbors,” IEEE Transactions on Computers, Jul. 1975, pp. 750-753.
Han, Eui-Hong (Sam) et al., “Clustering Based on Associated Rule Hypergraphs,” Proceedings of the Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD '97), Tucson, Arizona, 1997, pp. 9-13.
Huang, Zhexue, “A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining,” SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, SIGMOD-DMKD 1997, pp. 1-8.
Keim, Daniel A., “Enhancing Visual Clustering of Query-Dependent Database Visualization Techniques Using Screen-Filling Curves,” Proceedings of the Workshop on Database Issues for Data Visualization, Atlanta, Georgia, 1995, pp. 101-110.
Ketterlin, A. et al., “Conceptual Clustering in Structured Databases: a Practical Approach,” Proceedings of the First International Conference on Knowledge Discovery & Data Mining, KDD-95, AAAI Press, 1996, pp. 180-185.
Ng, Raymond T. et al., “Efficient and Effective Clustering Methods for Spatial Data Mining,” Proceedings of the 20thVLDB Conference, Santiago, Chile, 1994, pp. 144-155.
Pollard, David, “Quantization and the Method of &kgr;-Means,” IEEE Transactional on Information Theory, vol. IT-28, No. 2, Mar. 1982, pp. 199-205.
Pollard, David, “A Central Limit Theorem for &kgr;-Means Clustering,” The Annals of Probability, vol. 10, No. 4, 1982, pp. 919-926.
Ripley, Brian, “Pattern Recognition & Neural Networks,” Cambridge University Press, 1996, pp. 311-322.
Ralambondrainy, H., “A Conceptual Version of the &kgr;-means Algorithm,” Pattern Recognition Letters 16, 1995, pp. 1147-1157.
Selim, Shokri Z. et al., “&kgr;-Means-Type Algorithms: A Generalized Convergence Theorem and
Dhillon Inderjit Singh
Modha Dharmendra Shantilal
Black Thomas
Gates & Cooper LLP
International Business Machines - Corporation
Jung David
LandOfFree
Method and system for clustering data in parallel in a... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method and system for clustering data in parallel in a..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and system for clustering data in parallel in a... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2446578