Scalable system for K-means clustering of large databases

Data processing: database and file management or data structures – Database design – Data structure types

Patent

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

707 3, 707 4, 707 5, 707 2, G06F 1700

Patent

active

060120584

ABSTRACT:
In one exemplary embodiment the invention provides a data mining system for use in evaluating data in a database. Before the data evaulation begins a choice is made of a cluster number K for use in categorizing the data in the database into K different clusters and initial guesses at the means, or centriods, of each cluster are provided. Then a portion of the data in the database is read from a storage medium and brought into a rapid access memory. Data contained in the data portion is used to update the original guesses at the centroids of each of the K clusters. Some of the data belonging to a cluster is summarized or compressed and stored as a summarization of the data. More data is accessed from the database and assigned to a cluster. An updated mean for the clusters is determined from the summarized data and the newly acquired data. A stopping criteria is evaluated to determine if further data should be accessed from the database. If further data is needed to characterize the clusters, more data is gathered from the database and used in combination with already compressed data until the stopping criteria has been met.

REFERENCES:
patent: 5819298 (1998-10-01), Wong et al.
patent: 5832182 (1998-11-01), Zhang et al.
patent: 5857179 (1999-01-01), Vaithyanathan et al.
patent: 5890169 (1999-03-01), Wong et al.
M. R. Anderberg, "Cluster Analysis for Applications" pp. 162-163 Academic Press, New York. 1973.
J. Banfield and A. Raftery, "Model-based Gaussian and Non-Gaussian Clustering", Biometrics, vol. 49: 803-821, pp. 15-34, 1993.
R. Brachman, T. Khabaza, W. Kloesgen, G. Piatetsky-Shapiro, and E. Simoudis, "Mining Business Databases" Communications of ACM 39(11). 1996.
P.S. Bradley, O.L. Managasarian, and W.N. Street. 1997. "Clustering via Concave Minimization", in Advances in Neural Information Processing Systems, 9, M.C. Mozer, M.I. Jordan, and T. Petsche (Eds.) pp. 368-374, MIT Press, 1997.
P. Cheeseman and J. Stutz, "Bayesian Classification (AutoClass): Theory and Results", in Advances in Knowledge Discovery and Data Mining, Fayyad, U.,G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.) pp. 153-180, MIT Press, 1996.
U. Fayyad, S.G. Djorgovski and N. Weir, "Automating the Analysis and Cataloging of Sky Surveys", Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Shapiro, P. Smyth and R. Uthurusamy (Eds.), pp. 471-493, Menlo Park, California: AAAI Press/The MIT Press, 1996.
D. fisher, "Knowledge Acquisition via Incremental Conceptual Clustering". Machine Learning, 2:139-172, 1987.
E. Forgy, "Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classifications", biometrics 21:768. 1965.
C. Glymour, D. Madigan, D. Pregibon, and P. Smyth. 1997. "Statistical Themes and Lessons for Data Mining", Data Mining and Knowledge Discovery, vol. 1, No. 1.
Jones, "A Note on Sampling From a Tape File", Communications of the ACM, vol.. 5, 1962.
M. Meila and D. Heckerman, 1998. "An Experimental Comparison of Several Clustering Methods", Microsoft Research Technical Report MSR-TR-98-06, Redmond, WA.
R. NG and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining", Proc. of VLDB-94, 1994.
D. Pregibon and J. Elder, "A Statistical Perspective on Knowledge Discovery in Databases", in Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Shapiro, P. Smyth and R. Uthurusamy (Eds.) pp. 83-116. MIT Press, 1996.
S.Z. Selim and M.A. Ismail, "K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality." IEEE Trans. On pattern Analysis and Machine Inteligence, vol. PAMI-6 No. 1, 1984.
T. Zhang, R. Ramakrishnan and M. Livny, "Birch: A New Clustering Algorithm and Its Applications." Data Mining and Knowledge Discovery 1(2). 1997.
C. M. Bishop. "Neural Networks for Pattern Recognition".Bayes Theorem. Clarendon Press.Oxford pp. 17-23 (1995).
C.M. Bishop. "Neural Networks For Pattern Recognition." The Normal Distribution. Clarendon Press.Oxford. pp. 34-38 (1995).
C.M. Bishop. "Neural Networks For Pattern Recognition." Maximum Likihood. Clarendon Press. Oxford pp. 39-42 (1995).
C.M. Bishop. "Neural Networks For Pattern Recognition." Density Estimation in General. Clarendon Press. Oxford pp. 51-55 (1995).
C. M. Bishop. "Neural Networks for Pattern Recognition." Mixture Models/Maximum Likelihood/EM Algorithm. Clarendon Press.Oxford pp. 59-72 (1995).
R. Duda and P. Hart. "Pattern Classification and Scene Analysis." Bayes Decision Theory. John Wiley & Sons pp. 10-13 (1973).
R. Duda and P. Hart. "Pattern Classification and Scene Analysis." The Normal Density. John Wiley & Sons. pp. 22-24.
R. Duda and P. Hart. "Pattern Classification and Scene Analysis." Maximum Likelihood Estimation: John Wiley & Sons pp. 45-49 (1973.
R. Duda and P. Hart. "Pattern Classification nd Scene Analysis." Sufficient Statistics and The Exponential Family. pp. 62-66 John Wiley & Sons (1973).
R. Duda and P. Hart. "Pattern Classification and Scene Analysis." Density Estimation. John Wiley & Sons Chap. 4, pp. 85-88 (1973).
R. Duda and P. Hart. "Pattern Classification and Scene Analysis." Unsupervised Learning and Clustering. John Wiley & Sons. Chap. 6 pp. 189-200 (1973).
R. Duda and P. Hart. "Pattern Classification and Scene Analysis." Clustering Criteria (K-Mean): John Wiley & Sons Chap. 6 pp. 217-219 (1973).
R. Duda and P. Hart. "Pattern Classificationa nd Scene Analysis," Iterative Optimization. (relates to K-Mean/EM) John Wiley & Sons Chap. 6 pp. 225-228 (1973).
K. Fukunaga. "Statistical Pattern Recognition". Bayes Theorem Academic Press Chap. 1 pp. 12-13 (1990).
K. Fukanaga. "Statistical Pattern Recognition." Normal Distributions. Academic Press. Chap. 2 pp. 16-24 (1990).
K. Fukanaga. "Statistical Pattern Recognition." Clustering Academic Press. Chap. 11 pp. 508-512 (1990).
R. Duda and P. Hart. "Pattern Classification and Scene Analysis." Nearest Mean Reclassification Algorithm (k-Mean): Chap.11 pp. 515-523.Academic Press. (1990).
K. Fukunaga. "Statistical Pattern Recognition". Maximum Likelihood. Academic Press Chap. 11 pp. 527-532 (1990).
A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B, 39(1): 1-38, 1977.
M. Ester, H. Kriegel, X. Xu, "A Database Interface for Clustering in Large Spatial Databases", Proc. First International Conference on Knowledge Discovery and Data Mining KDD-95 AAAI Press, 1995.
U. Fayyad, D. Haussler, and P. Stolorz. "Mining Science Data". Communications of the ACM 39(11), 1996.

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Scalable system for K-means clustering of large databases does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Scalable system for K-means clustering of large databases, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Scalable system for K-means clustering of large databases will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-1080462

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.