Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1999-09-13
2001-07-10
Alam, Hosain T. (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000, C707S793000, C707S793000
Reexamination Certificate
active
06260038
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to the field of data processing. More particularly, the present invention relates to a method and apparatus for clustering data points in a data set having mixed attributes or features.
2. Description of the Related Art
Conventional data clustering techniques perform well when all of the data points of a data set contain the same type of attributes or features. That is, all data points of the data set have only one type of attribute, such as categorical, binary or real data (numeric, continuous) attributes. Conventional data clustering techniques, such as the k-medians, k-prototype and k-means algorithms, breakdown when a data set has mixed-mode attributes, such as attributes that are a combination of categorical, binary and/or real data attributes.
The k-medians clustering algorithm is designed for clustering data having categorical attributes, but not for data having mixed attributes. The k-prototype algorithm does not handle mixed attributes directly, but uses a tuneable parameter for combining attribute types. Nevertheless, the k-prototype algorithm produces less than optimal results than for data having purely categorical attributes.
H. Ralambondarainy discloses a data clustering technique for converting data having categorical attributes to 1-of-p representations that are then combined with data of the same data set having real attributes. The combined 1-of-p representations and real attributes are used directly in a clustering algorithm, such as the k-means algorithm.
The other conventional techniques for clustering categorical attributes are either hierarchical algorithms or conceptual clustering algorithms. The hierarchical algorithms are 0(n
2
), where n is the number of data points, and, consequently, are too computationally intensive for large data sets. The conceptual clustering algorithms are not particularly useful for numeric attributes, particularly when the data is noisy.
What is needed is an efficient way for clustering data points having mixed attributes whether the attributes are categorical, binary and/or real data attributes.
SUMMARY OF THE INVENTION
The present invention provides an efficient way for clustering data points having mixed attributes whether the attributes are categorical, binary and/or real data attributes. The advantages of the present invention are provided by a method for clustering data points in a data set that is arranged as a matrix having n objects and m attributes. Each categorical attribute of the data set is converted to a 1-of-p representation of the categorical attribute. A converted data set A is formed based on the data set and the 1-of-p representation for each categorical attribute.
The converted data set A is compressed using, for example, a Goal Directed Projection (GDP) compression technique or a Singular Value Decomposition (SVD) compression technique, to obtain q basis vectors, with q being defined to be at least m+1. The transformed data set is projected onto the q basis vectors to form a data matrix, with each vector having q dimensions. Lastly, a clustering technique is performed on the data matrix having vectors having q dimensions.
According to the invention, the step of compressing the converted data set preferably includes the steps of partitioning the objects of the converted data set, or a sample of the objects of the converted data set, into k cohesive groups using a k-means clustering technique or an expectation maximization (EM) technique, with k being less than m, and computing a distance from each object to a centroid of each group. Accordingly, when the k-means technique is used, the distance measure can be defined to be, for example, a Euclidean metric, a city-block metric or a cosine similarity metric. Further, the distance from each data point to the centroid of each group can use a different distance measure for each prototype vector.
The present invention also provides that the step of compressing the converted data set can include subtracting a mean of the converted data set from each object of the converted data set, partitioning the objects into k cohesive groups, with k being less than m, and computing a distance from each object to a centroid of each group.
When the data set includes at least one attribute having a categorical portion and a corresponding real portion, the step of converting each categorical attribute includes separating the categorical portion of each respective attribute from the corresponding real portion of the attribute, and converting the categorical portion of each attribute to a 1-of-p representation. For this situation, the step of compressing the converted data set A compresses the converted categorical portion of each attribute. Then, each vector having q dimensions is combined with the associated real portion of each attribute before the transformed data set is projected. Consequently, the step of performing the clustering technique is performed on a data matrix resulting from the combination of each vector having q dimensions with the corresponding real portion of each attribute.
In the situation when the data set includes at least one attribute having a categorical portion and a corresponding real portion, the step of converting each categorical attribute includes separating the categorical portion of each respective attribute from the corresponding real portion of the attribute, converting the categorical portion of each respective attribute to a 1-of-p representation, and combining the 1-of-p representation of the categorical portion and the corresponding real portion of each respective attribute. In this situation, the combined categorical Iof-p representation of the categorical portion and the corresponding real portion of each respective attribute are compressed.
REFERENCES:
patent: 5271097 (1993-12-01), Barker et al.
patent: 5448727 (1995-09-01), Annevelink
patent: 5471567 (1995-11-01), Soderberg et al.
patent: 5983220 (1999-11-01), Schmitt
patent: 6012058 (2000-01-01), Fayyad et al.
patent: 6029195 (2000-02-01), Herz
patent: 6032146 (2000-02-01), Chadha et al.
patent: 6038574 (2000-03-01), Pitkow et al.
patent: 6049797 (2000-04-01), Guha et al.
patent: 6115708 (2000-09-01), Fayyad et al.
Chaudhuri et al., “A novel multiseed nonhierarchical data clustring technique”, IEEE, vol. 27, No. 5, pp. 871-877, Oct. 1997.*
Burd et al., “Investigating component based maintenance and the effect of software evolution: a reegineering approach using data clustering”, IEEE, pp. 199-206, Jan. 1997.*
Liu et al., “Feature selection via discretization”, IEEE, vol. 9, No. 4, pp. 642-645, Jul. 1997.*
J. C. Gower, “A General Coefficient of Similarity and Some of its Properties,” Biometrics 27, 857-874, Dec. 1971.
H. Ralambondrainy, “A conceptual version of the K-means algorithm, ” Pattern Recognition Letters 16, 1995 pp. 1147-1157.
M. Berger et al., “An Algorithm for Point Clustering and Grid Generation,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 21, No. 5, 1991, pp. 1278-1286.
D.H. Fisher, “Knowledge Acquistion Via Incremental Conceptual Clustering,” Jul. 4, 1987, pp. 267-283.
J. MacQueen, “Some Method for Classification and Analysis of Multivariate Observations,” pp. 281-297.
R.S. Michalski, “Chapter 4: A Theory and Methodology of Inductive Learning,” Maching Learning: An Artificial Intelligent Approach, Springer, New York,. pp. 83-134.
Duda and Hart, “Pattern Classification and Scene Analysis,” New York: Wiley, 1973, pp. 210-257.
M. Ester et al, “A Database Interface for Clustering in Large Spatial Databases,” Conference on Knowledge Disclovery and Data, pp. 94-99.
S.Z. Selim et al., “K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality,”, IEEE Transactios on Pattern Analysis and Machine Intelligence, vol. PAM1-6, No. 1, Jan. 1984, pp. 81-87.
Z. Huang, “A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining,” pp/ 1-8.
Eui-Hong Han et al., “Clustering Based on Association Rule Hypergraphs,” pp. 9
Martin David C.
Modha Dharmendra Shantilal
Vaithyanathan Shivakumar
Alam Hosain T.
Banner & Witcoff , Ltd.
Corrielus Jean M
International Businemss Machines Corporation
Tran, Esq. Khanh Q.
LandOfFree
Clustering mixed attribute patterns does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Clustering mixed attribute patterns, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Clustering mixed attribute patterns will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2442876