Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
Filed: 2000-09-05
Issued: 2003-10-28
Examiner: Mizrahi, Diane D. (Department: 2175)
U.S. Classification: C707S793000
Status: active
Patent Number: 06640227
ABSTRACT:
CROSS-REFERENCE TO RELATED APPLICATIONS
Not Applicable
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
Not Applicable
REFERENCE TO A MICROFICHE APPENDIX
Not Applicable
BACKGROUND OF THE INVENTION
The present invention relates to methods for automated computerized clustering of data in any field of science and technology dealing with data processing.
Data clustering is a quest for order in the chaos of data associated with objects under study; it is meant to locate associative and genetic links between the objects. In data mining processes, clustering helps to discover the distribution and patterns in the underlying data, thus increasing the practical and scientific value of a dataset. Most clustering techniques consist in arraying the data so as to position the data points in the order of their compliance with a certain criterion function of a search, and then joining them into clusters based on their scores of compliance with that criterion function. Computer-implemented clustering methods work in an automated mode based on a pre-set search criterion function.
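To illustrate the general pattern just described (ordering data points by a pre-set criterion function and then joining them by score), the following is a minimal sketch; the identity criterion function and the joining threshold used here are assumptions for illustration only and are not prescribed by the patent.

    def cluster_by_criterion(points, criterion, join_threshold):
        """Order data points by a pre-set criterion function, then join
        consecutive points into one cluster while their scores stay within
        `join_threshold` of each other (an assumed joining rule)."""
        scored = sorted(((criterion(p), p) for p in points), key=lambda t: t[0])
        clusters, current = [], [scored[0][1]]
        for (prev_s, _), (s, p) in zip(scored, scored[1:]):
            if abs(s - prev_s) <= join_threshold:
                current.append(p)
            else:
                clusters.append(current)
                current = [p]
        clusters.append(current)
        return clusters

    # Example: one-dimensional measurements clustered by their own value.
    print(cluster_by_criterion([0.1, 0.15, 0.9, 1.0, 0.12], lambda x: x, 0.1))
    # -> [[0.1, 0.12, 0.15], [0.9, 1.0]]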
In most cases, a set of data associated with objects under study can be presented in the form of a similarity matrix, which makes the original mass of data ready for comparative analysis and clustering based on calculated scores of similarity between objects, usually normalized within a range from 0 to 1 or from 0 to 100%. Entries within a similarity matrix are then rearranged based on the highest possible proximity of their quantitative characters, thus forming clusters and subclusters.
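As a hedged illustration of how such a normalized similarity matrix might be computed, the sketch below scores objects described by numeric feature vectors; the Euclidean-distance-based similarity is only one of many possible choices and is assumed here purely for illustration.

    import numpy as np

    def similarity_matrix(features):
        """Return an n-by-n matrix of pairwise similarities normalized to [0, 1].

        `features` is an (n, d) array: one row of d quantitative characters per
        object. Similarity is derived from Euclidean distance (an illustrative
        choice); identical objects score 1.0, the most dissimilar pair 0.0."""
        features = np.asarray(features, dtype=float)
        # Pairwise Euclidean distances between all rows.
        diffs = features[:, None, :] - features[None, :, :]
        dist = np.sqrt((diffs ** 2).sum(axis=-1))
        # Convert distance to similarity and normalize into the range 0..1.
        max_dist = dist.max()
        return 1.0 - dist / max_dist if max_dist > 0 else np.ones_like(dist)

    # Example: four objects with two measured characters each.
    S = similarity_matrix([[1.0, 2.0], [1.1, 2.1], [5.0, 9.0], [5.2, 8.8]])
    print(np.round(S, 2))  # entries near 1 mark candidate cluster members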
FIG. 1 shows a similarity matrix where S(x,y) is the score of similarity between objects x and y computed by any known technique. Even in cases when all n components of a matrix belong to a closely bound “natural” association of objects, such as, for example, a group of enterprises producing a certain type of goods by a certain type of technology, or a group of taxonomically close microorganisms, or fish communities of different sections of a river, etc., a similarity matrix per se presents merely a tray of binary similarity cells (BS-cells) arranged according to man-made rules, while the true informational interrelation between the objects oftentimes remains unknown. In order to uncover the underlying infrastructure of informational relationships between the objects, a similarity matrix needs to be analyzed with the use of special techniques, in particular, clustering. A traditional approach to similarity matrix analysis implies that each of the 0.5(n²−n) binary similarity values is checked out and added to a pool of binary similarities of a certain group or subgroup.
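The traditional pass over the 0.5(n²−n) binary similarity cells described above can be sketched as follows; the threshold value and the pooling rule are assumptions made for illustration, not details taken from the patent.

    from itertools import combinations

    def pool_binary_similarities(S, threshold=0.5):
        """Scan the 0.5*(n**2 - n) unique binary similarity (BS) cells of an
        n-by-n similarity matrix S and pool the object pairs whose similarity
        exceeds an assumed threshold."""
        n = len(S)
        pooled = []
        for x, y in combinations(range(n), 2):   # exactly n*(n-1)/2 cells
            if S[x][y] >= threshold:
                pooled.append((x, y, S[x][y]))
        return pooled

    # With a matrix like the one in FIG. 1, this visits every cell above the
    # diagonal exactly once and groups the high-similarity pairs.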
However, it is easy to prove that the following holds for the matrix shown in FIG. 1: if S(A,B) ≈ 1, then S(A,C)/S(B,C) ≈ … ≈ S(A,n)/S(B,n), i.e. the 2n−4 cells, too, contain important and mostly non-additive information, in a non-linear form, on each of the binary similarity cells. Simply put, two similarity coefficients, S(A,C) and S(B,C), hide information about S(A,B) which is not retrievable by the heretofore utilized methods for data mining. No generalized approach to solving the problem of extracting this hidden non-linear information has heretofore been proposed.
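A small numeric check of the relation above, using made-up similarity scores purely for illustration: when two objects A and B are nearly identical (S(A,B) ≈ 1), their similarities to every other object are nearly proportional, so the ratios S(A,C)/S(B,C), …, S(A,n)/S(B,n) all stay close to 1, which is the “hidden” information referred to above.

    # Hypothetical similarity scores (illustrative only, not from the patent).
    S = {
        ("A", "B"): 0.97,           # A and B are nearly identical
        ("A", "C"): 0.62, ("B", "C"): 0.60,
        ("A", "D"): 0.35, ("B", "D"): 0.34,
    }

    # The ratios S(A,x)/S(B,x) for every third object x cluster near 1, so
    # S(A,C) and S(B,C) together "hide" information about S(A,B).
    for x in ("C", "D"):
        ratio = S[("A", x)] / S[("B", x)]
        print(f"S(A,{x})/S(B,{x}) = {ratio:.3f}")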
All heretofore existing methods for data clustering have been based on processing a similarity matrix either in a stand-by (static) state, as, for example, in statistical methods of BS-cell processing in search of object correlations, or in its quasi-dynamic state, when BS-cells are purposefully rearranged based either on simple criteria of obvious proximity or according to more elaborate rules of data-point sorting. In both types of matrix processing, the BS-cells remain intact or are merely subjected to a parallel transfer to a different coordinate system.
The following is an example of data clustering based on analysis of a similarity matrix in its static state. In “Using Mean Similarity Dendrograms to Evaluate Classifications” (Van Sickle, 1997), similarity analysis is based on comparing the mean similarities between objects within the same class to the mean similarities between objects of different classes. The author analyzes the similarity of a river's various sites based on fish species presence/absence, using the numbers of species common to two sites and the numbers of species unique to each site, drawn from three surveys conducted in 1944, 1983 and 1992 on fish assemblages along a 281-km stretch of the main stem of the Willamette River, Oregon. Earlier, other researchers (Hughes et al., 1987) had hypothesized that four contiguous sections of the river's 281-km stretch, beginning at the river mouth, which could be distinguished based on channel depth and mapped channel gradients, would effectively correlate with the specifics of the river's fish communities. The river's four sections were identified as follows:
1) Section A, with 7 sites (a freshwater tidal section),
2) Section B, with 4 sites (a flat pool section),
3) Section C, with 4 sites (a section with low map gradient), and
4) Section D, with 4 sites (a shallow, upper section with higher map gradient).
FIG. 2A presents the similarity matrix of the data on fish assemblages sampled at the Willamette River sites, as given in the above-referenced work by Van Sickle (1997). As is seen, it does not quite corroborate the hypothesis of Hughes et al. (1987): the scores of similarity between different sites within the river's Section D vary within the range of 47 to 67%; within Section C, from 62 to 82%; within Sections B and A, 50-75% and 27-86%, respectively. The ranges of variation of similarity between the sections do not yield any convincing information either: e.g., the score of similarity between Sections A and D varies within the range of 6.7% to 40.0%, i.e. changing almost six-fold.
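A site-to-site similarity of the kind used in the Van Sickle (1997) study is scored from the numbers of fish species common to two sites and unique to each; a coefficient of that general form can be sketched as below. The specific Jaccard formula and the species names are assumptions for illustration and are not necessarily the index or data used in the cited work.

    def presence_absence_similarity(site_a, site_b):
        """Similarity of two sites from species presence/absence lists.

        Uses the Jaccard coefficient, shared / (shared + unique_a + unique_b),
        as one plausible index of this kind (assumed for illustration).
        Returns a percentage in the range 0..100."""
        a, b = set(site_a), set(site_b)
        shared = len(a & b)
        unique_a = len(a - b)
        unique_b = len(b - a)
        total = shared + unique_a + unique_b
        return 100.0 * shared / total if total else 0.0

    # Hypothetical species lists for two river sites (illustrative only).
    print(presence_absence_similarity({"chinook", "cutthroat", "dace"},
                                      {"chinook", "dace", "sculpin"}))  # 50.0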
Based on the above-shown similarity matrix, the author derives a mean similarity matrix from the mean similarities within each river section and between the sections. As is seen from the mean similarity matrix derived by Van Sickle (1997) and shown in FIG. 3, the similarities between the river's sections vary from 26 to 47%, while the similarities between individual sites within a section vary from 56 to 70%, thus showing that there is actually no remarkable difference between the lowest similarity score within clusters (56%) and the highest similarity score between different clusters (47%). Thus the mean-similarity-based analysis, as well as other mathematical statistics methods which are used in data clustering merely out of “scientific correctness”, has little, if any, value in understanding the data underlying the study. This is an example of “premeditated clustering”, when mathematical statistics methods are applied to offer a proof of a certain assumption, or to substantiate an a priori designed clustering. From the following description of the present invention, it is seen that the method of this invention allows for finding perfect correlations in the above-referenced data.
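A minimal sketch of deriving a mean similarity matrix of the kind Van Sickle (1997) uses, i.e. averaging within-class and between-class similarities over a site-by-site matrix, is given below; the data layout and helper name are assumptions for illustration.

    import numpy as np

    def mean_similarity_matrix(S, labels):
        """Collapse a site-by-site similarity matrix S into a class-by-class
        matrix of mean similarities.

        Diagonal entries average all distinct within-class site pairs;
        off-diagonal entries average every cross-class pair of sites."""
        S = np.asarray(S, dtype=float)
        classes = sorted(set(labels))
        labels = np.asarray(labels)
        M = np.zeros((len(classes), len(classes)))
        for i, ci in enumerate(classes):
            for j, cj in enumerate(classes):
                rows = np.where(labels == ci)[0]
                cols = np.where(labels == cj)[0]
                vals = [S[r, c] for r in rows for c in cols
                        if (ci != cj) or (r < c)]   # skip self-pairs within a class
                M[i, j] = np.mean(vals) if vals else 0.0
        return classes, M

    # Example (hypothetical): collapse a 19-site matrix into Sections A..D.
    # sections, M = mean_similarity_matrix(S, ["A"]*7 + ["B"]*4 + ["C"]*4 + ["D"]*4)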
Another approach to clustering is based on analysis of a similarity matrix according to more sophisticated rules of data matching and sorting. For example, in the method described by Guha et al. in Programmed Medium For Clustering Large Databases (U.S. Pat. No. 6,092,072, Jul. 18, 2000), data points that may qualify for inclusion into a certain cluster are spotted in the chaos of data by use of iterative procedures based on determining a total number of links between each cluster and every other cluster, a pair of points being assigned as neighbors if their similarity exceeds a certain threshold. The procedure further comprises calculating a distance between representative points, then merging a pair of clusters having a closest pair of representative points, then calculating a mean of said representative points for said merged cluster, selecting a number of scattered data points from said representative points of said merged cluster, and establishing a new set of representative points for said merged cluster by shrinking said number of scattered data points toward said mean.
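The merge step paraphrased above from Guha et al. (U.S. Pat. No. 6,092,072) can be sketched roughly as follows. This is a simplified, assumed reading of a representative-points scheme of that kind (the representative-point selection rule and the shrink fraction below are illustrative), not the patented implementation itself.

    import numpy as np

    def merge_closest_clusters(clusters, shrink=0.5, n_rep=4):
        """One merge step of a representative-points scheme (sketch only).

        `clusters` is a list of dicts, each with a 'reps' array of representative
        points. The pair of clusters with the closest representative points is
        merged; scattered points of the merged cluster are picked and then
        shrunk toward its mean by the factor `shrink`."""
        # 1. Find the pair of clusters whose representative points are closest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(np.linalg.norm(p - q)
                        for p in clusters[i]["reps"] for q in clusters[j]["reps"])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # 2. Merge the two clusters and compute the mean of their points.
        points = np.vstack([clusters[i]["reps"], clusters[j]["reps"]])
        mean = points.mean(axis=0)
        # 3. Pick scattered points (here: those farthest from the mean) ...
        order = np.argsort(-np.linalg.norm(points - mean, axis=1))
        scattered = points[order[:n_rep]]
        # 4. ... and shrink them toward the mean to form the new representatives.
        new_reps = mean + (scattered - mean) * (1.0 - shrink)
        merged = {"reps": new_reps}
        return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]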