Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-07-14
2001-02-27
Black, Thomas G. (Department: 2771)
Data processing: database and file management or data structures
Database design
Data structure types
C382S128000, C707S793000
Reexamination Certificate
active
06195659
ABSTRACT:
BACKGROUND OF THE INVENTION
This invention relates generally to multivariate statistical analysis and, more specifically, to clustering techniques used to analyze statistical data. Clustering is also known by the terms unsupervised learning, and categorization. Prior art clustering techniques include the spanning tree method and the expectation-maximization method.
Clustering is based on the reasonable assumption that things having similar attributes also have similar measured characteristics. For example, biologists may categorize biological specimens based on their measured characteristics, which may be plotted in an n-dimensional grid. Specimens with similar measured characteristics falling into a statistical “cluster” of data points on the grid may be defined as belonging to a specific category of biological entity.
A similar analysis may be used to categorize stocks traded in a stock market. The measured characteristics may include share price, price-to-earnings ratio, price volatility, and so forth. When various stocks are sampled and their characteristics are plotted on an n-dimensional grid, categories of stocks emerge from the resulting clustering of patterns of data points. Such categories may be used to identify candidate stocks for purchase or sale.
Clustering techniques can be used in a variety of other fields, including signal analysis and identification, pattern recognition, geological resource exploration, marketing research, and identification of persons by analysis of fingerprints, voice patterns, retinal patterns, or some other form of biometric analysis. Clustering techniques encounter two key problems that are common to all of these applications: cluster proximity and cluster count. In particular, it is sometimes difficult to separate and distinguish clusters that are close together and may appear to overlap. Moreover, there may be some measured data points that do not fall clearly within cluster regions that have already been identified, and these raise the issue of whether to ignore the new data points, or to include them in a selected existing cluster, thereby, perhaps, extending the boundaries of the cluster, or to define a new cluster. Some clustering techniques require knowledge of the number of clusters, which is often not known and is difficult to determine. Other clustering problems include inhomogeneity of cluster density of size, and clusters of unusual shapes, such as crescents or rings. Another practical difficulty inherent to available clustering techniques is that the processing time needed is proportional to the number of data points squared, cubed, or raised to some other power. For example, the spanning tree algorithm used for clustering has a processing time proportional to N
3
, where N is the number of data points being analyzed. The spanning tree approach has the additional drawback that it cannot easily distinguish between categories that are too close together.
Because clustering has such a diverse range of applications, there is clearly a need for a new approach to clustering that can handle larger numbers of data points and categories, and can handle categories that may otherwise be statistically indistinguishable. The present invention is directed to this end.
SUMMARY OF THE INVENTION
The present invention resides in a clustering system and method that uses morphological techniques to analyze multivariate statistical data and to derive therefrom data defining clusters into which the original data points fall. Briefly, and in general terms, the method of the invention comprises the steps of storing the data in the form of a grid of data cells, each of which is switchable to a state representing a data point, the grid having as many dimensions as the multivariate statistical data; and performing multiple morphological operations on the grid of data cells to identify one or more clusters of data.
More specifically, the step of performing multiple morphological operations includes performing multiple dilation steps to expand contiguous regions of the grid covered by data points until adjacent sub-regions merge into larger regions identified as potential clusters; then performing multiple erosion steps to shrink the potential clusters until isolated smaller ones are eliminated and thin linkages between other potential clusters are removed; and then performing further multiple dilation steps to reconnect any potential clusters that were fragmented by the previous step, to leave clusters of data cells. Data points falling within the boundaries of a cluster are said to belong to that cluster and may share at least some attributes.
Each dilation step includes turning each data cell to an “on” condition if any immediately adjacent neighbor cell is already in the “on” condition. Each erosion step includes turning each data cell to an “off” condition if any immediately adjacent neighbor cell is already in the “off” condition. In the presently preferred embodiment of the invention, the step of performing multiple dilation steps includes performing M such steps; and the step of performing multiple erosion steps includes performing M+1 such steps. Finally, the step of performing further multiple dilation steps includes performing N such steps. In the specifically disclosed embodiment discussed further below, M is three and N is four, so there are three initial dilation steps, followed by four erosion steps and then four further dilation steps.
After the clusters have been defined, the method includes the further steps of identifying the data cells in each cluster; and identifying data points that fall within the boundaries of each cluster.
The invention may also be defined as a method for automatically categorizing samples of multivariate statistical data based on whether the samples fall into clusters, the method comprising the steps of storing data samples in the form of a multidimensional grid of data cells, each of which can be switched to a state representing a data point by its position in the grid; performing multiple morphological dilation steps to expand contiguous regions of the grid covered by data points, until adjacent regions merge into larger regions identified as potential cluster regions; then performing multiple morphological erosion steps to shrink the potential cluster regions until smaller ones are eliminated and thin linkages between other potential cluster regions are removed; and then performing further multiple morphological dilation steps to reconnect any potential cluster regions that were fragmented by the previous step, leaving clusters of data cells. Data points falling within the boundaries of a cluster are said to belong to that cluster and to share at least some common attributes. The method continues with the steps of labeling the data cells in each cluster as belonging to that cluster; and labeling data points that fall within the boundaries a cluster as belonging to that cluster.
As already mentioned above, each data point is represented by a data cell in an “on” condition. Each dilation step includes turning each data cell to the “on” condition if any immediately adjacent neighbor data cell is already in the “on” condition; and each erosion step includes turning each data cell to an “off” condition if any immediately adjacent neighbor data cell is already in the “off” condition.
The invention may also be defined as a system for automatically identifying clusters in samples of multivariate statistical data, comprising a memory for storing multivariate statistical data in the form of a logical multidimensional grid of data cells, each of which is switchable to a state representing a sample data point, wherein the position of the data cell in the multidimensional grid represents multiple dimensions of the data point; and a processor programmed to perform multiple morphological operations on the grid of data cells and to thereby identify one or more clusters of data.
More specifically, the processor programmed to perform multiple morphological operations includes a first morphological processing module, fo
Black Thomas G.
Trinh William
TRW Inc.
Yatsko Michael S.
LandOfFree
Method and apparatus for morphological clustering having... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method and apparatus for morphological clustering having..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and apparatus for morphological clustering having... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2567711