System and method for detecting clusters of information

Image analysis – Pattern recognition – Classification

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

Reexamination Certificate

active

06307965

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates in general to finding clusters of information in high dimensional feature spaces. Clusters are associated with a selection of features of the information. In particular, related information in a database is identified by detecting clusters of information and subsets of cluster specific features, about which the information is clustered.
BACKGROUND OF THE INVENTION
It is often desirable to detect sets of related instances in a database that correspond to related information. Instances of information are represented and stored in a database in the form of a set of data values in a multidimensional space. A dimension in the multidimensional space is a feature that characterizes objects represented by the data values. For example, consider an insurance company's database containing customer information. Each customer is an object corresponding to an instance in the database that is a customer profile, or data value, in the multidimensional feature space of the database. Each data value is an n-tuple corresponding to an instance of the features: age, sex, salary, of the customer as well as the number of cars owned by the customer. The dimensions of the multidimensional feature space are the features that characterize the customer namely the age, sex, salary, of the customer and number of cars owned by the customer.
The problem of clustering is the problem of finding sets of data values in a multidimensional feature space that are close to each other, in the sense of some metric, measure or the like, with respect to a particular subset of dimensions. A particular subset of dimensions is a subset of the features that characterize the objects represented by data values stored in the database, and is thus associated with a subspace in the multidimensional feature space. The cluster problem is a known problem in the database literature, for example D. Fisher, “Knowledge Acquisition via Incremental Conceptual Clustering”,
Machine Learning
2(2), 1987; T. Zhang, R. Ramakrishnan and M. Livny, “BIRCH: An Efficient Data Clustering Method for Very Large Databases”,
Proceedings of the ACM SIGMOD International Conference on Management of Data
, Montreal, Canada, June 1996; R. Ng and J. Han, “Efficient and Effective Clustering Methods for Spatial Data Mining”,
Proceedings of the
20
th International Conference on Very Large Data Bases
, Santiago, Chile, 1994, pp. 144-155; and M. Zait and H. Messatfa, “A Comparative Study of Clustering Methods”,
FGCS Journal, Special Issue on Data Mining,
1997. The clustering problem has numerous applications that relate to other problems such as segmentation, classification, collaborative filtering/data mining and trend analysis. It is also known that existing algorithm designed to solve the problem of clustering break down in high dimensional feature spaces. The difficulty that arises in high dimensional feature spaces is the inherent sparsity of data values. For example, in the above case when objects represented in the database as customer profiles, there may not be many clusters of customer profiles that are similar (close) with respect to all the features: age, sex, salary, number of cars, etc. Thus, when the number of features is high the data may become sparse.
In high dimensional feature spaces, however, only some of the features may be relevant when finding clusters. Therefore, one approach to handling high dimensional feature spaces is to select closely correlated features, project out or ignore the other features, and find clusters in the corresponding subspace. This approach is problematic, however, as it is difficult to find a single subset of features, i.e. one subspace, in which data values cluster well. In other words, different subsets of data values may cluster better for different subsets of features or subspaces.
The clustering problem has been discussed in the literature of both the statistical and database communities. The emphasis in the statistical community is to find clusters based on precisely defined metrics, while the emphasis in the database community is to produce methods for detecting clusters that work efficiently on large data sets. Two known algorithms for finding clusters in large databases are the BIRCH and CLARANS, see T. Zhang, R. Ramakrishnan and M. Livny, supra, and R. Ng and J. Han, supra.
As explained above, many clustering algorithms do not work efficiently in higher dimensional feature spaces because of the sparsity of the data. In many applications, execution of a clustering algorithm is preceded by feature selection. It is desirable to select particular features so that the data values in the feature space are correlated to each other in the subspace associated with the selected features. Pruning away or projecting out undesirable features reduces the number of uncorrelated so features that add noise to the data.
The problem of using traditional feature selection methods is that picking certain features in advance leads to a loss in information. Furthermore, in many data sets, some data values are related with respect to a given set of features and others are correlated with respect other features. Thus, it is often infeasible to prune away or project out too many features without at the same time incurring a substantial loss in information.
In order to illustrate this point, consider the following example. In FIG.
1
(
a
) and FIG.
1
(
b
), two different feature subspaces are illustrated. On each subspace the projection of a set of data values in 3-dimensional feature space is shown. Two patterns of the data emerge: the first pattern corresponds to the cluster
101
of data values in the X-Y plane, while the second pattern corresponds to the cluster
102
of data values in the X-Z plane.
It is advantageous to identify a way of expressing and representing such patterns. Feature pre-selection does not seem an appropriate option, because each feature is relevant to at least one of the clusters. In other words, the “closeness” of data values projected into different subspaces of the 3-dimensional feature space is not uniform with respect to the different features.
In the context of a database of customer profiles that include information about the age, sex, salary, and number of cars, it may be the case that the number of cars owned by a given customer is related in a variety of different patterns depending on whether attention is restricted to customer age or sex or salary or all of the above. Hence, different clusters are found in different subspaces depending on which set of features is considered.
SUMMARY OF THE INVENTION
The present invention includes a system and method for analyzing information in the form of a plurality of data values that represent a plurality of objects. A plurality of objects are collected. A set of features that characterize each object is identified. The plurality of data values are stored in a database. Each of the plurality of data values correspond to at least one of the plurality of objects based on the set of features. A set of clusters of information is detected in the database by associating ones of the plurality of data values with ones of the set of features.


REFERENCES:
patent: 5034981 (1991-07-01), Leonard et al.
patent: 5526446 (1996-06-01), Adelson et al.
patent: 5872850 (1999-02-01), Klein et al.
patent: 6009392 (1999-12-01), Kanevsky et al.
R. Sibson, “SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method”,The Computer Journal,The British Computer Society, vol. 16, No. 1, Feb. 1973.
Tian Zhang et al., “BIRCH: An Efficient Data Clustering Method for Very Large Databases”, Proceedings of the ACM SIGMOND International Conference on Management of Data, Montreal, Canada, Jun. 1996, pp. 103-114.
Raymond T. Ng et al., “Efficient and Effective Clustering Methods for Spatial Data Mining”, Proceedings of the 20thInternational Conference on Very Large Data Bases, Santiago, Chile, 1994, pp. 144-155.
Nick Roussopoulos et al., “Nearest Neighbor Queries”, Proceedings of the ACM-SIGMOD International Conf

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

System and method for detecting clusters of information does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with System and method for detecting clusters of information, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for detecting clusters of information will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2603498

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.