Method and apparatus for scalable probabilistic clustering...

Data processing: artificial intelligence – Knowledge processing system – Knowledge representation and reasoning technique

Reexamination Certificate


Details

Classification: C707S793000, C703S002000

Type: Reexamination Certificate

Status: active

Patent number: 06564197

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to the field of data analysis. In particular, the invention relates to the use of probabilistic clustering to produce a decision tree.
2. Description of the Related Art
Clustering
Identifying clusters helps in the identification of patterns in a set of data. As the size of sets of data such as databases, data warehouses, and data marts has grown, this type of knowledge discovery, or data mining, has become increasingly important. Data mining allows patterns and predictions about these large sets of data to be made.
Additionally, for many decision making processes, it is also important that the results be interpretable, or understandable. Complex formulas or graphical relationships may not be well suited to giving a human insight into the trends and patterns in a set of data.
For example, consider a financial institution that wants to evaluate trends in its loan practices. The actual set of data with loan information and lending decisions may have millions of data points. By identifying clusters, it is possible to identify groups of records that exhibit patterns, or strong internal consistencies, with one another. For example, one cluster of people who were approved for loans might be those with high incomes and low debts. While this is not a tremendous surprise, clustering can also identify non-obvious patterns in the data. The results may also have predictive value for future loans. For example, one cluster might reveal a high number of approved loans in one region, but not in another similar region. This information may be useful in making future lending decisions.
When the clusters are interpretable, they can be used to drive decision making processes. However, if the resulting clusters are described in terms of complex mathematical formulas, graphs, or cluster centroid vectors, the usefulness of the clusters is diminished. An example of an interpretable cluster is residents of zip code 94304 (Palo Alto, Calif.). This cluster is easily understood without additional explanation. The cluster can be used to make decisions, adjust company strategies, etc. An example of a non-interpretable cluster is one defined mathematically, e.g. all data points within a given Euclidean distance from a centroid vector.
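To make the contrast concrete, here is a small illustrative sketch (not from the patent; the field names, coordinates, and threshold are hypothetical) of the two kinds of cluster description: a human-readable membership rule versus membership defined by distance from a centroid vector.

import math

# Interpretable cluster: a rule a person can read and act on.
def in_palo_alto_cluster(record):
    return record["zip_code"] == "94304"

# Non-interpretable cluster: membership is defined by closeness to a centroid vector.
def in_centroid_cluster(point, centroid, radius):
    distance = math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))
    return distance <= radius

print(in_palo_alto_cluster({"zip_code": "94304"}))       # True, and the rule explains itself
print(in_centroid_cluster((1.2, 3.4), (1.0, 3.0), 0.5))  # True, but the numbers carry little meaning for a human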
Several techniques have been used for clustering. The underlying concept behind clustering is that each of the data points in a set of data can be viewed as a vector in a high dimensional space. The vector for a data point is comprised of attributes, sometimes called features. For example, consider a set of data with two attributes for each data element: time and temperature. Thus, a data point X can be written as a 2-vector: X = (x_1, x_2), where x_1 is time and x_2 is temperature. As the number of attributes increases, the vector increases in length. For n attributes, a data point X can be represented by an n-vector: X = (x_1, x_2, . . . , x_n).
In database terminology, the set of data could be a table, or a combination of tables. The data points are records, also called entries or rows. The attributes are fields, or columns.
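As a small illustration of this correspondence (the attribute names and values below are hypothetical), a database row with named fields maps directly to an n-vector once an attribute order is fixed:

# One record (row) with named attributes (columns); the values are made up.
record = {"time": 14.5, "temperature": 71.2}

# The same record viewed as an n-vector X = (x_1, ..., x_n) over a fixed attribute order.
attributes = ["time", "temperature"]               # the fields / features
X = tuple(record[a] for a in attributes)           # X = (x_1, x_2) = (14.5, 71.2)
print(X)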
k-means
One common technique for identifying clusters is the k-means technique. (See Krishnaiah, P. R. and Kanal, L. N., Classification, Pattern Recognition, and Reduction in Dimensionality, Amsterdam: North Holland, 1982.) The k-means technique is iterative. The process starts with the placement of k centroids in the domain space. Then, the centroids are adjusted in an iterative process until their positions stabilize. The result is that the clusters are defined in terms of the placement of the centroids.
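A minimal sketch of that iterative loop appears below. It is not the patent's method, only an illustration of the generic k-means procedure; it assumes numeric attributes, uses squared Euclidean distance, and substitutes a fixed iteration count for a convergence test.

import random

def k_means(points, k, iterations=10):
    # Minimal k-means sketch: place k centroids, then repeatedly
    # (1) assign each point to its nearest centroid and
    # (2) move each centroid to the mean of its assigned points.
    centroids = random.sample(points, k)               # initial placement of the k centroids
    for _ in range(iterations):                        # fixed count stands in for "until stable"
        clusters = [[] for _ in range(k)]
        for p in points:                               # assignment step (squared Euclidean distance)
            i = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for i, cluster in enumerate(clusters):         # update step: centroid moves to the cluster mean
            if cluster:
                centroids[i] = tuple(sum(col) / len(cluster) for col in zip(*cluster))
    return centroids, clusters

centroids, clusters = k_means([(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5)], k=2)
print(centroids)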
FIG. 1 shows a set of clusters defined by centroids through the k-means technique. The data points are indicated with “.” in the two-dimensional data domain space. The centroids are indicated with “x”. The resulting clusters are formed by those data points within a certain distance of the centroids, as indicated by the ellipsoids.
In order to position the centroids and define the clusters, the k-means technique relies on the existence of a similarity, or distance, function for the domain. For example, in a set of data with a domain comprising time and temperature data points, Euclidean distance can be used. In other cases, the Hamming distance is used. However, if the data set comprises discrete attributes, e.g. eye color, race, etc., no clear similarity function is available. This lack of domain independence for the k-means technique limits its application to data domains for which there are well defined similarity functions.
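For illustration (the values below are made up), the two distance functions mentioned above can be written as follows. Note that for a discrete attribute such as eye color, a mismatch count like Hamming distance treats every pair of differing values as equally far apart, which may not reflect any meaningful notion of similarity.

import math

def euclidean(p, q):
    # Suited to numeric domains such as (time, temperature) points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def hamming(p, q):
    # Counts the positions at which two points differ.
    return sum(a != b for a, b in zip(p, q))

print(euclidean((14.5, 71.2), (15.0, 68.0)))         # numeric attributes: distance is meaningful
print(hamming(("brown", "tall"), ("blue", "tall")))  # discrete attributes: 1, but the "size" of the
                                                     # difference between brown and blue is undefined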
The clusters that result from the k-means technique are difficult to interpret. Because the cluster is defined by a centroid and a distance from the centroid, it is not easy to interpret the results. Returning to the example of loan approval data for a bank, the resulting report would be a list of centroids and distances from the centroids for bank loan data points. The contents of the clusters would not be apparent. This type of information is not easily used to drive decision making processes, except perhaps after further computer analysis.
The k-means technique is also fairly computationally expensive, especially given that additional computational resources will have to be used if any analysis of the clusters is required. In big-O notation, the k-means algorithm is O(knd), where k is the number of centroids, n is the number of data points, and d is the number of iterations.
Hierarchical Agglomerative Clustering
Another prior art technique is hierarchical agglomerative clustering (HAC). (See Rasmussen, E., Clustering Algorithms. In Information Retrieval: Data Structures and Algorithms, 1992.) The basic idea behind HAC is that the clusters can be built in a tree-like fashion, starting from clusters of one data point and then combining the clusters until a single cluster with all of the data points is constructed.
FIG. 2 illustrates the clusters generated by HAC. The process is as follows: each data point is placed into a cluster by itself, shown by circles surrounding the single data points in FIG. 2. Then, at the next step, a similarity, or distance, function is used to find the closest pair of smaller clusters, which are then merged into a larger cluster. The resulting clusters are junctions in the dendrogram shown in FIG. 2. The process of combining clusters continues as the tree is built from the bottom to the top, as indicated by the arrow in FIG. 2 showing the flow of time.
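The merging loop described above can be sketched as follows. This is only an illustration, not the patent's method: it assumes numeric points, Euclidean distance, and single-linkage (closest pair of members) as the cluster-to-cluster distance.

import math

def hac(points):
    # Minimal agglomerative sketch: start with one cluster per point and
    # repeatedly merge the closest pair until a single cluster remains.
    clusters = [[p] for p in points]                   # each point starts in its own cluster
    merges = []                                        # records the junctions of the dendrogram
    while len(clusters) > 1:
        best = None                                    # find the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges

print(hac([(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]))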
As in the k-means technique, a similarity, or distance, function is needed. Therefore, HAC cannot be used on data domains with discrete attributes without a suitable distance function. Also, as in the k-means technique, the resulting clusters are not interpretable other than by their centroids. For example, turning to the clusters developed in FIG. 2, if the user decided to consider four clusters, they would select the stage of the process where four clusters existed. Those clusters, though, are not susceptible to meaningful interpretation except perhaps through further computer analysis. Also, HAC is computationally expensive: O(n^2), where n is the number of data points.
Returning to the example of loan approval data for a financial institution, knowing that there are two clusters, one with five million data points and the other with seven million, does not convey much, or perhaps any, meaningful information to a human. That is because the clusters produced by HAC are defined in terms of centroids, as in k-means.
AutoClass
Another prior art technique is AutoClass, developed by NASA. (See Cheeseman, P. and Stutz, J., Bayesian Classification (AutoClass): Theory and Results. In Advances in Knowledge Discovery and Data Mining, AAAI Press, 1996.) Unlike k-means and HAC, AutoClass can work on domains with discrete attributes and is domain independent because no domain-specific similarity functions are required. The concept behind AutoClass is to identify k distributions, e.g. the n-dimensional Gaussian distribution, and fi
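The general idea named here, identifying k distributions such as n-dimensional Gaussians and fitting them to the data, is the standard mixture-model approach. The sketch below is a toy one-dimensional expectation-maximization fit of a two-component Gaussian mixture; it is illustrative only, uses made-up data, and is not the AutoClass algorithm itself.

import math
import random

def gaussian_pdf(x, mean, var):
    # Density of a one-dimensional Gaussian with the given mean and variance.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_mixture(data, k=2, iterations=20):
    # Toy EM fit of a 1-D mixture of k Gaussians (illustrative only).
    means = random.sample(data, k)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            probs = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj, 1e-6)
    return weights, means, variances

data = [random.gauss(0, 1) for _ in range(50)] + [random.gauss(5, 1) for _ in range(50)]
print(fit_mixture(data))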
