Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-05-22
2001-07-17
Amsbury, Wayne (Department: 2672)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000
Reexamination Certificate
active
06263337
ABSTRACT:
FIELD OF THE INVENTION
The present invention concerns database analysis and more particularly concerns apparatus and method for clustering of data into data sets that characterize the data.
BACKGROUND ART
Large data sets are now commonly used in most business organizations. In fact, so much data has been gathered that asking even a simple question about the data has become a challenge. The modern information revolution is creating huge data stores which, instead of offering increased productivity and new opportunities, are threatening to drown the users in a flood of information. Tapping into large databases for even simple browsing can result in an explosion of irrelevant and unimportant facts. Even people who do not ‘own’ large databases face the overload problem when accessing databases on the Internet. A large challenge now facing the database community is how to sift through these databases to find useful information.
Existing database management systems (DBMS) perform the steps of reliably storing data and retrieving the data using a data access language, typically SQL. One major use of database technology is to help individuals and organizations make decisions and generate reports based on the data contained in the database.
An important class of problems in the areas of decision support and reporting are clustering (segmentation) problems where one is interested in finding groupings (clusters) in the data. Clustering has been studied for decades in statistics, pattern recognition, machine learning, and many other fields of science and engineering. However, implementations and applications have historically been limited to small data sets with a small number of dimensions.
Each cluster includes records that are more similar to members of the same cluster than they are similar to rest of the data. For example, in a marketing application, a company may want to decide who to target for an ad campaign based on historical data about a set of customers and how they responded to previous campaigns. Other examples of such problems include: fraud detection, credit approval, diagnosis of system problems, diagnosis of manufacturing problems, recognition of event signatures, etc. Employing analysts (statisticians) to build cluster models is expensive, and often not effective for large problems (large data sets with large numbers of fields). Even trained scientists can fail in the quest for reliable clusters when the problem is high-dimensional (i.e. the data has many fields, say more than 20).
A goal of automated analysis of large databases is to extract useful information such as models or predictors from the data stored in the database. One of the primary operations in data mining is clustering (also known as database segmentation). Clustering is a necessary step in the mining of large databases as it represents a means for finding segments of the data that need to be modeled separately. This is an especially important consideration for large databases where a global model of the entire data typically makes no sense as data represents multiple populations that need to be modeled separately. Random sampling cannot help in deciding what the clusters are. Finally, clustering is an essential step if one needs to perform density estimation over the database (i.e. model the probability distribution governing the data source).
Applications of clustering are numerous and include the following broad areas: data mining, data analysis in general, data visualization, sampling, indexing, prediction, and compression. Specific applications in data mining including marketing, fraud detection (in credit cards, banking, and telecommunications), customer retention and churn minimization (in all sorts of services including airlines, telecommunication services, internet services, and web information services in general), direct marketing on the web and live marketing in Electronic Commerce.
The framework we present in this invention satisfies the following Data Mining criteria:
1. Require one scan (or less) of the database if possible: a single data scan is considered costly, early termination if appropriate is highly desirable.
2. On-line “anytime” behavior: a “best” answer is always available from the system, with status information on progress, expected remaining time, etc.
3. Suspendable, stoppable, resumable; incremental progress saved to resuming a stopped job.
4. Ability to incrementally incorporate additional data with existing models efficiently.
5. Work within confines of a given limited RAM buffer
6. Utilize variety of possible scan modes: sequential, index, and sampling scans if available.
7. Ability to operate with forward-only cursor over a view of the database. This is necessary since the database view may be a result of an expensive join query, over a potentially distributed data warehouse, with much processing required to construct each row (case).
SUMMARY OF THE INVENTION
The present invention concerns a process for scaling a known probabilistic clustering algorithm, the EM algorithm (Expectation Maximization) to large databases. The method retains in memory only data points that need to be present in memory. The majority of the data points are summarized into a condensed representation that represents their sufficient statistics. By analyzing a mixture of sufficient statistics and actual data points, the invention achieves much better clustering results than random sampling methods and with dramatically lower memory requirements. The invention allows the clustering process to terminate before scanning all the data in the database, hence gaining a major advantage over other scalable clustering methods that require at a minimum a full data scan.
The technique embodied in our invention relies on the observation that the EM algorithm applied to a mixture of Gaussians does not need to rescan all the data items as it is originally defined and as implemented in popular literature and statistical libraries and analysis packages. The method can be viewed as an intelligent sampling scheme that employs some theoretically justified criteria for deciding which data can be summarized and represented by a significantly compressed set of sufficient statistics, and which data items must be maintained as individual data records in RAM, and hence occupy a valuable resource. On any given iteration of the algorithm, the sampled data is partitioned into three subsets: A discard set (DS), a compression set (CS), and a retained set (RS). For the first two sets, data is discarded but representative sufficient statistics are kept that summarize the subsets. The retained data set RS is kept in memory.
In accordance with an exemplary embodiment of the invention, data clustering is performed by characterizing an initial set of data clusters with a data distribution and obtaining a portion of data from a database stored on a storage medium. This data is then assigned to each of the plurality of data clusters with a weighting factor. The weighting factor assigned to a given data record is used to update the characterization of the data clusters. Some of the data that was accessed from the database is then summarized or compressed based upon a specified criteria to produce sufficient statistics for the data satisfying the specified criteria. The process of gathering data from the database and characterizing the data clusters based on newly sampled data from the database takes place until a stopping criteria has been satisfied.
The invention differs from a clustering system that uses a traditional K-means clustering choice where each data point is assigned to one and only one cluster. In this so called EM process using an expectation maximization process, each data point is assigned to all of the clusters but it is assigned with a weighted factor that normalizes the contributions. These and other objects, advantages and features of the invention will become understood from a review of an exemplary embodiment of the invention which is described in conjunction with the accompanying drawings.
REFERENCES:
patent: 5
Bradley Paul S.
Fayyad Usama
Reina Cory
Amsbury Wayne
Havan Thu-Thao
Microsoft Corporation
Watts, Hoffmann, Fisher & Heinke Co. L.P.A.
LandOfFree
Scalable system for expectation maximization clustering of... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Scalable system for expectation maximization clustering of..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Scalable system for expectation maximization clustering of... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2449057