Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1997-12-02
2001-04-03
Ho, Ruay Lian (Department: 2172)
C707S793000, C707S793000
active
06212526
ABSTRACT:
FIELD OF THE INVENTION
The present invention concerns building classification models (also called classifiers) from a large database having records stored on a mass storage device (database system).
BACKGROUND ART
Computer database software stores data records in tables. A database is a set of tables of data along with information about the relations between these tables. Tables represent relations over the data and consist of one or more fields (sometimes called columns); a set of records makes up a table. The contents of a database are extracted and/or manipulated using a query language that is supported by the database software. Current query languages support the extraction of information that is well-specified (e.g. via a SQL query or a logical specification of a subset of the data contained in the database). To retrieve data from the database, one must therefore specify an exact description of the desired target data set.
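As a minimal sketch of what "well-specified" means here, the following Python snippet issues an exact SQL query through the standard sqlite3 module. The `sales` table, its columns, and the sample rows are hypothetical and serve only to illustrate that the target set must be described exactly (here, by a date range and a grouping field).

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", "1997-01-15", 120.0),
        ("north", "1997-02-03", 80.0),
        ("south", "1997-01-20", 200.0),
        ("south", "1997-03-01", 50.0),
    ],
)

def sales_by_store(conn, start, end):
    """Total sales per store over an exactly specified period."""
    sql = (
        "SELECT store, SUM(amount) FROM sales "
        "WHERE sale_date BETWEEN ? AND ? "
        "GROUP BY store ORDER BY store"
    )
    return conn.execute(sql, (start, end)).fetchall()

report = sales_by_store(conn, "1997-01-01", "1997-02-28")
```

Every part of the desired result is spelled out in the query text; nothing is left for the system to infer.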
One important use of database technology is to help individuals and organizations make decisions based on the data contained in the database. Decision support information varies from the well-specified (e.g. give me a report of sales by store over an exactly specified period of time) to the not-so-well specified (e.g. find me records in the data that are “similar” to records in table A but “dissimilar” from records in table B). In the latter case, the target set is not specified via an exact query, but is specified implicitly by labeling (or tagging) records in the data (e.g. these records are to be tagged ‘A’ and these others are to be tagged ‘B’). This is an example of a classification task. Classification is one of the activities used in the areas of decision support, data analysis, and data visualization. Given an existing database containing records, one is interested in predicting the values of some target fields (also called variables) based on the values of other fields (variables) in the database.
As an example, a marketing company may want to decide whom to target for an ad campaign based on historical data about a set of customers and how they responded to previous ad campaigns. In this case, there is one field being predicted: the field in the database that indicates whether a customer responded to a previous campaign (call it the “response field”). The fields used to predict the response field (and hence to classify records) are other fields in the database, for example the age of the customer, whether the customer owns a vehicle, the presence of children in the household, and so forth. Other examples where classification over a database is useful include fraud detection, credit approval, diagnosis of system problems, diagnosis of manufacturing problems, recognition of signatures in signals, and object recognition in image analysis.
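To make the marketing example concrete, here is a minimal sketch of predicting a response field from other fields. It uses a simple one-rule classifier (pick the single field whose values best predict the target), which is chosen only for brevity and is not the method of the invention. The records, the field names (`age_group`, `owns_vehicle`, `has_children`, `responded`), and the function `one_rule` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical historical customer records; "responded" is the response field.
customers = [
    {"age_group": "young", "owns_vehicle": True, "has_children": False, "responded": False},
    {"age_group": "young", "owns_vehicle": False, "has_children": True, "responded": True},
    {"age_group": "middle", "owns_vehicle": True, "has_children": True, "responded": True},
    {"age_group": "middle", "owns_vehicle": False, "has_children": False, "responded": False},
    {"age_group": "senior", "owns_vehicle": True, "has_children": True, "responded": True},
    {"age_group": "senior", "owns_vehicle": True, "has_children": False, "responded": False},
]

def one_rule(records, target):
    """Return (field, rule, errors) for the single most predictive field.

    For each candidate field, every observed value is mapped to the
    majority class among records with that value; the field whose
    mapping makes the fewest errors on the records is kept.
    """
    best = None
    fields = [f for f in records[0] if f != target]
    for field in fields:
        counts = defaultdict(Counter)
        for r in records:
            counts[r[field]][r[target]] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for r in records if rule[r[field]] != r[target])
        if best is None or errors < best[2]:
            best = (field, rule, errors)
    return best

field, rule, errors = one_rule(customers, "responded")

# Classify a new customer using the selected field only.
new_customer = {"age_group": "young", "owns_vehicle": True, "has_children": True}
prediction = rule[new_customer[field]]
```

On this toy data the selected field is `has_children`, because it separates responders from non-responders without error. Real data would rarely be this clean, and a practical classifier would combine several fields.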
MOTIVATION FOR INVENTION
Using human experts such as statisticians or domain experts (such as data analysts, engineers, or marketing experts) to build classifiers based on existing data is expensive and may not be accurate, especially for problems involving large data sets that have a multitude of fields. Even trained scientists can fail in the search for reliable classifiers when the data records have many fields (i.e. the dimensionality of the data is high) and the number of records is large. An illustrative example of this problem, in the case of classifying objects in an astronomical survey database (deciding whether a sky object is a star or a galaxy), is given in [Fayyad et al 1996]. In that example, a classifier was automatically constructed based on a data table that had 40 fields. Notably, trained scientists were not able to solve this problem effectively. Techniques for building classification models (classifiers) from data automatically have appeared in the pattern recognition, statistics, and computer science literature.
[Fayyad et al 1996] U. M. Fayyad, S. G. Djorgovski, and N. Weir, “Automating the Analysis and Cataloging of Sky Surveys”, a chapter in Advances in Knowledge Discovery and Data Mining, U. Fayyad et al (Eds.), pp. 471-493, MIT Press (1996).
Historically, the methods presented in the literature and implemented in statistical analysis packages do not scale to large data sets. Such methods assume that the data can be loaded into the main memory of the computer system and manipulated conveniently in RAM to build a classification model. If the data in question is larger than what can be held in core memory (RAM), the classification software's performance rapidly deteriorates, and in many cases the process terminates when virtual memory is exhausted. One goal of the present invention is an automated means of classifying data that can be used on large databases. Notably, the performance of the classification method defined in this invention does not deteriorate if the data cannot fit in memory, since the invention takes into account that the data is too large to fit in memory, is hence resident on disk, and should be carefully accessed through the database management system (DBMS).
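One way to picture accessing data through the DBMS rather than loading it into RAM is to let the database compute small summaries that a classifier needs. The sketch below is an assumption made for illustration and is not taken from the text above: it asks the DBMS for counts of each (field value, class label) combination, so only the summary is held in memory. The `customers` table, its columns, and the helper `class_counts` are hypothetical.

```python
import sqlite3

def class_counts(conn, table, field, target):
    """Return {(field_value, class_label): record_count}, computed by the DBMS."""
    # Identifiers cannot be bound as SQL parameters; the names passed in
    # are assumed to be trusted, hard-coded values.
    sql = (
        f'SELECT "{field}", "{target}", COUNT(*) '
        f'FROM "{table}" GROUP BY "{field}", "{target}"'
    )
    counts = {}
    # Rows are consumed one at a time from the cursor; the full table
    # is never materialized in Python.
    for value, label, n in conn.execute(sql):
        counts[(value, label)] = n
    return counts

# Hypothetical data: 1 = yes, 0 = no.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (owns_vehicle INTEGER, responded INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0)],
)

counts = class_counts(conn, "customers", "owns_vehicle", "responded")
```

The size of the returned summary depends on the number of distinct value and class combinations, not on the number of records, which is why this style of access tolerates tables far larger than main memory.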
With the growth and proliferation of databases (and now data warehouses) into almost every area of organized activity (business, science, personal, and government), the need has grown for automated tools to help analyze, visualize, and summarize the contents of large databases. These techniques are sometimes referred to as data mining techniques. Data mining techniques allow for the possibility of computer-driven exploration of the data. This opens up the possibility of a new way of interacting with databases: specifying queries at a much more abstract level than SQL permits. It also facilitates data exploration for problems that, due to high dimensionality, would otherwise be very difficult for humans to explore, regardless of how difficult or inefficient SQL is to use.
While DBMS developments have focused on efficiently executing a query once it has been formulated, little attention has been given to the effort needed to formulate queries in the first place. Decision support queries are typically difficult to formulate. The basic problem is: how can we provide access to data when the user does not know how to describe the goal in terms of a specific query? This situation is fairly common in decision support. One example is a credit card or telecommunications company that would like to query its database of usage data for records representing fraudulent cases. Another example is a scientist dealing with a large body of data who would like to request a catalog of events of interest appearing in the data. Such patterns, while recognizable by human analysts on a case-by-case basis, are typically very difficult to describe in a SQL query or even as a computer program in a stored procedure. Often the domain expert does not even know what variables influence the classification problem of interest, but data about past occurrences may be plentiful. A more natural means of interacting with the database is to state the query by example. In this case, the analyst would label a training set of examples of cases of one class versus another and let the data mining system build a model for distinguishing one class from another. The system can then apply the extracted classifier to search the full database for events of interest or to classify future cases as they arrive. This approach is typically much more feasible than detailed modeling of the causal mechanisms underlying the phenomena because examples are usually easily available, and humans find it natural to interact at the level of labeling cases. Also, it is often only possible to obtain a label (class) for a case in retrospect (e.g. fraud).
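The query-by-example workflow described above (label a few cases, build a model, apply it to the remaining records) can be sketched as follows. A nearest-centroid classifier is used purely because it is short; it is a stand-in, not the classifier of the invention. The usage features (calls per day, average call minutes), the account identifiers, and the function names are hypothetical.

```python
from math import dist

# Hypothetical labeled examples: (calls_per_day, avg_call_minutes) -> class.
labeled = [
    ((2.0, 3.0), "normal"),
    ((3.0, 4.0), "normal"),
    ((40.0, 1.0), "fraud"),
    ((50.0, 2.0), "fraud"),
]

def fit_centroids(examples):
    """Compute the mean feature vector of each labeled class."""
    sums = {}
    for features, label in examples:
        total, n = sums.get(label, ([0.0] * len(features), 0))
        sums[label] = ([t + x for t, x in zip(total, features)], n + 1)
    return {label: tuple(t / n for t in total) for label, (total, n) in sums.items()}

def classify(centroids, features):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = fit_centroids(labeled)

# Apply the learned model to unlabeled records (a stand-in for the full database).
database = {
    "acct-1": (2.5, 3.5),
    "acct-2": (45.0, 1.5),
    "acct-3": (4.0, 2.0),
}
flagged = [acct for acct, feats in database.items() if classify(centroids, feats) == "fraud"]
```

The analyst never writes a query that defines fraud; the labeled examples play that role, and the model turns them into a decision that can be applied to every record.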
The difficulty in building classification models manually derives from the fact that humans find it particularly difficult to visualize and understand a large data set. Data can grow along two dimensions: the number of fields (also called dimensions or attributes) and the number of cases. Human analysis and visualization abilities do not
Chaudhuri Surajit
Fayyad Usama
Ho Ruay Lian
Microsoft Corporation
Watts, Hoffmann, Fisher & Heinke Co. L.P.A.