Title: Using simulated pseudo data to speed up statistical...
Patent number: 06388592
Type: Reexamination Certificate
Status: active
Filed: 2001-01-18
Issued: 2002-05-14
Examiner: Young, Brian (Department: 2819)
Classification: Coded data generation or conversion – Digital code to digital code converters – To or from code based on probability
Other classes: C706S045000
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to a method for building statistical predictive models from massive data sets in a computationally tractable way and, more particularly, to an efficient heuristic approach that uses a simple probability model obtained from the massive training data set to generate simulated pseudo data for most aspects of the overall predictive modeling procedure. This heuristic approach reduces the number of expensive data scans over the massive training data set and, consequently, leads to a significant reduction in the computational cost without greatly affecting the overall accuracy of the statistical modeling procedure.
2. Background Description
Predictive modeling techniques are of central importance in applications such as database marketing, targeted advertising, fraud detection, credit/risk modeling, and information retrieval. In all these applications, the training data sets can be enormous; for example, certain marketing databases that we have encountered contain millions of customer records with thousands of attributes per record. However, the computational cost of existing statistical procedures is prohibitive for the massive data sets encountered in these applications.
Therefore, in these applications, the computational cost is typically reduced by preliminary data reduction prior to the statistical modeling procedure. For example, consider a data set with N records and J features per record, where N and J are both large. Sub-sampling can be used to select some N′<<N records, and a preliminary feature selection or transformation can be used to select some J′<<J features. This reduced data set with N′ records and J′ features is then used to perform the required statistical modeling in a computationally tractable way. However, sub-sampling is unsatisfactory for heterogeneous, high-dimensional data sets, because the resulting model estimates have a high degree of variability with respect to the sampling. Similarly, preliminary feature selection or transformation is unsatisfactory because it may inadvertently exclude or de-emphasize original features that are crucial to the accuracy of the eventual statistical models.
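As a minimal sketch of this kind of preliminary reduction (the variance-based feature filter and all names below are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def reduce_dataset(X, n_keep, j_keep, seed=0):
    """Preliminary data reduction: sub-sample N' << N records and
    keep J' << J features (illustrative variance-based filter)."""
    rng = np.random.default_rng(seed)
    # Sub-sample N' records uniformly at random.
    rows = rng.choice(X.shape[0], size=n_keep, replace=False)
    # Keep the J' highest-variance features; one of many possible
    # preliminary feature-selection heuristics.
    cols = np.argsort(X.var(axis=0))[-j_keep:]
    return X[np.ix_(rows, cols)]
```

Both reduction steps discard information before any model has seen the full data, which is exactly the weakness criticized above.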
Other techniques have recently emerged for scaling up statistical modeling algorithms to massive data sets. For example, G. Graefe, U. Fayyad and S. Chaudhuri, “On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases”, Proc. of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 204-208 (1998), and A. W. Moore and M. S. Lee, “Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets”, Journal of Artificial Intelligence Research, vol. 8 (1998), describe efficient methods for evaluating the sufficient statistics for specific statistical modeling algorithms, which allows the modeling computations to be carried out without having to refer back to the original data. These approaches work well if the statistical modeling algorithm is fixed in advance, and if it has a finite set of sufficient statistics that can be easily obtained from the data. However, one or the other of these assumptions may not hold in many predictive modeling situations.
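To make the sufficient-statistics idea concrete, the following minimal sketch (our illustration, not code from the cited papers) accumulates in a single pass the class-conditional counts from which a categorical Naive Bayes model can later be fit without revisiting the raw records:

```python
from collections import Counter

def gather_nb_counts(records):
    """Single pass over (features, label) pairs; the returned counts
    are sufficient statistics for a categorical Naive Bayes model."""
    class_counts = Counter()
    feature_counts = Counter()
    for features, label in records:
        class_counts[label] += 1
        for j, value in enumerate(features):
            feature_counts[(label, j, value)] += 1
    return class_counts, feature_counts
```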
Other approaches to this problem include P. S. Bradley, U. Fayyad and C. Reina, “Scaling Clustering Algorithms to Large Databases”, Proc. of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 9-15 (1998), and W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes and D. Pregibon, “Squashing Flat Files Flatter”, Proc. of the Fifth International Conference on Knowledge Discovery and Data Mining, pp. 6-15 (1999), which describe methods for producing compressed data sets consisting of a small number of pseudo data points and associated weights. These compressed data sets are used to evaluate statistical models that are more accurate and less variable than those obtained from an equivalent-sized, randomly sub-sampled subset of the original data. However, these approaches require that existing predictive modeling algorithms and computer codes be capable of handling weights for the data points, which is a limitation.
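The weight requirement can be made concrete with a small sketch (ours, not the cited authors'): every estimator applied to a squashed data set must accept a weight per pseudo point, as in this weighted moment computation:

```python
import numpy as np

def weighted_moments(pseudo_points, weights):
    """Mean and variance of a compressed data set of weighted pseudo
    points; the weights enter every statistic explicitly."""
    x = np.asarray(pseudo_points, dtype=float)
    w = np.asarray(weights, dtype=float)
    mean = np.average(x, weights=w, axis=0)
    var = np.average((x - mean) ** 2, weights=w, axis=0)
    return mean, var
```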
SUMMARY OF THE INVENTION
According to our invention, we provide a heuristic approach that yields a computationally efficient procedure for statistical predictive modeling from massive data sets.
This approach is based on constructing a simple probability model from the data and using that model to generate simulated pseudo data for some aspects of the statistical modeling procedure, for example, preliminary model selection or model parameter estimation. The preliminary model can then be refined in a final step by reverting to the original training data. This reduces the number of data scans over the massive training data set and thereby reduces the computational cost of the overall statistical modeling procedure.
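As a hedged illustration of the two-scan idea (a minimal sketch assuming an independence model over categorical features; the text above does not commit to this particular simple model), the first scan estimates per-feature marginal distributions, from which simulated pseudo records are then drawn:

```python
import numpy as np

def fit_marginals(X):
    """Data scan 1: estimate an independence model, i.e. the marginal
    distribution of each categorical feature column of X."""
    return [np.unique(col, return_counts=True) for col in X.T]

def sample_pseudo_data(marginals, n, seed=0):
    """Draw n simulated pseudo records from the independence model;
    these stand in for the massive data set during model search."""
    rng = np.random.default_rng(seed)
    cols = [rng.choice(values, size=n, p=counts / counts.sum())
            for values, counts in marginals]
    return np.column_stack(cols)
```

Model selection then runs entirely against the pseudo records, and only the final refinement step scans the original data again.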
In order to illustrate this approach, we consider the specific problem of a Naive Bayes model with feature selection, a widely used predictive modeling technique in the machine learning community for multivariate categorical response variables. A straightforward feature selection algorithm, such as a greedy stepwise search, would require O(J²) data scans over the training data in order to find the Naive Bayes model with the optimum feature subset, since each of up to J selection steps evaluates O(J) candidate features and each evaluation requires a pass over the data. However, our heuristic approach uses just two data scans to obtain models of comparable accuracy.
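A hedged sketch of how the feature search can run against pseudo data (the greedy forward search, the cross-validated scoring, and scikit-learn's CategoricalNB are our illustrative choices, not the patent's exact algorithm):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB

def greedy_select(X_pseudo, y_pseudo, max_features):
    """Greedy forward feature selection for Naive Bayes, scored
    entirely on simulated pseudo data (no scans of the real data).
    X_pseudo must contain integer-coded categorical features."""
    # min_categories guards against category values that appear in a
    # validation fold but not in the corresponding training fold.
    nb = CategoricalNB(min_categories=int(X_pseudo.max()) + 1)
    selected = []
    remaining = list(range(X_pseudo.shape[1]))
    best = -np.inf
    while remaining and len(selected) < max_features:
        score, j = max(
            (cross_val_score(nb, X_pseudo[:, selected + [j]],
                             y_pseudo, cv=3).mean(), j)
            for j in remaining
        )
        if score <= best:
            break  # no remaining feature improves the subset
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected
```

Under this scheme the search itself touches only pseudo records; the second and final scan of the massive data set merely re-estimates the chosen model's probability tables.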
REFERENCES:
patent: 6182058 (2001-01-01), Kohavi
Attorney, Agent or Firm: Kaufman, Stephen C.; Whitham, Curtis & Christofferson, P.C.