Fast extraction of one-way and two-way counts from sparse data

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C705S007380, C707S793000

Reexamination Certificate

active

06360224

ABSTRACT:

RELATED APPLICATIONS
This application is related to the cofiled, coassigned, and copending U.S. Application No. 09/298,600 which is entitled “Fast Clustering with Sparse Data,” and is hereby incorporated by reference.
FIELD OF THE INVENTION
This invention relates generally to data modeling, and more particularly to extracting two-way counts utilizing a sparse representation of the initial data set.
BACKGROUND OF THE INVENTION
Data modeling has become an important tool in solving complex and large real-world computerizable problems. For example, a web site such as www.msnbc.com has many stories available on any given day or month. The operators of such a web site may desire to know whether there are any commonalties associated with the viewership of a given set of programs. That is, if a hypothetical user reads one given story, can with any probability it be said that the user is likely to read another given story. Yielding the answer to this type of inquiry allows the operators of the web site to better organize their site, for example, which may in turn yield increased readership.
For problems such as these, data analysts frequently turn to advanced statistical tools. Such tools include building and analyzing statistical models such as naïve-Bayes models, decision trees, and branchings, which are a special class of Bayesian-network structures, all of which are known within the art. To construct these models, generally two-way counts must first be extracted from the source data. Two-way counts for a pair of discrete variables define, for each pair of states of the two variables (each pair of states being a unique pair of one variable having a given value and the other variable having another given value, such that no pair has the same values for the variables as does another pair), the number of records in which that pair of states occur in the data. In other words, the counts summarize the information that the data provides about the relationship between the two variables, assuming that this relationship is not influenced by the values for any of the other variables in the domain.
A disadvantage to extracting two-way counts is that generally, as the size of the data set increases, the running time to perform the extraction increases even moreso. This is problematic for problems such as the web site example just described, because typically the data set can run into the millions of records, impeding timely analysis thereof. Thus, a data analyst may not build models that are based on two-way counts extraction as much as he or she would like to.
For these and other reasons, there is a need for the present invention.
SUMMARY OF THE INVENTION
The invention relates to extraction of two-way counts utilizing a sparse representation of the data set. In one embodiment, a data set is first input. The data set has a plurality of records. Each record has at least one attribute, where each attribute has a default value. The method stores a sparse representation of each record, such that the value of an attribute of the record is stored only if it varies from the default value (that is, if the value equals the default value, it is not stored). A data model is then generated, utilizing the sparse representation. Generation of the data model includes initially extracting two-way counts from the sparse representation. Finally, the model is output.
In one embodiment, extracting the two-way counts from the sparse representation includes explicitly counting two-way counts only for values of the attributes that vary from the default values, and explicitly counting one-way counts also only for values of the attributes that vary from the default values. The remaining one-and two-way counts are then derived. For a data set where most attributes of most records are equal to default values, this embodiment of the invention greatly speeds the run time of extracting two-way counts, and, thus, greatly decreases the run time in which statistical models utilizing two-way counts can be generated.
The invention includes computer-implemented methods, machine-readable media, computerized systems, and computers of varying scopes. Other aspects, embodiments and advantages of the invention, beyond those described here, will become apparent by reading the detailed description and with reference to the drawings.


REFERENCES:
patent: 5704017 (1997-12-01), Heckerman et al.
patent: 5977889 (1999-11-01), Cohen
patent: 6009432 (1999-12-01), Tarin
patent: 6117185 (2000-09-01), Schmidt
patent: 6154736 (2000-11-01), Chickering et al.
patent: 0 789 309 (1997-08-01), None
Andrew Moore, Mary Soon Lee, Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets, Journal of Artificial Intelligence Research 8 (Mar. 1998) 67-91.*
U.S. application No. 08/902,759, “Belief Networks” unissued, pending ap.
Aizawa, Reducing the dimensions of attributes by selection and aggregation, Discovery Science, First Int'l Conference, DS '98, Proceedings, Fukuoka, Japan, Dec. 14-16, 1998, pp. 417-418.
Gessert, Handling missing data by using stored truth values, Sigmod Record, Sep. 1991, US, vol. 20, No. 3, pp. 30-42.
Sung, Data mining in a large database environment, IEEE Int'l Conf on Systems, Man, and Cybernetics, New York, IEEE, Oct. 14, 1996, pp. 988-993, specifically p 991, col 1-p 992, col 2 I 29.
Mueller, Fast sequential and parallel algorithms for associate rule mining, a comparison, Technical Report CS-TR-3515, Aug. 1995, Univ. of Maryland, College Park, pp. 1-76, specifically p 22 1 36-p 24 I 7.

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Fast extraction of one-way and two-way counts from sparse data does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Fast extraction of one-way and two-way counts from sparse data, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Fast extraction of one-way and two-way counts from sparse data will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2822879

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.