Data processing: structural design – modeling – simulation – and em – Modeling by mathematical expression
Reexamination Certificate
1999-04-23
2003-04-29
Broda, Samuel (Department: 2123)
Data processing: structural design, modeling, simulation, and em
Modeling by mathematical expression
C707S793000, C707S793000, C706S021000
Reexamination Certificate
active
06556958
ABSTRACT:
FIELD OF THE INVENTION
This invention relates generally to data modeling, and more particularly to efficient data modeling utilizing a sparse representation of the initial data set.
BACKGROUND OF THE INVENTION
Data modeling has become an important tool in solving complex and large real-world computerizable problems. For example, a web site such as www.msnbc.com has many stories available on any given day or month. The operators of such a web site may desire to know whether there are any commonalties associated with the viewership of a given set of programs. That is, if a hypothetical user reads one given story, can with any probability it be said that the user is likely to read another given story. Yielding the answer to this type of inquiry allows the operators of the web site to better organize their site, for example, which may in turn yield increased readership.
For problems such as these, data analysts frequently turn to advanced statistical tools. One such tool is the Expectation Maximization algorithm (hereinafter, the EM algorithm), known within the art, many times used in conjunction with a naïve-Bayes (or, Bayesian) model, as also known within the art. The EM algorithm permits a data analyst to assign each data record (e.g., the stories a given user has read) to a cluster of like data records, where each cluster has a probabilistic dependence for each story. Thus, assignment of a user into a cluster is useful to predict what other stories or types of stories a user may have also read, or may read in the future. (Generally, a cluster is a collection of records, where each record has a similar set of attribute values.)
A disadvantage to utilization of the EM algorithm is that generally, as the size of the data set increases, the running time of the algorithm increases even moreso. This is because the EM algorithm typically cycles over the records in a data set many times, in an iterative manner, until a given convergence criterion is met. This is problematic for problems such as the web site example just described, because typically the data set can run into the millions of records, impeding timely analysis thereof. Thus, a data analyst may not utilize the EM algorithm as much as he or she would like to.
For these and other reasons, there is a need for the present invention.
SUMMARY OF THE INVENTION
The invention relates to efficient data modeling utilizing a sparse representation of the data set. In one embodiment, a data set is first input. The data set has a plurality of records. Each record has at least one attribute, where each attribute has a default value. The attributes of one record can in one embodiment largely overlap with the attributes of another record, although the specific values of the attributes for one record may not so overlap. The method stores a sparse representation of each record, such that the value of each attribute of the record is stored only if it varies from the default value (that is, if the value equals the default value, it is not stored). A data model is then generated, utilizing the sparse representation, and the model is output.
Embodiments of the invention, by utilizing the sparse representation summarized in the preceding paragraph, provide for more efficient and less time-intensive data modeling of the data set. For example, in the context of the EM algorithm, the determination of a joint probability distribution over the cluster membership and the record values is generally required to be made for each record. This typically requires determination of a posterior probability for each attribute of each record, which in the prior art can be a time-consuming process. As a further example, embodiments of the invention provide for speeded-up clustering, as can be appreciated by those of ordinary skill within the art.
However, under embodiments of the invention, determination of a posterior probability can involve only initially generating a joint probability based on the default value of an attribute, and determining the posterior probability based on this joint probability—such that this posterior probability is used for any record having the default value for this attribute. In other words, the posterior probability does not have to be generated anew for every record with this attribute, but rather once for the default value of the attribute, and then subsequently only for those records having a value for this attribute varying from the default value. This greatly decreases the amount of time it takes for the EM algorithm to be run.
Thus, embodiments of the invention allow for use of the EM algorithm to cluster a sparse data set, without examining the value of every attribute of every record. In instances where each record contains, on average, only a small set of attributes that are not in their default state, embodiments of the invention can yield great decreases in run time.
The invention includes computer-implemented methods, machine-readable media, computerized systems, and computers of varying scopes. Other aspects, embodiments and advantages of the invention, beyond those described here, will become apparent by reading the detailed description and with reference to the drawings.
REFERENCES:
patent: 5603022 (1997-02-01), Ng et al.
patent: 5704017 (1997-12-01), Heckerman et al.
patent: 5832182 (1998-11-01), Zhang et al.
patent: 5977889 (1999-11-01), Cohen
patent: 6032146 (2000-02-01), Chadha et al.
patent: 6360224 (2002-03-01), Chickering
patent: 0 789 309 (1997-08-01), None
International Search Report, PCT/US00/10769 published Nov. 2, 2000.*
Baker et al, “Distributional Clustering of Words for Text Classification”, Annual ACM Conference on Research and Development in Information Retrieval, pp. 96-103 (Aug. 1998).*
Goil et al, “High Performance Multidimensional Analysis of Large Datasets”, Proceeding of the ACM First International Workshop on Data Warehousing and OLAP, pp. 34-39 (Aug. 1998).*
Apté et al, “Automated Learning of Decision Rules for Text Categorization”, ACM Transactions on Information Systems, vol. 12 No. 3, pp. 233-251 (Jul. 1994).*
Linoff et al, “Compression of Indexes with Full Positional Information in Very Large Text Databases”, Proceedings of the 16th Annual International ACM Conference on Research and Development in Information Retrieval, pp. 88-95 (Jun. 1993).*
Database Inspec 'Online! Institution of Electrical Engineers, Stevenage, GB; Inspec No. AN6294241, XP002145527 abstract & Meila M et al.: “An experimental comparison of several clustering and initialization methods” Proc 14THConf on Uncertainty in AI, Jul. 24, 1998 through Jul. 26, 1998, pp. 386-395, Madison, WI, USA.
U.S. patent application Ser. No. 08/802,759, filed Jul. 30, 1997, unissued, “Belief Networks” pending ap.
Amin & Turocy LLP
Broda Samuel
Microsoft Corporation
LandOfFree
Fast clustering with sparse data does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Fast clustering with sparse data, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Fast clustering with sparse data will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3090627