Model selection in machine learning with applications to...

Data processing: artificial intelligence – Knowledge processing system

Reexamination Certificate

Details

C707S793000

active

06584456

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of Invention
The present invention relates generally to the field of data clustering. More specifically, the present invention is related to model selection for improving document clustering.
2. Discussion of Prior Art
Unsupervised learning is an attempt to determine the intrinsic structure in data and is often viewed as finding clusters in a data set. Clustering is an important tool in the analysis of data with applications in several domains such as psychology, humanities, clinical diagnosis, pattern recognition, information retrieval, etc. Model selection in clustering, that is, how to determine adjustments to a number of model parameters, has proven to be particularly challenging. Therefore, there is clearly a need for a system that performs clustering in different feature spaces.
The following references describe prior art in the field of data clustering. The prior art described below does not, however, relate to the present invention's method of model selection via a unified objective function whose arguments include the feature space and number of clusters.
U.S. Pat. No. 5,819,258 discloses a method and apparatus for automatically generating hierarchical categories from large document collections. Vaithyanathan et al. provide for a top-down document clustering approach wherein clustering is based on extracted features, derived from one or more tokens. U.S. Pat. No. 5,857,179, also by Vaithyanathan et al., provides for a computer method and apparatus for clustering documents and automatic generation of cluster keywords, and further teaches a document represented by an M-dimensional vector wherein the vectors in turn are clustered.
U.S. Pat. No. 5,787,420 provides for a method of ordering document clusters without requiring knowledge of user interests. Tukey et al. teach a document cluster ordering based on similarity between clusters. U.S. Pat. No. 5,787,422, also by Tukey et al., provides for a method and apparatus for information access employing overlapping clusters and suggests document clustering based on a corpus of documents.
U.S. Pat. No. 5,864,855 provides for a parallel document clustering process wherein a document is converted to a vector and compared with clusters.
In addition, U.S. Pat. Nos. 5,873,056, 5,844,991, 5,442,778, 5,483,650, 5,625,767, and 5,808,615 provide general teachings relating to prior art document clustering methods.
An article by Rissanen et al. entitled, “Unsupervised Classification With Stochastic Complexity”, published in the US/Japan Conference on the Frontiers of Statistical Modeling, 1992, discloses that postulating too many parameters leads to overfitting, thereby distorting the density of the underlying data.
An article by Kontkanen et al. entitled, “Comparing Bayesian Model Class Selection Criteria by Discrete Finite Mixtures”, published in the Proceedings of the ISIS '96 Conference, suggests the difficulty in choosing an “optimal” order associated with clustering applications. An article by Smyth entitled, “Clustering Using Monte Carlo Cross-Validation”, published in Knowledge Discovery in Databases, 1996, makes a similar point to the reference by Kontkanen et al.
An article by Ghosh-Roy et al. entitled, “On-line Legal Aid: Markov Chain Model for Efficient Retrieval of Legal Documents”, published in Image and Vision Computing, 1998, teaches data clustering and clustered searching.
An article by Chang et al. entitled, “Integrating Query Expansion and Conceptual Relevance Feedback for Personalized Web Information Retrieval”, suggests keyword extraction for cluster digesting and query expansion.
All the prior art discussed above has addressed model selection from the point of view of estimating the optimal number of clusters. This art fails to consider clustering within different feature spaces. Whatever the precise merits, features, and advantages of the above cited references, none of them achieves or fulfills the purposes of the present invention. They fail to provide for considering the interplay of both the number of clusters and the feature subset in evaluating clustering models. Without this consideration, the prior art also fails to provide an objective method of comparing two models in different feature spaces.
SUMMARY OF THE INVENTION
The present invention provides for a system for model selection in unsupervised learning with applications to document clustering. The current system provides for a better model structure determination by determining both the optimal number of clusters and the optimal feature set.
The problem of model selection to determine both the optimal number of clusters and the optimal feature set is analyzed in a Bayesian statistical estimation framework, and a solution is described via an objective function. Maximizing this objective function corresponds to an optimal model structure. A closed-form expression for a document clustering problem is also developed, along with heuristics that help find the optimum (or at least a sub-optimum) of the objective function in terms of feature sets and the number of clusters.
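The idea of scoring candidate models with a single objective over both the feature set and the number of clusters can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent text: the function names (log_marginal, pooled, score, select_model), the toy documents, and the stand-in objective itself. The stand-in treats "useful" features with one Dirichlet-multinomial per cluster and the remaining "noise" features with one pooled Dirichlet-multinomial, so that every token is accounted for and models in different feature spaces receive comparable scores. It omits any terms for cluster sizes or priors over model structure, and the invention's actual closed-form expression may differ.

```python
from math import lgamma

def log_marginal(counts, vocab_size, alpha=1.0):
    """Log marginal likelihood of a token sequence under a multinomial with a
    symmetric Dirichlet(alpha) prior. Words with zero count contribute 0."""
    if vocab_size == 0:
        return 0.0  # empty feature set: nothing to explain
    total = sum(counts.values())
    out = lgamma(vocab_size * alpha) - lgamma(vocab_size * alpha + total)
    for n in counts.values():
        out += lgamma(alpha + n) - lgamma(alpha)
    return out

def pooled(docs, features):
    """Sum word counts over docs, restricted to the given feature set."""
    counts = {}
    for doc in docs:
        for word, n in doc.items():
            if word in features:
                counts[word] = counts.get(word, 0) + n
    return counts

def score(docs, vocab, useful, assignment, alpha=1.0):
    """Stand-in objective (an assumption): noise features are scored once over
    all documents, useful features are scored separately within each cluster."""
    noise = vocab - useful
    total = log_marginal(pooled(docs, noise), len(noise), alpha)
    for c in set(assignment):
        members = [d for d, z in zip(docs, assignment) if z == c]
        total += log_marginal(pooled(members, useful), len(useful), alpha)
    return total

def select_model(docs, vocab, candidates, alpha=1.0):
    """Return the (useful feature set, cluster assignment) pair with the highest score."""
    return max(candidates, key=lambda m: score(docs, vocab, m[0], m[1], alpha))

# Hypothetical toy data: "a" and "b" separate two groups, "x" appears everywhere.
docs = [{"a": 4, "x": 1}, {"a": 4, "x": 1}, {"b": 4, "x": 1}, {"b": 4, "x": 1}]
vocab = {"a", "b", "x"}
candidates = [
    (frozenset({"a", "b"}), (0, 0, 1, 1)),  # two clusters on the discriminating words
    (frozenset({"a", "b"}), (0, 0, 0, 0)),  # one cluster
    (frozenset({"x"}), (0, 0, 1, 1)),       # two clusters on the uninformative word
    (frozenset({"a", "b"}), (0, 1, 2, 3)),  # one cluster per document
]
best = select_model(docs, vocab, candidates)
```

In this sketch the number of clusters is implied by the distinct labels in each assignment, and the candidate list is enumerated by hand. In practice the candidates would come from a clustering procedure and a feature-selection heuristic, which the sketch does not attempt to reproduce.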


REFERENCES:
patent: 5442778 (1995-08-01), Pedersen et al.
patent: 5483650 (1996-01-01), Pedersen et al.
patent: 5625767 (1997-04-01), Bartell et al.
patent: 5787420 (1998-07-01), Tukey et al.
patent: 5787422 (1998-07-01), Tukey et al.
patent: 5808615 (1998-09-01), Hill et al.
patent: 5819258 (1998-10-01), Vaithyanathan
patent: 5844991 (1998-12-01), Hochberg et al.
patent: 5857179 (1999-01-01), Vaithyanathan
patent: 5864855 (1999-01-01), Ruocco et al.
patent: 5873056 (1999-02-01), Liddy et al.
A. K. Jain; Data Clustering: A Review; Sep. 1999; ACM; Computing Surveys, vol. 31, No. 3; 264-323.
