Concept decomposition using clustering

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate


C707S793000

active

06560597

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates generally to computer document retrieval systems and, more particularly, to computer document content search and classification.
2. Description of the Related Art
Large sets of computer-readable text documents are now increasingly common. For example, the “World-Wide-Web” presently contains over 300 million pages and is growing rapidly. The Internet-based “Patent Server” document service of the International Business Machines Corporation (“IBM Corp.”) includes more than 20 million patents. The database of the Lexis-Nexis division of Reed Elsevier, Inc. contains more than 1 billion documents. Furthermore, an immense amount of text data exists on private corporate intranets, in archives of media companies, and in scientific and technical publishing houses.
Applying efficient search techniques to such collections of text data is of great practical interest. Search engine websites on the Internet are among the most popular sites on the World Wide Web. The IBM Corp. Patent Server and the Lexis-Nexis system mentioned above both incorporate text search and retrieval mechanisms. Document search and retrieval systems are widely deployed within business organizations, typically accessible through corporate intranets. The search techniques being used for these systems include machine learning and statistical algorithms such as clustering, classification, principal component analysis, and discriminant analysis.
A starting point for applying such search techniques to unstructured text data is to create a vector space model for the text. This involves representing documents as vectors in a word-by-document matrix whose rows are words and whose columns are documents. Thus, a large collection of documents in a vector space representation can easily comprise a matrix on the order of 100,000 rows by 100,000 columns. One methodology for such representations is first (a) to extract unique content-bearing words from the set of documents and treat these words as features, and then (b) to represent each document as a vector of certain weighted word frequencies in this feature space. Even a moderately sized set of documents typically contains several thousand distinct words, so the document vectors are very high-dimensional. Nevertheless, an individual document typically contains only a small fraction of those words, often from 1% to 5% of the collection's total vocabulary or less. Hence, the document vectors are also very sparse. Understanding and exploiting the structure of such vector space models is a major contemporary scientific and technological challenge.
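The construction in steps (a) and (b) can be sketched as follows. The tiny corpus, whitespace tokenizer, and raw term-frequency weighting are illustrative assumptions (the specification speaks only of "certain weighted word frequencies"), and the vectors are stored one per document rather than as matrix columns for simplicity:

```python
import math
from collections import Counter

# Illustrative toy corpus (assumption, not from the specification).
docs = [
    "clustering groups similar documents",
    "decomposition approximates the word matrix",
    "document clustering finds latent concepts",
]

tokenized = [d.split() for d in docs]
# Step (a): extract the unique words and treat them as features.
vocab = sorted({w for toks in tokenized for w in toks})

def doc_vector(tokens):
    """Step (b): term-frequency vector over the feature space, scaled to unit L2 norm."""
    counts = Counter(tokens)
    v = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

vectors = [doc_vector(t) for t in tokenized]
```

Even on this toy corpus, each vector is mostly zeros, which illustrates the sparsity the passage describes.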
In a text search and retrieval context, it has been proposed to use latent semantic indexing (LSI), which uses truncated singular value decomposition (SVD) or principal component analysis to discover “latent” relationships between correlated words and documents. One may interpret LSI as a matrix approximation scheme. Based on earlier work for image compression, a memory efficient matrix approximation scheme known as semi-discrete decomposition has been developed. Others have used an “implicit” matrix approximation scheme based on context vectors. Still others have proposed computationally efficient matrix approximations based on random projections.
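The truncated-SVD step at the heart of LSI can be sketched with NumPy; the small word-by-document matrix is assumed data, and k=2 is an arbitrary choice of truncation rank:

```python
import numpy as np

# Toy word-by-document matrix: rows = words, columns = documents (assumed data).
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

# Full SVD, then keep only the top-k singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation (Frobenius norm)
```

Rows and columns of the truncated factors give the reduced-dimensional word and document representations in which LSI looks for "latent" correlations.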
Clustering has been used to discover “latent concepts” in sets of unstructured text documents, and to summarize and label such collections. Various classical clustering algorithms are known, such as the “k-means” algorithm and its variants, hierarchical agglomerative clustering, and graph-theoretic methods. Clustering is a way of processing the vector space of a document collection to discover similar documents, based on the fact that documents related to similar topics tend to cluster in the document vector space. A categorization or classification of the documents can therefore take place through clustering techniques.
The k-means algorithm assumes that the document vectors have been normalized to have unit L2 norm; that is, the documents can be thought of as points on a high-dimensional unit sphere. Such normalization mitigates the effect of differing document lengths, so that the document vectors are differentiated solely on the basis of their word content. It is known to measure "similarity" between such vectors by the inner product between them, a measure known as cosine similarity. The more closely located any two document vectors are to each other, the more closely related are the documents. The results of the k-means technique applied to a collection of documents can be valuable in locating desired documents and discovering latent content similarities. The processing involved, however, can require relatively large matrix operations. As noted above, the document vectors can easily form a matrix upwards of 100K×100K in size.
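The cosine-similarity measure reduces to a plain inner product once the vectors are unit-normalized, as a small sketch (with made-up three-dimensional "document" vectors) shows:

```python
import math

def unit(v):
    """Scale a vector to unit L2 norm, placing it on the unit sphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(u, v):
    """For unit-norm vectors, the inner product *is* the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))

d1 = unit([3.0, 1.0, 0.0])
d2 = unit([2.0, 1.0, 0.0])  # similar word content to d1
d3 = unit([0.0, 0.0, 5.0])  # disjoint word content
```

Here `cosine(d1, d2)` is close to 1 while `cosine(d1, d3)` is 0, mirroring how documents on related topics cluster together on the sphere.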
From the discussion above, it should be apparent that there is a need for a text search and retrieval technique that more efficiently implements clustering to identify similar text documents in a collection. The present invention fulfills this need.
SUMMARY OF THE INVENTION
The present invention provides a system and method in which documents of a collection are represented as normalized document vectors of word frequencies. The document vector space is then partitioned into a set of disjoint clusters, and a concept vector is computed for each partition, the concept vector comprising the normalized mean vector of all the documents in that partition. Documents are then reassigned to the cluster having their closest concept vector, and a new set of concept vectors for the new partitioning is computed. From an initial partitioning, the concept vectors are iteratively recomputed until a stopping threshold is reached, yielding a concept vector subspace of the document vector space. The documents can then be projected onto the concept vector subspace and represented as linear combinations of the concept vectors, thereby reducing the dimensionality of the document space. The reduced dimensionality provides a more efficient representation of documents as linear combinations of concept vectors, and permits simpler searching and categorization of the documents. In this way, the invention provides a text search and retrieval technique that more efficiently implements clustering to identify similar text documents in a collection.
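The iteration described above can be sketched as a spherical k-means variant: assign each unit-norm document vector to its nearest concept vector by cosine similarity, then recompute each concept vector as the normalized mean of its cluster. The random toy data, the choice k=3, and the exact-convergence stopping test are illustrative assumptions, not details from the specification:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((20, 6))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit L2 norm

k = 3
# Initial partitioning: k randomly chosen documents as concept vectors.
concepts = docs[rng.choice(len(docs), k, replace=False)]

for _ in range(20):
    sims = docs @ concepts.T            # cosine similarities (inner products)
    assign = sims.argmax(axis=1)        # reassign each doc to its closest concept
    new_concepts = concepts.copy()
    for j in range(k):
        members = docs[assign == j]
        if len(members):
            m = members.mean(axis=0)
            new_concepts[j] = m / np.linalg.norm(m)  # normalized mean ("centroid")
    if np.allclose(new_concepts, concepts):          # stand-in for the stopping threshold
        break
    concepts = new_concepts
```

The rows of `concepts` then span the concept vector subspace onto which documents can be projected.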
In one aspect of the invention, a disjoint clustering partition of the document vectors is computed, and a "normalized centroid," referred to as the concept vector, is computed for each of these clusters. The cluster concept vectors are iteratively computed to a predetermined threshold, forming a set of concept vectors that define a vector subspace of the document vector space. Any document in the document vector space can then be represented as a linear combination in the concept vector subspace. The invention can exploit the sparsity of the text data, can be efficiently parallelized, and converges quickly (to a local maximum). Moreover, from a statistical perspective, the invention generates concept vectors that serve as a "model" which may be used to classify future documents.
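Representing a document as a linear combination in the concept vector subspace amounts to a least-squares projection onto the span of the concept vectors. A sketch with assumed, already-computed concept vectors:

```python
import numpy as np

# Assumed concept vectors (rows), as would result from the clustering step.
concepts = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
])
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

# A unit-norm document vector to be expressed in the concept subspace.
doc = np.array([2.0, 1.0, 1.0])
doc /= np.linalg.norm(doc)

# Solve min ||C^T c - doc|| for the combination coefficients c.
coeffs, *_ = np.linalg.lstsq(concepts.T, doc, rcond=None)
approx = concepts.T @ coeffs  # the document's projection onto the concept subspace
```

Each document is thereby summarized by k coefficients instead of its full word-frequency vector, which is the dimensionality reduction the summary describes.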
Other features and advantages of the present invention should be apparent from the following description of the preferred embodiments, which illustrate, by way of example, the principles of the invention.


REFERENCES:
patent: 4674028 (1987-06-01), Shioya et al.
patent: 5317507 (1994-05-01), Gallant
patent: 5692100 (1997-11-01), Tsuboka et al.
patent: 5692107 (1997-11-01), Simoudis et al.
patent: 5748116 (1998-05-01), Chui et al.
patent: 5787274 (1998-07-01), Agrawal et al.
patent: 5787422 (1998-07-01), Tukey et al.
patent: 5794235 (1998-08-01), Chess
patent: 5886651 (1999-03-01), Chui et al.
patent: 5893100 (1999-04-01), Chui et al.
patent: 5999927 (1999-12-01), Tukey et al.
patent: 6356898 (2002-03-01), Cohen et al.
patent: 6360227 (2002-03-01), Aggarwal et al.
Drineas et al., "Clustering in large graphs and matrices," SODA
