Methods and apparatus for similarity text search based on...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

[ 0.00 ] – not rated yet Voters 0 Comments 0

Details Methods and apparatus for similarity text search based on... Methods and apparatus for similarity text search based on...

: 2000-01-28
: 2003-04-01
: Vu, Kim (Department: 2172)
: Data processing: database and file management or data structures
: Database design
: Data structure types

: C707S793000, C707S793000, C707S793000, C707S793000, C707S793000
: Reexamination Certificate
: active
: 06542889
: ABSTRACT:

FIELD OF THE INVENTION
The present invention is related to methods and apparatus for performing similarity searches in documents and, more particularly, to performing such searches based on conceptual indexing.
BACKGROUND OF THE INVENTION
In recent years, the large amounts of data available on the world wide web has made-effective similarity search and retrieval an important issue. A similarity search has uses in many web applications such as search engines or in providing close matches for user queries. A related area is that of document-to-document queries, in which the target is an entire document, as opposed to a small number of words. Such a system has considerable use in recommender systems for a library or web application, in which it is desirable to find the closest matches to a document which is currently being browsed.
The similarity search has proven to pose an interesting problem in the text domain because of the unusually large size of dimensionality associated with the documents as compared to the size of the documents. For example, a typical document on the world wide web contains an average of about 60 to 80 words, out of a lexicon of about 100,000 words. Considerable correlations between words exist because of synonymity and different descriptions of the same underlying latent concepts. Thus, two documents containing very different vocabulary could be similar in subject material. While applying the typical similarity search method to search engines (which are a special case, in which the target document contains very few words), this problem is also observed. For example, while querying on cats, one may miss documents containing,a description on the feline species, which does not explicitly contain the word “cat.” Methods exist for query expansion by including synonyms, though these methods are strictly applicable to the short and specific queries of search engines, and are too inefficient and inaccurate to generalize to document-to-document similarity queries, in which the target is itself a document containing a multitude of concepts and subjects.
Another well known problem is that of polysemy, in which the same word could refer to multiple concepts in the description. For example, the word “virus” could refer to a computer virus, or to a biological virus. Clearly, the ambiguity of the term can be resolved only by using it in the context of other terms in the document.
A well known method in the prior art for improving the quality of similarity search in text is called Latent Semantic Indexing (LSI), in which the data is transformed into a new space in which each dimension corresponds to a concept in the underlying data. This concept space depends upon the document collection in question, since different collections would have different sets of concepts. LSI is a technique which tries to capture this hidden structure using techniques from linear algebra. The idea in LSI is to project the data into a small subspace of the original data such that the noise effects, redundancies, and ambiguities are largely removed. The LSI approach is discussed in Kleinberg J., Tomkins A., “Applications of linear algebra in information retrieval and hypertext analysis,” Proceedings of the ACM SIGMOD Conference, 1999, the disclosure of which is incorporated by reference herein.
Note that LSI transforms the data from a sparse indexable representation, using an inverted index, to a representation in the real space which is no longer sparse. As is well known, the inverted index is an index in which a sparse database is indexed using its attributes. Even though the new representation is of much lower dimensionality (typically about 200 or so dimensions are needed to represent the concept space), this dimensionality is beyond the capacity of indexing structures such as R-Trees to handle. R-Trees are discussed in Guttman, A., “R-Trees: A Dynamic Index Structure for Spatial Searching,” Proceedings of the ACM SIGMOD Conference, 47-57, 1984. Thus, we have a difficult situation: if the data is represented in the original format using the inverted index, it is basically useless for performing a high quality document-to-document similarity search; on the other hand, when the data is transformed using latent semantic indexing, we have a data set which cannot be indexed effectively. Thus, if we have a very large collection of documents, we would either be reduced to using a sequential scan in order to perform conceptual similarity search, or have to do with lower quality search results using the original representation and ignoring the problems of synonymy and polysemy.
Another limitation of the prior art in performing document-to-document queries effectively is that a large fraction of the words in the document are unrelated to the general subject of the page. These words increase the noise effects associated with the target document, and reduce the likelihood of high quality search results. This is a problem that even latent semantic indexing cannot resolve very effectively.
Accordingly, a need exists for methods and apparatus for providing a high quality similarity search and a relatively low cost indexing methodology for use, for example, in accordance with document-to-document search applications.
SUMMARY OF THE INVENTION
The present invention provides methods and apparatus for performing similarity search in documents. A framework for providing conceptual similarity between documents is provided using clustering. The methodology of the invention also provides effective indexing techniques so that a similarity search can be performed on large collections of documents by accessing only a small percentage of the data. We demonstrate that our methodology performs substantially better than the standard method of using similarity search on the inverted index both in terms of quality and search efficiency.
In one aspect of the invention, a method of performing a conceptual similarity search comprises the steps of: generating one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search; building a conceptual index of documents with the one or more word-chains; and evaluating a similarity query using the conceptual index. As will be explained in detail below, the present invention provides for the creation of a concept space comprising a new axis system which is defined by a set of these word-chains. Each word-chain comprises a set of closely related words (or topical vocabulary) associated with a given cluster, and the conceptual strength of document with respect to a word-chain defines how closely that document matches this vocabulary.
The word-chain generating step may comprise the steps of: initializing one or more word-chains to one or more sets of randomly selected documents; assigning one or more other documents to the one or more sets of randomly selected documents; concatenating the one or more documents in each set and removing less frequently occurring words from each word-chain; and merging the word-chains. The initializing, assigning, concatenating and merging steps are then iteratively repeated to generate a final set of word-chains.
The index building step may comprise the steps of: for each word-chain, finding the one or more documents with conceptual similarity to the word-chain; and retaining a list of identities of the one or more documents which have conceptual similarity not less than a predefined threshold value. The document identity preferably comprises a unique integer value.
The evaluating step preferably returns one or more of the closest documents resulting from the search; one or more matching word-chains in the one or more documents; and one or more matching topical words of the one or more documents. Particularly, the evaluating step may comprise the step of generating a conceptual representation of a target document associated with the similarity query. In one embodiment, the target document conceptual representation generating step comprises the steps of: calculating a similarity measure between the target documen

Affiliated with

Aggarwal Charu C.

Inventor

[ 0.00 ] – not rated yet Voters 0 Comments 0

Yu Philip Shi-Lung

Inventor

[ 0.00 ] – not rated yet Voters 0 Comments 0

Also associated with

Ryan & Mason & Lewis, LLP

Law Firm

[ 0.00 ] – not rated yet Voters 0 Comments 0

Troung Cam-Y

Examiner

[ 0.00 ] – not rated yet Voters 0 Comments 0

Vu Kim

Examiner

[ 0.00 ] – not rated yet Voters 0 Comments 0

Zarick Gail H.

Attorney

[ 0.00 ] – not rated yet Voters 0 Comments 0

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Methods and apparatus for similarity text search based on... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Methods and apparatus for similarity text search based on..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Methods and apparatus for similarity text search based on... will most certainly appreciate the feedback.

Rate now

Comments { 0 }

Profile ID: LFUS-PAI-O-3027904

All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.

Canada

Charities
Companies
MP Candidates
Patents
Employee Salary Disclosure

World

Places of the World
Scientific Papers

United States

Banks
Companies
Counties
Patents
Employee Salary Disclosure