Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-03-08
2003-07-01
Rones, Charles (Department: 2175)
C707S793000
Reexamination Certificate
active
06587848
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to methods and apparatus for performing similarity searches in documents and, more particularly, to performing such searches based on affinity lists.
BACKGROUND OF THE INVENTION
In recent years, the importance of performing interactive searches on large collections of text has increased considerably because of the rapid increase in the size of the World Wide Web. Many search engines, such as Yahoo! (http://www.yahoo.com), Lycos (http://www.lycos.com) and AltaVista (http://www.altavista.com), are available which find the closest sets of matches to the keywords specified by the user. For such applications, the documents searched are represented in the form of an inverted index, see, e.g., Salton G., McGill M. J., "Introduction to Modern Information Retrieval," McGraw Hill, New York, 1983. Other access methods, such as signature files, exist (see, e.g., Faloutsos C., "Access Methods for Text," ACM Computing Surveys, 17(1), March 1985; Frakes W. B., Baeza-Yates R. (editors), "Information Retrieval: Data Structures and Algorithms," Prentice Hall PTR, Upper Saddle River, N.J., 1992), though the inverted representation seems to have become the method of choice in the information retrieval domain. The inverted representation consists of lists of document identifiers, one for each word in the lexicon. For each word w, its list contains the identifiers of all documents that contain that word. In addition, meta-information on word frequency, position, or document length may be stored along with each identifier. For each user query, it suffices to examine the document identifiers in the inverted lists corresponding to the words in the query (or target).
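As a concrete illustration of the inverted representation described above, the following minimal sketch builds one postings list per word over a toy corpus and answers a keyword query by examining only the lists of the query words. The corpus, the frequency meta-information, and the simple overlap score are illustrative assumptions for this sketch, not details taken from the patent.

from collections import defaultdict

# Toy corpus: document identifier -> text (illustrative values only).
docs = {
    1: "the cat sat on the mat",
    2: "the jaguar is a large wild cat",
    3: "the jaguar is a classic automobile",
}

# Inverted representation: one postings list per word in the lexicon.
# Each entry maps a document identifier to the word frequency, i.e. the
# meta-information stored along with each identifier.
inverted = defaultdict(dict)
for doc_id, text in docs.items():
    for word in text.split():
        inverted[word][doc_id] = inverted[word].get(doc_id, 0) + 1

def keyword_query(words):
    # Only the inverted lists of the query words are examined;
    # documents are ranked by the number of matching query terms.
    scores = defaultdict(int)
    for w in words:
        for doc_id in inverted.get(w, {}):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda item: -item[1])

print(keyword_query(["jaguar", "cat"]))  # document 2 matches both terms and ranks first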
Considerable correlations between words exist because of synonymy and different descriptions of the same underlying latent concepts. Thus, two documents containing very different vocabulary could be similar in subject material. Similarly, two documents sharing considerable vocabulary could be topically very different. When such methods are applied to search engines (a special application of similarity search in which the target document contains very few words), this problem is observed in the form of retrieval incompleteness and inaccuracy. For example, when querying on cats, one may miss documents describing the feline species that do not explicitly contain the word "cat." Methods exist for query expansion via ad hoc feedback, relevance feedback, or automatic expansion, see, e.g., Hearst M. A., "Improving Full-Text Precision on Short Queries Using Simple Constraints," Proceedings of the Symposium on Document Analysis and Information Retrieval, April 1996; Mitra M., Singhal A., Buckley C., "Improving Automatic Query Expansion," Proceedings of the ACM SIGIR Conference, 1998, pages 206-214; Rocchio J. J., "Relevance Feedback in Information Retrieval," Journal of the American Society for Information Science, 34(4), 262-280; Salton G., Buckley C., "Improving Retrieval Performance by Relevance Feedback," Journal of the American Society for Information Science, 41(4):288-297, 1990; and Xu J., Croft W. B., "Query Expansion Using Local and Global Document Analysis," Proceedings of the ACM SIGIR Conference, 1996. Another well-known problem is that of polysemy, in which the same word could refer to multiple concepts in the description. For example, the word "jaguar" could refer to an automobile, or it could refer to the cat. Clearly, the ambiguity of the term can be resolved only by viewing it in the context of the other terms in the document.
A well-known method for improving the quality of a similarity search in text is Latent Semantic Indexing (LSI), in which the data is transformed into a new concept space, see, e.g., Dumais S., Furnas G., Landauer T., Deerwester S., "Using Latent Semantic Indexing to Improve Information Retrieval," Proceedings of ACM SIGCHI, 1988, pages 281-285. This concept space depends upon the document collection in question, since different collections have different sets of concepts. Latent semantic indexing attempts to capture this hidden concept structure using techniques from linear algebra. The idea in LSI is to project the data into a small subspace of the original space such that the noise effects of synonymy and polysemy are removed. A more detailed description may be found in, e.g., Kleinberg J., Tomkins A., "Applications of Linear Algebra in Information Retrieval and Hypertext Analysis," Proceedings of the ACM SIGMOD Conference, 1999.
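The projection into a concept subspace can be sketched with a truncated singular value decomposition of a term-document matrix. The tiny matrix, the rank k, and the fold-in of the query below are illustrative assumptions chosen for brevity; they are one common way to realize the projection, not the specific construction used by the patent.

import numpy as np

# Term-document matrix: rows are terms, columns are documents.
# Entries are term frequencies; the values are illustrative only.
terms = ["cat", "feline", "mat", "automobile"]
A = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])

# A truncated SVD keeps the k strongest latent concepts and discards the
# remainder, i.e. the projection into a small subspace described above.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_concepts = (np.diag(s[:k]) @ Vt[:k]).T   # one row per document in concept space

def fold_in(term_freq_vector):
    # Map a query, given as a term-frequency vector, into the same concept space.
    return term_freq_vector @ U[:, :k]

q = fold_in(np.array([1.0, 0.0, 0.0, 0.0]))  # query containing only "cat"
sims = doc_concepts @ q / (
    np.linalg.norm(doc_concepts, axis=1) * np.linalg.norm(q) + 1e-12
)
print(sims)  # cosine similarity of each document to the query in concept space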
LSI transforms the data from the sparse representation of very high overall dimensionality (which can be indexed with the inverted index) to a representation in real-valued space which is no longer sparse. Even though the new representation is of much lower overall dimensionality (typically about 200 dimensions are needed to represent the concept space), it is still beyond the capacity of spatial indexing structures, such as R-Trees, to handle effectively. R-Trees are discussed in, e.g., Guttman, A., "R-Trees: A Dynamic Index Structure for Spatial Searching," Proceedings of the ACM SIGMOD Conference, 47-57, 1984.
Most of the existing art on improving search engine performance has focused on the modification of user queries in order to improve the quality of the results. These techniques generally fall into two broad categories: (i) global document analysis, and (ii) local document analysis, see, e.g., Xu J., Croft W. B., "Query Expansion Using Local and Global Document Analysis," Proceedings of the ACM SIGIR Conference, 1996. In global analysis, the relationships of the words in the document collection are used in order to expand the queries, see, e.g., Crouch C. J., Yang B., "Experiments in Automatic Statistical Thesaurus Construction," Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval, pages 77-88, 1992; Jing Y., "Automatic Construction of Association Thesaurus for Information Retrieval," Integr. Study Artificial Intelligence and Cognitive Science Appl. Epistemol. (Belgium), Vol. 15, No. 1-2, pp. 7-34, 1998; Qiu Y., Frei H. P., "Concept Based Query Expansion," Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 160-169, 1993; and Voorhees E., "Query Expansion Using Lexical Semantic Relations," Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval, pages 61-69. The more popular method for query expansion is local analysis, in which the top-ranked documents returned by the original query are assumed relevant, see, e.g., Buckley C., Mitra M., Walz J., Cardie C., "Using Clustering and SuperConcepts Within Smart: TREC 6," Proceedings of the TREC Conference, 1998; Mitra M., Singhal A., Buckley C., "Improving Automatic Query Expansion," Proceedings of the ACM SIGIR Conference, 1998, pages 206-214; Rocchio J. J., "Relevance Feedback in Information Retrieval," Journal of the American Society for Information Science, 34(4), 262-280; Salton G., Buckley C., "Improving Retrieval Performance by Relevance Feedback," Journal of the American Society for Information Science, 41(4):288-297, 1990. Terms from these returned documents are then used in order to expand the user queries. In addition, Boolean filters may be used in order to decide the relevance of the documents in the returned results, see, e.g., Hearst M. A., "Improving Full-Text Precision on Short Queries Using Simple Constraints," Proceedings of the Symposium on Document Analysis and Information Retrieval, April 1996. These techniques are relevant only to short user queries, because for very large target documents the subject can be inferred only from the overall distribution of words in the document, and the initial query itself may contain a considerable number of words which are unrelated to the document content. Furthermore, as will be seen, the inverted representation is not suitable, from a performance perspective, for large document targets.
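The local-analysis style of expansion referenced above can be sketched with the standard Rocchio formulation, in which the query vector is moved toward the centroid of the top-ranked documents that are assumed relevant. The weights alpha and beta, the toy vocabulary, and the example vectors below are illustrative assumptions, not parameters from the patent.

import numpy as np

def rocchio_expand(query_vec, top_doc_vecs, alpha=1.0, beta=0.75):
    # Pseudo-relevance feedback: the top-ranked documents returned by the
    # original query are assumed relevant, and their centroid is blended
    # into the query so that related terms receive nonzero weight.
    centroid = np.mean(top_doc_vecs, axis=0)
    return alpha * query_vec + beta * centroid

# Illustrative term-frequency vectors over the vocabulary
# ["cat", "feline", "mat", "automobile"].
query = np.array([1.0, 0.0, 0.0, 0.0])      # original short query: "cat"
top_docs = np.array([
    [2.0, 1.0, 1.0, 0.0],                   # top-ranked documents returned
    [1.0, 1.0, 0.0, 0.0],                   # by the original query
])

expanded = rocchio_expand(query, top_docs)
print(expanded)  # the expanded query now also weights "feline" and "mat"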
Aggarwal Charu C.
Yu Philip Shi-Lung
Rones Charles
Ryan, Mason & Lewis, LLP
Zarick Gail H.