Inverse inference engine for high performance web search

Data processing: speech signal processing – linguistics – language – Linguistics – Natural language

Reexamination Certificate

Details

C707S793000

active

06510406

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention relates generally to computer-based information retrieval, and more particularly to a system and method for searching databases of electronic text.
The commercial potential for information retrieval systems that can query unstructured text or multimedia collections with high speed and precision is enormous. In order to fulfill their potential, collaborative knowledge-based systems like the World Wide Web (WWW) must go several steps beyond digital libraries in terms of information retrieval technology. In order to do so, unstructured and heterogeneous bodies of information must be transformed into intelligent databases, capable of supporting decision making and timely information exchange. The dynamic and often decentralized nature of a knowledge-sharing environment requires constant checking and comparison of the information content of multiple databases. Incoming information may be up-to-date, out-of-date, complementary, contradictory or redundant with respect to existing database entries. Further, in a dynamic document environment, it is often necessary to update indices and change or eliminate dead links. Moreover, it may be desirable to determine conceptual trends in a document set at a particular time. Additionally, it can be useful to compare the current document set to some earlier document set in a variety of ways.
As is generally known, information retrieval is the process of comparing document content with information need. Currently, most commercially available information retrieval engines are based on two simple but robust metrics: exact matching or the vector space model. In response to an input query, exact-match systems partition the set of documents in the collection into those documents that match the query and those that do not. The logic used in exact-match systems typically involves Boolean operators, and accordingly is very rigid: the presence or absence of a single term in a document is sufficient for retrieval or rejection of that document. In its simplest form, the exact-match model does not incorporate term weights. The exact-match model generally assumes that all documents containing the exact term(s) found in the query are equally useful. Information retrieval researchers have proposed various revisions and extensions to the basic exact-match model. In particular, the “fuzzy-set” retrieval model (Lopresti and Zhou, 1996, No. 21 in Appendix A) introduces term weights so that documents can be ranked in decreasing order relative to the frequency of occurrence of those weighted terms.
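The rigidity of exact matching can be illustrated with a minimal sketch. The toy collection, the whitespace tokenizer, and the function names below are assumptions made for illustration only; they are not taken from the patent or from any particular retrieval engine.

```python
# Toy collection; the documents and queries are invented for illustration.
docs = {
    "d1": "inverse inference engine for web search",
    "d2": "vector space model for document retrieval",
    "d3": "boolean search over web documents",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def boolean_and(query_terms, collection):
    """Partition a collection into (matching, non_matching) for an AND query."""
    matching, non_matching = [], []
    for doc_id, text in collection.items():
        terms = tokenize(text)
        # A single missing term is enough to reject the document.
        if all(term in terms for term in query_terms):
            matching.append(doc_id)
        else:
            non_matching.append(doc_id)
    return matching, non_matching


matching, non_matching = boolean_and(["web", "search"], docs)
# Changing one term to a close variant rejects every document.
strict_matching, _ = boolean_and(["web", "searching"], docs)
```

In this sketch all matching documents are returned as an unordered set, with no notion of one being more useful than another, which is the limitation that weighted models try to address.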
The vector space model (Salton, 1983, No. 30 in Appendix A) views documents and queries as vectors in a high-dimensional vector space, where each dimension corresponds to a possible document feature. The vector elements may be binary, as in the exact-match model, but they are usually taken to be term weights which assign “importance” values to the terms within the query or document. The term weights are usually normalized. The similarity between a given query and a document to which it is compared is derived from the distance between the query and document vectors. The cosine similarity measure is used most frequently for this purpose. It is the normalized inner product of the two weight vectors:
$$\cos(q, D_i) = \frac{w_q \cdot w_{d_i}}{\lVert w_q \rVert \, \lVert w_{d_i} \rVert} = \frac{\sum_{j=1}^{p} w_{q_j} w_{d_{ij}}}{\sqrt{\sum_{j=1}^{p} w_{q_j}^{2}} \, \sqrt{\sum_{j=1}^{p} w_{d_{ij}}^{2}}}$$
where $q$ is the input query, $D_i$ is a column in the term-document matrix, $w_{q_j}$ is the weight assigned to term $j$ in the query, $w_{d_{ij}}$ is the weight assigned to term $j$ in document $i$, and $p$ is the number of terms. This similarity function gives a value of 0 when the document and query have no terms in common and a value of 1 when their vectors are identical. The vector space model ranks the documents based on their “closeness” to a query. The disadvantages of the vector space model are the assumed independence of the terms and the lack of a theoretical justification for the use of the cosine metric to measure similarity. Notice, in particular, that for normalized weights the cosine measure is 1 only if $w_{q_j} = w_{d_{ij}}$ for every term $j$. This is very unlikely to happen in any search, however, because of the different meanings that the weights $w$ often assume in the contexts of a query and a document index. In fact, the weights in the document vector are an expression of some statistical measure, like the absolute frequency of occurrence of each term within a document, whereas the weights in the query vector reflect the relative importance of the terms in the query, as perceived by the user.
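As a rough illustration of the cosine measure and its boundary values, the sketch below computes it for small hand-made weight vectors. The vectors, their four-term vocabulary, and the function name `cosine` are assumptions chosen for the example, not values from the patent.

```python
import math


def cosine(w_q, w_d):
    """Cosine of the angle between a query and a document weight vector."""
    dot = sum(a * b for a, b in zip(w_q, w_d))
    norm_q = math.sqrt(sum(a * a for a in w_q))
    norm_d = math.sqrt(sum(b * b for b in w_d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention for an empty vector
    return dot / (norm_q * norm_d)


# Query weights: importance of each of four terms as perceived by the user.
w_q = [1.0, 0.5, 0.0, 0.0]

w_d_identical = [1.0, 0.5, 0.0, 0.0]  # same weights as the query
w_d_disjoint = [0.0, 0.0, 3.0, 2.0]   # no terms in common with the query
w_d_counts = [4.0, 1.0, 2.0, 0.0]     # raw term frequencies in a document

sim_identical = cosine(w_q, w_d_identical)
sim_disjoint = cosine(w_q, w_d_disjoint)
sim_counts = cosine(w_q, w_d_counts)
```

Here `sim_identical` is 1 up to floating-point rounding and `sim_disjoint` is 0. `sim_counts` falls strictly between the two even though the document contains both query terms, because frequency-based document weights and importance-based query weights are on different footings.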
For any given search query, the document that is in fact the best match for the actual information needs of the user may employ synonyms for key concepts, instead of the specific keywords entered by the user. This problem of “synonymy” may result in a low similarity measure between the search query and the best-match article when the cosine metric is used. Further, terms in the search query may have meanings in the context of the query that are unrelated to their meanings within individual documents being searched. This problem of “polysemy” may result in relatively high similarity measures for articles that are in fact not relevant to the information needs of the user providing the search query, when the cosine metric is employed.
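Both effects can be reproduced with a small sketch that applies the cosine metric to raw term counts. The queries, the documents, and the helper `cosine_text` are hypothetical examples invented here; the second query is assumed to come from a user interested in the car rather than the animal.

```python
import math
from collections import Counter


def cosine_text(query, document):
    """Cosine similarity between two texts using raw term counts as weights."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[term] * d[term] for term in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


# Synonymy: a relevant document that shares no literal term with the query.
score_synonymy = cosine_text(
    "car repair",
    "automobile maintenance and engine servicing",
)

# Polysemy: an off-topic document that shares an ambiguous term with the query.
score_polysemy = cosine_text(
    "jaguar speed",
    "jaguar habitat in the rainforest",
)
```

The relevant document scores 0 because it uses only synonyms, while the off-topic document receives a positive score purely from the shared word "jaguar".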
Some of the most innovative search engines on the World Wide Web exploit data mining techniques to derive implicit information from link and traffic patterns. For instance, Google and CLEVER analyze the “link matrix” (hyperlink structure) of the Web. In these models, the weight of the result rankings depends on the frequency and authority of the links pointing to a page. Other information retrieval models track users' preferences through collaborative filtering, such as technology provided by Firefly Network, Inc., LikeMinds, Inc., Net Perceptions, Inc., and Alexa Internet, or employ a database of prior relevance judgements, such as technology provided by Ask Jeeves, Inc. The Direct Hit search engine offers a solution based on popularity tracking, and looks superficially like collaborative filtering (Werbach, 1999, No. 34 in Appendix A). Whereas collaborative filtering identifies clusters of associations within groups, Direct Hit passively aggregates implicit user relevance judgements around a topic. The InQuery system (Broglio et al., 1994, No. 8 in Appendix A; Rajashekar and Croft, 1995, No. 29 in Appendix A) uses Bayesian networks to describe how text and queries should be modified to identify relevant documents. InQuery focuses on automatic analysis and enhancement of queries, rather than on in-depth analysis of the documents in the database.
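The idea that a page's rank depends on how many links point to it, and on the standing of the pages those links come from, can be sketched with a generic power iteration over a tiny link graph. This is an illustrative link-analysis scheme only, not the actual algorithm used by Google or CLEVER; the graph, the damping factor of 0.85, and the function name `link_scores` are all assumptions.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Iteratively score pages by the scores of the pages linking to them."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    score = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of its links.
        new_score = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly over its out-links.
                share = damping * score[page] / len(targets)
                for target in targets:
                    new_score[target] += share
            else:
                # A page with no out-links spreads its score over all pages.
                share = damping * score[page] / n
                for other in pages:
                    new_score[other] += share
        score = new_score
    return score


# Hypothetical link graph: page "c" is linked to by "a", "b", and "d".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = link_scores(links)
top_page = max(scores, key=scores.get)
```

In this graph `top_page` is "c", the page with the most inbound links, and "a" comes second because its single inbound link originates from the highest-scoring page. Note that the scores depend only on link structure and say nothing about what the pages mean.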
While many of the above techniques improve search results based on previous users' preferences, none attempts to interpret word meaning or overcome the fundamental problems of synonymy, polysemy and search by concept. These are addressed by expert systems consisting of electronic thesauri and lexical knowledge bases. The design of a lexical knowledge base in existing systems requires the involvement of large teams of experts. It entails manual concept classification, choice of categories, and careful organization of categories into hierarchies (Bateman et al., 1990, No. 3 in Appendix A; Bouad et al., 1995, No. 7 in Appendix A; Guarino, 1997, No. 14 in Appendix A; Lenat and Guha, 1990, No. 20 in Appendix A; Mahesh, 1996, No. 23 in Appendix A; Miller, 1990, No. 25 in Appendix A; Mahesh et al., 1999, No. 24 in Appendix A; Vogel, 1997 and 1998, Nos. 31 and 32 in Appendix A). In addition, lexical knowledge bases require careful tuning and customization to different domains. Because they try to fit a preconceived logical structure to a collection of documents, lexical knowledge bases typically fail to deal effectively with heterogeneous collections such as the Web.
