Reexamination Certificate
2002-10-02
2004-12-07
Corrielus, Jean M. (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000
active
06829599
ABSTRACT:
BACKGROUND OF INVENTION
The present invention relates generally to the field of computer-based information retrieval, and in particular to the field of search engines that facilitate access to data published on the Internet and intranets, specifically meta-search engines that exploit other information sources in order to provide a better answer to user queries.
Web search engines have facilitated the access to data published on the Internet and intranets. However, individual search engines are subject to certain limitations, and this has resulted in the design of meta-searchers that exploit other information sources (including those on the Web) in order to provide a better answer to user queries. Meta-searchers do not have their own document collections; instead, they forward user queries to external information sources in order to retrieve relevant data. They “wrap” the functionality of information sources and employ the corresponding wrappers for interaction with the remote sources.
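The wrapper arrangement described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented system: the names `Summary`, `Wrapper`, `MetaSearcher`, and the stub sources `source_a` and `source_b` are hypothetical, and real wrappers would contact remote sources over the network rather than call local functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Summary:
    """Brief description of one answer item; the full document is NOT available."""
    source: str
    rank: int       # rank assigned by the source itself (its method is hidden)
    title: str
    snippet: str


class Wrapper:
    """Hypothetical wrapper: translates a query and calls one external source."""

    def __init__(self, name: str,
                 translate: Callable[[str], str],
                 fetch: Callable[[str], List[Tuple[str, str]]]):
        self.name = name
        self.translate = translate  # user query -> native query of the source
        self.fetch = fetch          # native query -> [(title, snippet), ...]

    def search(self, query: str) -> List[Summary]:
        native = self.translate(query)
        return [Summary(self.name, i + 1, title, snippet)
                for i, (title, snippet) in enumerate(self.fetch(native))]


class MetaSearcher:
    """Holds no document collection of its own; only forwards queries."""

    def __init__(self, wrappers: List[Wrapper]):
        self.wrappers = wrappers

    def search(self, query: str) -> List[Summary]:
        results: List[Summary] = []
        for wrapper in self.wrappers:
            results.extend(wrapper.search(query))
        return results


# Stub sources standing in for remote search engines (assumed data).
def source_a(native_query: str) -> List[Tuple[str, str]]:
    return [("Ranking in meta-search", "short summary..."),
            ("Query translation", "short summary...")]


def source_b(native_query: str) -> List[Tuple[str, str]]:
    return [("Relevance feedback", "short summary...")]


meta = MetaSearcher([
    Wrapper("A", lambda q: q.upper(), source_a),        # pretend A wants upper case
    Wrapper("B", lambda q: q.replace(" ", "+"), source_b),
])
answers = meta.search("meta search")
```

Note that `answers` carries only per-source ranks and short summaries, which is the situation the following paragraphs discuss.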
In ordinary search engines, user relevance feedback mechanisms exist, but these mechanisms assume full access to document content. A meta-searcher, however, typically receives only brief document descriptions, not the full content of documents.
In fact, certain conventional information sources (such as Library of Congress, ACM Digital Library, AltaVista, etc.) offer a “Find similar items” feature, whereby the user, after sending a conventional query, can request all items “similar” to a given one, selected by the user from the query answer list. Generally, the selected document is used by the source to query its internal document collection. Finding similar documents in the collection is straightforward, as the source contains full descriptions of documents and implements conventional methods for evaluating the relevance (“distance”) between the given document and any other.
Unfortunately, this is not true for the meta-searcher, which, as explained, only receives a short summary of each document. Two immediate solutions are possible: full document retrieval and using the similarity features of the sources. In the first solution, the meta-searcher can retrieve all documents listed in the sources' answers, then analyze and re-rank them “from scratch”. However, downloading the documents takes time, and therefore the complete re-ranking cannot be performed on-line. In the second solution, the meta-searcher can profit from the “Find similar items” features of information sources by forwarding selected documents, but this can work successfully only if most sources provide such a service. Because only a few existing Web sources are adapted to search for similar documents, the majority of existing meta-searchers, such as SavvySearch (www.search.com), simply report the sources' ranks. Some others, such as MetaCrawler (www.metacrawler.com), do allow for a certain similar-document search, but the “more like this” action will in fact be that of the source providing the document, if that source is capable of providing such a service.
Further, in meta-search engines, answers to a user query are retrieved from different information sources, but the sources' heterogeneity prevents the direct reuse of the ranking or scoring information given by these sources. Thus, one of the important issues in meta-searching is the ranking of documents received from different sources. The number of documents relevant to a user query at one information source is often large, and the sources rank documents using different ranking methods. These methods are often protected and hidden from users, and thus also from the interrogating meta-searcher. As a result, meta-searchers have difficulty unifying the sources' ranks and providing a final and unique ranking of the answers delivered to the user.
Another problem is giving the user greater satisfaction with the query answers. When formulating precise queries with an individual search engine, an experienced user can benefit from the features of the source's query language, including attribute search, Boolean constraints, proximity operators, etc. However, even for a well-prepared query, hundreds of documents may fit the query, so that, to get satisfactory results, the user query often undergoes multiple refinements.
The situation becomes more complicated in a meta-searcher, where certain important aspects of meta-searching are purposely hidden from users: which information sources are contacted for querying, how initial queries are translated into native queries, how many items are extracted from each source, and how items that do not fit the user query are filtered out. All this makes the relationship between a query and the answers less obvious and thus makes query refinement more cumbersome for the user. Since all this knowledge is encoded in different steps of the query processing, it becomes a challenge for the meta-searcher to help the user in query reformulation and in getting the most relevant answers.
A user can provide feedback on a search query result, for example by selecting and unselecting documents listed in the query answer. When a user selects relevant documents from the answer list, the meta-searcher could use a simple solution by sending the selected document as a query to an information source for querying and retrieval, but this works only when the source uses the vector-space model (VSM). In the VSM, the relevance of a similar document is determined by comparing certain document parameters and weighting the result so as to determine a vector distance between the selected document and another one. In short, in the VSM, each document is represented as a vector of keywords. Each element in a vector has a weight on a continuous scale. A typical approach is to take a document and process it until a list of unique words remains. This list, which contains all words in the document, is filtered through an algorithm that removes words that are too common to be searched, e.g., words like “the”, “of” and “a” are routinely filtered out. The remaining list of words is then depicted as a vector space, where each word represents a dimension. The length of the vector can be determined in a number of ways, ranging from basic algorithms that make the vector longer when words occur more often, to complex ones that take into account term frequency and inverse document frequency measures.
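The vector-space steps above (word extraction, removal of common words, term weighting, vector comparison) can be illustrated with a minimal sketch. It assumes a tiny hand-picked stop-word list, a plain tf·idf weighting, and cosine similarity as the vector comparison; none of these specific choices is stated in the text, and the helper names are hypothetical.

```python
import math
from collections import Counter

# Illustrative stop-word list (an assumption; real systems use longer lists).
STOP_WORDS = {"the", "of", "a", "and", "in", "to", "is"}


def tokenize(text):
    """Lower-case the text, strip punctuation, and drop common words."""
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()                       # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)             # term frequency within this document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "The ranking of search results",
    "Ranking search engines",
    "A recipe for bread",
]
vecs = tfidf_vectors(docs)
similar = cosine(vecs[0], vecs[1])       # shares "ranking" and "search"
unrelated = cosine(vecs[0], vecs[2])     # shares no terms
```

Here the first two documents share weighted terms and so score higher than the unrelated pair, which has no dimension in common.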
However, while modern search engines generally use the VSM in Web systems based on information retrieval technology, a different model is used in Web systems that query data in databases. This model is called the enhanced Boolean model (EBM); in it, all documents in the collection, whether they satisfy the Boolean query or not, are ranked by a relevance score. In a Boolean model, documents are represented as sets: a document is indexed by assigning it a number of keywords. When a user submits a query, a similarity function tries to match the query with all documents in the index. In a strict Boolean model, the similarity function only returns documents that exactly match the query given by the user. That is why most search engines use the enhanced Boolean model, which is less restrictive, as it returns a list of documents that match according to a similarity percentage. No distance is determined, as there merely exists a list of ranked documents; a higher ranking therefore does not necessarily mean that a document's content is similar to that of the selected document.
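The contrast between the strict and the enhanced Boolean model can be sketched as follows. This is a simplified illustration: the query is assumed to be a plain conjunction of keywords, and the relevance score is assumed to be the fraction of query keywords a document contains. Actual sources use their own, usually undisclosed, scoring; the function names and the sample index are hypothetical.

```python
def strict_boolean(index, query_terms):
    """Return only the documents whose keyword set contains every query term."""
    q = set(query_terms)
    return sorted(doc_id for doc_id, keywords in index.items() if q <= keywords)


def enhanced_boolean(index, query_terms):
    """Rank ALL documents by the share of query terms they match (assumed score)."""
    q = set(query_terms)
    if not q:
        raise ValueError("query must contain at least one term")
    scored = [(len(q & keywords) / len(q), doc_id)
              for doc_id, keywords in index.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))   # best score first
    return [(doc_id, score) for score, doc_id in scored]


# Each document is indexed as a set of keywords (assumed sample data).
index = {
    "d1": {"meta", "search", "ranking"},
    "d2": {"search", "engine"},
    "d3": {"bread", "recipe"},
}
query = ["meta", "search"]

exact = strict_boolean(index, query)      # ['d1']
ranked = enhanced_boolean(index, query)   # [('d1', 1.0), ('d2', 0.5), ('d3', 0.0)]
```

The ranked list includes `d3` even though it matches nothing, and the scores say how well each document fits the query, not how close two documents are to each other.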
Further, if the enhanced Boolean model were used, it might be possible to adopt the schema for learning classical Boolean queries. In machine learning, monotone Boolean queries can be learned efficiently in polynomial time. However, the assumptions imposed by the theoretical learning mechanism turn out to be too strong for real querying, where the user cannot be forced to give feedback on each answer document or be prohibited from altering the relevance marks on documents in successive refinements. Formally, it means the learning should use the user relevance feedback that ca
Corrielus Jean M.
Oliff & Berridge, PLC
Xerox Corporation