Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2001-06-20
2004-09-21
Amsbury, Wayne (Department: 2171)
active
06795820
ABSTRACT:
FIELD OF THE INVENTION
The present invention is generally directed to the field of information search and retrieval, and more particularly to techniques for locating and ranking documents and/or other forms of information that are contained in multiple collections accessible via a network.
BACKGROUND OF THE INVENTION
Computer networking technology has made large quantities of digital content available to users, resulting in a phenomenon popularly known as information overload; users have access to much more information and entertainment than they can absorb. Significant practical and commercial value has therefore been provided by search technologies, whose goal is to identify the information that is of greatest utility to a user within a given content collection.
The quality of a search is typically quantified by two measures. First, a search should find all the information in a collection that is relevant to a given query. Second, it should suppress information that is irrelevant to the query. These two measures of success correspond to recall and precision, respectively. A search is considered less effective to the extent that it cannot maximize both measures simultaneously. Thus, while one may be able to increase recall by relaxing parameters of the search, such a result may be achieved only at the expense of precision, in which case the overall effectiveness of the search has not been enhanced.
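As an illustration of the two measures, the following minimal sketch computes precision and recall for a single query. The function name, the document identifiers and the relevance judgments are hypothetical and are not taken from the disclosure; they only show how relaxing a search can raise recall while lowering precision.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers returned by the search (hypothetical data).
    relevant:  identifiers judged relevant to the query (hypothetical data).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of the returned documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of the relevant documents that were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}
narrow = ["d1", "d2"]                          # strict search parameters
broad = ["d1", "d2", "d3", "d5", "d6", "d7"]   # relaxed search parameters

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 0.75)
```

In this toy case the relaxed search raises recall from 0.5 to 0.75 but lowers precision from 1.0 to 0.5, which is the trade-off described above.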
A metasearch combines results from more than one search, with each search typically being conducted over a different content collection. Often, the various content collections are respectively associated with different information resources, e.g. different file servers or databases, in which case the metasearch is sometimes referred to as a distributed search. The present invention is concerned with the difficulty of maximizing both the recall and precision of a metasearch, particularly one that is conducted via distributed resources. The following discussion explains the sources of such difficulty.
For simplicity of exposition, the issues will be discussed herein with reference to keyword-based queries of text-based content. As practitioners familiar with the field will recognize, the disclosed principles are easily generalized to queries of text-based content that are not purely keyword-based (such as natural-language queries into parsed documents), as well as to queries of content that is not text-based (such as digital sounds and images). The applicability of the present invention to methods for processing such queries will be readily apparent to those skilled in the art.
To facilitate an understanding of the invention, the following definitions are used in the context of exemplary keyword-based searches that are employed to describe the invention. A “term” is defined to be a word or a phrase. A “query” is a set (mathematically, a bag) of terms that describes what is being sought by the user. A “document” is a pre-existing set of terms. A “collection” is a pre-existing set of documents. A “metacollection” is a pre-existing set of collections.
A ranked search is the procedure of issuing a query against a collection and finding the documents that score highest with respect to that query and that collection. The dependence of each score on the entire collection often stems from the well-known technique of weighting most strongly those search terms that are least common in the collection. For example, the query “high-tech farming” would be likely to select the few documents in a computer collection that contain the term “farming”, and the few documents in an agriculture collection that contain the term “high-tech”.
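The collection dependence of a score can be sketched with an inverse-document-frequency weight. The smoothed formula, the function names and the two toy collections below are assumptions chosen for illustration; the disclosure does not prescribe a particular weighting formula.

```python
import math
from collections import Counter


def idf(term, collection):
    """Weight a term more strongly the rarer it is in the collection.

    Uses a smoothed inverse document frequency (an assumed formula).
    """
    df = sum(1 for doc in collection if term in doc)
    return math.log((1 + len(collection)) / (1 + df))


def score(query, doc, collection):
    """Score one document (a list of terms) against a query (a bag of terms)."""
    tf = Counter(doc)
    return sum(tf[term] * idf(term, collection) for term in query)


def ranked_search(query, collection):
    """Return indices of documents with a positive score, best first."""
    scored = [(score(query, doc, collection), i) for i, doc in enumerate(collection)]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [i for s, i in ranked if s > 0]


# Hypothetical toy collections.
computer = [
    ["high-tech", "chip"],
    ["high-tech", "software"],
    ["high-tech", "farming", "sensor"],
    ["high-tech", "network"],
]
agriculture = [
    ["farming", "soil"],
    ["farming", "crop"],
    ["farming", "high-tech", "tractor"],
    ["farming", "harvest"],
]

query = ["high-tech", "farming"]
print(ranked_search(query, computer))     # [2]
print(ranked_search(query, agriculture))  # [2]
```

In the computer collection every document contains "high-tech", so that term carries no weight and only the document mentioning "farming" is selected; the agriculture collection behaves the opposite way.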
A metasearch is the procedure of responding to a query against a metacollection by combining results from multiple searches. For the metasearch to be maximally precise, it should find the documents that score highest with respect to the metacollection, not those that score highest with respect to the individual collections in which they reside. For example, in a metasearch over the two aforementioned collections, if a query contains the term “computer,” an incorrect implementation would give undue weight to computer-related documents that appear in the agriculture collection. The practical impacts of this effect are substantial to the extent that a metacollection is used to cull information from diverse collections, each with a different specialty or focus.
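The difference between scoring against an individual collection and scoring against the metacollection can be sketched as follows. The smoothed inverse-document-frequency formula, the function names and the toy data are assumptions made for illustration only, not the method of the disclosure.

```python
import math


def doc_freq(term, collection):
    """Number of documents in the collection that contain the term."""
    return sum(1 for doc in collection if term in doc)


def idf(df, n):
    """Smoothed inverse document frequency (an assumed weighting formula)."""
    return math.log((1 + n) / (1 + df))


def local_weight(term, collection):
    """Term weight computed from a single collection's statistics."""
    return idf(doc_freq(term, collection), len(collection))


def global_weight(term, collections):
    """Term weight computed from the statistics of the whole metacollection."""
    n = sum(len(c) for c in collections)
    df = sum(doc_freq(term, c) for c in collections)
    return idf(df, n)


# Hypothetical toy collections; each document is a set of terms.
computer = [
    {"computer", "chip"},
    {"computer", "network"},
    {"computer", "software"},
    {"computer", "farming"},
]
agriculture = [
    {"soil", "crop"},
    {"tractor", "computer"},
    {"harvest"},
    {"irrigation"},
]

w_computer = local_weight("computer", computer)
w_agriculture = local_weight("computer", agriculture)
w_meta = global_weight("computer", [computer, agriculture])
print(w_computer, w_agriculture, w_meta)
```

With per-collection statistics, the single agriculture document that mentions "computer" receives a larger weight for that term than any document in the computer collection, which is the undue weighting described above. With metacollection statistics the term has one weight everywhere.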
A process that executes an individual search is called a search engine. A process that invokes search engines and combines results is known as a metasearch engine.
FIG. 1 depicts the general components of a metasearch system. Typically, the user presents a query to a metasearch engine 10. The metasearch engine forwards this query on to multiple search engines 12a, 12b . . . 12n, each of which is associated with a collection 14a-14n of information content, e.g. documents 15. Most documents are likely to appear in only one collection. However, some documents can appear in more than one collection, as depicted by the overlap of the sets of documents 15 in collections 14b and 14n. In such a case, multiple references to a document can appear in the results of a metasearch which employs both of these collections. A well-designed metasearch engine attempts to remove duplicates whenever possible.
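Duplicate removal of the kind just mentioned can be sketched as below. The sketch assumes that each result carries a stable document identifier and a score comparable across engines; both assumptions, along with the function name and the sample data, are hypothetical, and the disclosure does not specify this merging rule.

```python
def merge_results(result_lists):
    """Merge ranked lists from several engines, keeping one entry per document.

    result_lists: iterable of lists of (doc_id, score) pairs.
    When a document is returned more than once, the highest score is kept
    (an assumed policy). The merged list is sorted best first, with ties
    broken by identifier.
    """
    best = {}
    for results in result_lists:
        for doc_id, score in results:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from two engines whose collections overlap.
from_engine_b = [("doc-a", 0.9), ("doc-b", 0.4)]
from_engine_n = [("doc-b", 0.7), ("doc-c", 0.5)]

merged = merge_results([from_engine_b, from_engine_n])
print(merged)  # [('doc-a', 0.9), ('doc-b', 0.7), ('doc-c', 0.5)]
```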
The relationship between search engines and collections need not be one-to-one. For example, as depicted in FIG. 1, two different search engines 12b and 12c may both execute a query against the same collection 14b. In the context of the present invention, this situation is considered to be within the meaning of executing a query on different collections, namely the collection 14b as processed by the search engine 12b, and the collection 14b as processed by the search engine 12c. In some cases, the two search engines could operate with different sets of heuristics. In such a situation the two search engines might produce different results, e.g., different rankings within the respective documents of the same collection. In the particular situation depicted in FIG. 1, since some documents are common to collections 14b and 14n, three references to those documents could be returned to the metasearch engine by search engines 12b, 12c and 12n, respectively.
The metasearch engine 10 and the various search engines 12 execute on computers that communicate with one another via a network. In a fully distributed metasearch, each engine 10, 12 executes on a different machine. In a less distributed system, two or more of these engines may execute on the same machine. For instance, the search engines 12a and 12b may execute on the same computer 16, or the metasearch engine 10 and one or more of the search engines 12 may execute on the same computer. Similarly, the various collections 14 may reside in different respective storage systems, or any two or more of them can share a common storage facility. The efficiency with which information is exchanged between the metasearch engine 10 and the various search engines 12 via the network is a significant factor in the overall user experience.
In a system that implements metasearch capability, it is desirable to identify the documents that score highest with respect to the metacollection, i.e. the totality of the collections 14a-14n. The more significant components of the system are the search engines, the metasearch engine, and the protocol by which they communicate. When the search engines exist on different machines in a distributed network, it is further desirable for the communication protocol to minimize the amount of latency perceived by the user, as well as the resource burden in terms of bandwidth and processing power.
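One common way to limit the latency perceived by the user is to forward the query to all search engines concurrently, so that the total wait approaches that of the slowest engine rather than the sum of all engines. The sketch below assumes this approach; the disclosure at this point does not state it. The engine names and the stub search functions are hypothetical stand-ins for network calls.

```python
from concurrent.futures import ThreadPoolExecutor


def metasearch(query, engines):
    """Forward one query to every engine concurrently and collect the results.

    engines: dict mapping a hypothetical engine name to a callable that
    accepts the query and returns a list of (doc_id, score) pairs.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        futures = {name: pool.submit(search, query) for name, search in engines.items()}
        # result() blocks until the corresponding engine has answered.
        return {name: future.result() for name, future in futures.items()}


# Stub engines standing in for remote search engines.
def engine_a(query):
    return [("doc-a", 0.9)] if "farming" in query else []


def engine_b(query):
    return [("doc-b", 0.6)] if "high-tech" in query else []


results = metasearch(["high-tech", "farming"], {"12a": engine_a, "12b": engine_b})
print(results)  # {'12a': [('doc-a', 0.9)], '12b': [('doc-b', 0.6)]}
```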
Numerous metasearch implementations exist in the commercial world and in the academic literature. Because of fundamental differences in approach, these vary significantly in precision.
FIG. 2 illustrates a taxonomy of the various implementations for metasearch techniques. Before d
Amsbury Wayne
Burns Doane Swecker & Mathis L.L.P.
NextPage, Inc.