Method for estimating coverage of web search engines

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate


Details

Classification: C707S793000

Status: active

Patent number: 06711568

ABSTRACT:

FIELD OF THE INVENTION
This invention relates generally to search engines used on the World Wide Web, and more particularly to estimating the relative sizes and overlap of indexes maintained by these search engines.
BACKGROUND OF THE INVENTION
In recent years, there has been a dramatic increase in the amount of content that is available on the World Wide Web (the “Web”). Typically, the content is organized as HTML Web pages. The total number of pages accessible through the Web is estimated to number in the hundreds of millions. In order to locate pages of interest, a large number of public search engines are currently in operation, for example, AltaVista, Infoseek, HotBot, Excite, and many others.
A typical search engine will periodically scan the Web with a “spider” or “web crawler” to locate new or changed Web pages. The pages are parsed into an index of words maintained by the search engine. The index correlates words to page locations. Then, using a query interface, users can rapidly locate pages having specific content by combining keywords with logical operators in queries. Usually, the search engine will return a rank-ordered list of pages which satisfy a query. The pages are identified by their Universal Resource Locators (URLs) and a short excerpt. The user can then use a standard Web browser to download interesting pages by specifying their URLs, most often using “hot” links.
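As a rough illustration of the index structure just described, the following Python sketch builds a toy inverted index mapping words to page URLs and answers a conjunctive (AND) query. The sample pages and the whitespace tokenization are stand-ins chosen for the example, not any particular engine's actual pipeline.

```python
# A toy inverted index: maps each word to the set of page URLs containing it.
# Real engines also store term positions, rankings, excerpts, and so on.
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping URL -> page text."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def conjunctive_query(index, words):
    """Return URLs of pages containing ALL of the query words (logical AND)."""
    postings = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*postings) if postings else set()

pages = {
    "http://example.com/a": "web search engine coverage",
    "http://example.com/b": "search engine index overlap",
}
index = build_index(pages)
print(conjunctive_query(index, ["search", "engine"]))  # both URLs
```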
Another type of search engine, called a meta-search engine (e.g., “http://www.metacrawler.com”), accepts a query from a user and passes the query to a number of conventional search engines. Meta-search engines are most useful when the amount of overlap between the indexes of popular search engines is low.
Therefore, users and designers of search engines are often interested in knowing how good the coverage is of different search engines. Here, coverage means the relative sizes of the indexes, i.e., the number of pages indexed, and the relative amount of overlap between indexes, i.e., the number of pages of one search engine indexed by another.
However, currently there is no good way to measure relative coverage of public search engines. Although many studies have tried to measure coverage, the studies often reach contradictory conclusions since no standardized test has been defined. A large bibliography of such studies is maintained at: http://www.ub2.lu.se/desire/radar/lit-about-search-services.html.
Most comparisons are highly subjective since they tend to rely on information such as spider-access logs obtained from a few sites. Often, they make size estimates by sampling with a few arbitrarily chosen queries, which are subject to various biases, and/or by using estimates provided by the search engines themselves. In either case, this makes the estimates unreliable.
For example, if a search engine claims a search result of about 10,000 pages, then the result may well include duplicate pages, aliased URLs, and pages which have since been deleted. In fact, the search engine itself may only scan a small part of its index, say 10%, and return the first couple of hundred pages. The total number of qualifying pages that it thinks it has indexed and could have returned is just an extrapolation.
Therefore, it is desired to provide a standardized method for measuring the relative coverage of search engines. It should be possible to work the method without having privileged access to the internals of the search engines. That is, it should be possible to estimate the coverage from public access points.
SUMMARY OF THE INVENTION
A method is provided for estimating coverage of search engines used with the World Wide Web. Each search engine maintains an index of words of pages located at specific addresses of a network. A random query is generated. The random query is a logical combination of words found in a subset of Web pages. Preferably, this training set of pages is representative of the pages on the Web in general, or possibly of a particular domain.
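A minimal sketch of this step, assuming a small in-memory training set: the lexicon is a table of word counts, and a random conjunctive query is drawn from it. The example pages, the two-word query length, and the frequency-weighted sampling are assumptions made for illustration, not the exact procedure of the invention.

```python
# Build a lexicon of word frequencies from a training set of page texts,
# then draw a random conjunctive (AND) query from it.
import random
from collections import Counter

def build_lexicon(training_pages):
    """Count the frequency of each unique word across the training pages."""
    counts = Counter()
    for text in training_pages:
        counts.update(text.lower().split())
    return counts

def random_conjunctive_query(lexicon, num_words=2):
    """Sample query words with probability proportional to their frequency."""
    words = list(lexicon)
    weights = [lexicon[w] for w in words]
    return " AND ".join(random.choices(words, weights=weights, k=num_words))

training_pages = [
    "estimating the coverage of web search engines",
    "search engines maintain an index of web pages",
]
lexicon = build_lexicon(training_pages)
print(random_conjunctive_query(lexicon))  # e.g. "search AND pages"
```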
The random query is submitted to a first search engine. The first search engine returns a set of addresses in response. The set of addresses identifies pages indexed by the first search engine. A particular address identifying a sample page is randomly selected from this set, and a strong query is generated for the sample page. The strong query is highly dependent on the content of the sample page. The strong query is submitted to other search engines.
The results received from the other search engines are compared to information about the sample page to determine whether the other search engines have indexed the sample page. In other words, random queries are used to extract random pages from one search engine, and strong queries derived from those pages are used to test whether other search engines have indexed them. Thus, the relative sizes of, and the overlap between, the indexes of the first and the other search engines can be estimated.
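To make the estimation step concrete: the fraction of pages sampled from engine A that engine B is found to have indexed estimates |A∩B|/|A|, and symmetrically for B, so the ratio of the two fractions estimates the ratio of the index sizes. A minimal sketch with made-up trial counts:

```python
# Estimate relative index sizes from sampling trials. The counts below are
# hypothetical; they are not results reported in the patent.
def relative_size(found_a_in_b, sampled_from_a, found_b_in_a, sampled_from_b):
    overlap_a_in_b = found_a_in_b / sampled_from_a  # estimates |A ∩ B| / |A|
    overlap_b_in_a = found_b_in_a / sampled_from_b  # estimates |A ∩ B| / |B|
    # Both fractions estimate the same intersection, so their ratio
    # estimates |A| / |B|.
    return overlap_b_in_a / overlap_a_in_b

# If 30 of 100 pages sampled from A are found in B, while 60 of 100 pages
# sampled from B are found in A, then A is about twice the size of B.
print(relative_size(30, 100, 60, 100))  # 2.0
```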
In one aspect of the invention, a lexicon of words is constructed from a training set of pages, and the frequencies of unique words in the lexicon are determined. The lexicon and word frequencies can be used to select the words combined into the random query. The random query can be disjunctive or conjunctive. In another aspect of the invention, the strong query is a disjunction of two conjunctive queries.
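One plausible construction of such a strong query, assumed here purely for illustration, conjoins the sample page's rarest words (rare words being the most distinctive of that page) and joins two such conjunctions with OR:

```python
# Sketch of a "strong" query: a disjunction (OR) of two conjunctive (AND)
# queries over the sample page's rarest words. Choosing the rarest words
# and splitting them evenly are assumptions made for this example.
def strong_query(page_text, word_frequencies, words_per_conjunct=4):
    words = set(page_text.lower().split())
    rare = sorted(words, key=lambda w: word_frequencies.get(w, 0))
    rare = rare[:2 * words_per_conjunct]
    first = " AND ".join(rare[:words_per_conjunct])
    second = " AND ".join(rare[words_per_conjunct:])
    return f"({first}) OR ({second})"

# Hypothetical corpus-wide word frequencies; rare words get low counts.
freqs = {"the": 900, "and": 850, "spider": 3, "parses": 2, "anchor": 4,
         "tokens": 2, "nightly": 1, "documents": 40, "crawls": 5}
page = "the spider crawls documents and parses anchor tokens nightly"
print(strong_query(page, freqs))
```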


REFERENCES:
patent: 5701469 (1997-12-01), Brandli et al.
patent: 5842206 (1998-11-01), Sotomayor
patent: 5848410 (1998-12-01), Walls et al.
patent: 5864863 (1999-01-01), Burrows
patent: 5873079 (1999-02-01), Davis, III et al.
patent: 5873080 (1999-02-01), Coden et al.
patent: 5911139 (1999-06-01), Jain et al.
patent: 5913215 (1999-06-01), Rubinstein et al.
patent: 5926812 (1999-07-01), Hilsenrath et al.
patent: 5933822 (1999-08-01), Braden-Harder et al.
patent: 6094657 (2000-07-01), Hailpern et al.
patent: 6285999 (2001-09-01), Page
patent: 6539373 (2003-03-01), Guha
Kawano, “Mondou: Web search engine with textual data mining”, IEEE, Aug. 1997, pp. 402-405.
