Method and apparatus for finding mirrored hosts by analyzing...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000, C707S793000, C709S203000, C709S230000

active

06286006

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention relates generally to a method and apparatus for finding mirrored hosts and, specifically, to a method and apparatus for finding mirrored hosts by analyzing connectivity and naming structures of the host and of the web pages of the host.
In recent years, the World Wide Web (“the web”) has grown hugely in popularity and use. Currently, almost any type of information can be found on the web if one knows where to look. Knowing where to look has become increasingly problematic because the number of web sites that make up the web has grown at an astounding rate since the early 1990s. Increasingly sophisticated software, such as search engines and web browsers, has been developed to allow users of the web to locate information on the web. Other software, such as proxy servers, improves the speed and security of web usage.
A web crawler is a software program that fetches a set of pages from the web by following hyperlinks between the pages. Search engines, such as Compaq Computer Corporation's Alta Vista search engine, employ crawlers to build the web page indexes used by the search engine. Web browsers are applications that fetch and display web pages to a user. Proxy servers (proxies) fetch web pages from web server systems on behalf of web browsers. For efficiency reasons, proxies and browsers sometimes cache web pages (that is, store their content locally). Thus, if a cached page is requested a second time, it can be retrieved from a local cache.
A page in the web is accessed by its web address, also called a Uniform Resource Locator (URL). The URL of a web page has three parts: 1) an access type (such as “http”), 2) a host name, which identifies the host on which the page is stored, and 3) a path, which specifies a location within the host. A web site is made up of one or more web pages. As shown in FIG. 7, the format of a URL of a web page looks like:
<access type>://<host>/<path>
where <host> is the name of the web server that stores the web site and <path> is the path of the page within the web server.
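As an illustration of the three-part format above, the following minimal Python sketch splits a URL into its access type, host, and path using the standard library's urllib.parse module. The function name split_url and the example.com address are assumptions made for this example; they do not appear in the patent.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return (access_type, host, path) for a URL string.

    split_url is a hypothetical helper name used only for this sketch.
    """
    parts = urlsplit(url)
    # urlsplit calls the access type the "scheme"; hostname is lower-cased
    # and excludes any port number. The leading "/" is dropped so the path
    # matches the <access type>://<host>/<path> layout.
    return parts.scheme, parts.hostname, parts.path.lstrip("/")


if __name__ == "__main__":
    print(split_url("http://www.example.com/docs/index.html"))
```

Two URLs that differ only in the host part, with identical paths, are the kind of pair that the mirroring discussion below is concerned with.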
It has become increasingly common to duplicate all or part of certain popular web sites. For example, download hosts for certain popular software are often “mirrored” so that users can obtain the same downloadable software from any one of the mirrored hosts. Mirroring is the systematic replication of content across hosts. Mirroring happens when distinct hosts provide access to copies of the same data. Because mirrored hosts allow users to obtain the same information from any of the mirrored hosts, mirroring helps avoid bottlenecks at popular hosts. Hosts are also mirrored for a variety of other reasons. Mirrored hosts may have identical page structures or they may contain only certain pages and page structures that are identical. In this document, two separate tests are used when determining mirrors. In a first test, two hosts, A and B, are “mirrors” if and only if for every document on host A there is a highly similar document on B with the same path, and vice versa. A second test categorizes pairs of hosts according to a plurality of mirroring categories, where the categories represent degrees of mirroring. Under the second test, the two hosts do not have to be exactly matched in structure and/or content to be mirrored hosts.
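The first test can be sketched in Python as follows. This is only an illustration under stated assumptions: each host is represented as a dictionary from path to document text, “highly similar” is approximated with difflib.SequenceMatcher and a threshold of 0.9, and the names highly_similar and are_mirrors are hypothetical. The text above does not specify a similarity measure or a threshold.

```python
from difflib import SequenceMatcher


def highly_similar(doc_a, doc_b, threshold=0.9):
    """Assumed similarity check; the 0.9 threshold is an illustrative choice."""
    return SequenceMatcher(None, doc_a, doc_b).ratio() >= threshold


def are_mirrors(pages_a, pages_b, similar=highly_similar):
    """First test: every path on A exists on B with a similar document, and vice versa.

    pages_a and pages_b map a path to the text of the document at that path.
    """
    # "Same path ... and vice versa" means both hosts expose the same set of paths.
    if pages_a.keys() != pages_b.keys():
        return False
    # Check similarity in both directions because the measure may not be symmetric.
    return all(
        similar(pages_a[path], pages_b[path]) and similar(pages_b[path], pages_a[path])
        for path in pages_a
    )
```

The second test, which assigns host pairs to degrees of mirroring, would relax the requirement that the path sets match exactly; it is not sketched here because its categories are not defined in this section.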
Crawlers, search engines, and proxy servers all fetch large numbers of pages on the web. If these programs could detect mirroring in hosts, they could refrain from fetching content from all but one of the mirrored hosts, thus reducing the number of pages fetched and improving their overall performance. Given a large list of URLs encountered on the web (such as a list collected by a crawler or the list of URLs viewed by a central proxy of a large Internet Service Provider), it is desirable to be able to determine which hosts are mirrored. Some specific examples are provided below.
Often search engines index only one copy of a mirrored page. In the process, they may fetch replicas and discard them. If mirroring information were available, a search engine could avoid fetching replicas from known mirrored hosts. The search engine could also distribute fetches of the remaining pages between the mirrors for load balancing, or choose the best mirror in terms of response time.
Proxy servers and web browsers maintain cached copies of downloaded pages to avoid re-fetching. The effectiveness of such caches can be increased if mirroring information is available. When a URL needs to be fetched, the cache is first checked. If a requested page has not yet been fetched, but it is determined that a page from a mirrored host with the same path has been fetched and is available in the cache, the cached mirror page can be used instead of fetching the requested page.
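The cache behavior described above can be sketched as follows. This is a minimal illustration, assuming that mirror information is supplied as groups of host names and that pages are keyed by host and path; the class name MirrorAwareCache and its methods are hypothetical and are not part of any particular proxy or browser.

```python
class MirrorAwareCache:
    """Page cache that falls back to a cached copy from a known mirror."""

    def __init__(self, mirror_groups):
        # mirror_groups: iterable of sets of host names known to mirror each other.
        self._group_of = {}
        for group in mirror_groups:
            frozen = frozenset(group)
            for host in frozen:
                self._group_of[host] = frozen
        self._pages = {}  # (host, path) -> page content

    def store(self, host, path, content):
        self._pages[(host, path)] = content

    def lookup(self, host, path):
        # First check for the exact page that was requested.
        if (host, path) in self._pages:
            return self._pages[(host, path)]
        # Otherwise look for the same path on any host mirrored with this one.
        for mirror in self._group_of.get(host, ()):
            if (mirror, path) in self._pages:
                return self._pages[(mirror, path)]
        return None  # cache miss: the caller must fetch the page
```

For example, after storing a page for one host of a mirrored pair, a lookup for the same path on the other host returns the cached copy instead of triggering a fetch.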
Thus, the ability to identify mirrored hosts would improve the speed and efficiency of operation of software accessing the World Wide Web.
Certain conventional web crawling software is able to identify some mirrored web sites by using Domain Name System (DNS) lookup. When a crawler fetches a URL, it first needs to convert the hostname of the URL to a corresponding Internet Protocol (IP) address to establish a network connection. Such lookups are done using a service known as DNS. A DNS lookup returns one or more IP addresses for each hostname. Crawlers usually treat hosts that have an IP address in common as mirrors to avoid redundant fetching. This method does not always identify all mirrored hosts and may misidentify some hosts as mirrored that are not mirrored. For example, a “virtual host” is a host that hosts more than one web site but has a single IP address. The web sites hosted by a virtual host web server, while all having the same IP address, are not necessarily mirrors. Similarly, not all mirrored hosts share a common IP address. In addition, some hosts may have more than one IP address. Thus, IP matching alone is not always sufficient to prove that two hosts are mirrors of each other.
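The IP-matching heuristic and its two failure modes can be illustrated with a short Python sketch. It assumes the hostname-to-address mapping has already been obtained (in a real crawler it would come from DNS lookups); the function name ip_mirror_candidates, the host names, and the addresses, which are taken from the reserved documentation ranges, are all made up for this example.

```python
from collections import defaultdict
from itertools import combinations


def ip_mirror_candidates(host_ips):
    """Return host pairs that share at least one IP address.

    host_ips maps a host name to the set of IP addresses it resolves to.
    """
    hosts_by_ip = defaultdict(set)
    for host, ips in host_ips.items():
        for ip in ips:
            hosts_by_ip[ip].add(host)
    pairs = set()
    for hosts in hosts_by_ip.values():
        # Every pair of hosts behind the same address is treated as a mirror pair.
        for a, b in combinations(sorted(hosts), 2):
            pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    example = {
        # Two unrelated sites served by one virtual host: flagged, but not mirrors.
        "shop.example": {"192.0.2.10"},
        "blog.example": {"192.0.2.10"},
        # Two true mirrors on different machines: missed by IP matching.
        "mirror1.example": {"192.0.2.20"},
        "mirror2.example": {"198.51.100.7"},
    }
    print(ip_mirror_candidates(example))
```

In this made-up data the heuristic reports only the virtual-host pair, which is a false positive, and misses the true mirrors, which is the shortcoming the paragraph above describes.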
SUMMARY OF THE INVENTION
The described embodiment of the present invention addresses not only the problem of finding identical mirror hosts, but also the problem of finding hosts that are not completely identical, but contain a significant amount of shared content. This information is useful in understanding the composition of the web and the collaborations ongoing between principals on the web.
The described embodiment of the invention detects mirrored host pairs using information about a large set of pages, including URLs. The identities of the detected mirrored hosts are then saved so that browsers, crawlers, proxy servers, or the like can correctly identify mirrored web sites. The described embodiments of the present invention look at the URLs of pages on hosts to determine whether the hosts are potentially mirrored.
In accordance with the purpose of the invention, as embodied and broadly described herein, the invention relates to a method of determining mirrored web hosts, comprising: receiving information about the addresses of a plurality of web sites stored on a plurality of hosts; determining a plurality of terms of the URLs associated with every host; weighting the terms in inverse proportion to frequency; determining a similarity score for each host pair in accordance with the weighted terms; and outputting a list of potential pairs of mirrored hosts in accordance with their similarity scores.
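The steps of the method summarized above can be sketched in Python. This is not the patented implementation; it is a minimal illustration under several assumptions that the summary does not state: terms are taken to be the lower-cased alphanumeric tokens of each URL path, a term's frequency is the number of hosts on which it occurs, its weight is the reciprocal of that count, and the similarity score is the weighted overlap of two hosts' term sets divided by the weight of their union. The names host_terms and rank_candidate_mirrors are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations
from urllib.parse import urlsplit


def host_terms(urls):
    """Map each host to the set of terms found in the paths of its URLs."""
    terms = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        if parts.hostname is None:
            continue  # skip strings that do not contain a host
        # Assumed tokenization: runs of letters and digits in the path.
        terms[parts.hostname].update(re.findall(r"[a-z0-9]+", parts.path.lower()))
    return terms


def rank_candidate_mirrors(urls):
    """Return (score, host_a, host_b) tuples, highest score first."""
    terms = host_terms(urls)
    # Frequency of a term = number of hosts whose URLs contain it.
    host_count = defaultdict(int)
    for term_set in terms.values():
        for term in term_set:
            host_count[term] += 1
    # Weight each term in inverse proportion to its frequency.
    weight = {term: 1.0 / count for term, count in host_count.items()}
    scored = []
    for a, b in combinations(sorted(terms), 2):
        shared = sum(weight[t] for t in terms[a] & terms[b])
        total = sum(weight[t] for t in terms[a] | terms[b])
        if shared > 0:
            scored.append((shared / total, a, b))
    # Output the candidate pairs in order of decreasing similarity.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return scored
```

Comparing every pair of hosts is quadratic in the number of hosts, so a sketch like this only suits small inputs. Rare path terms dominate the score, so two hosts that share unusual directory or file names rank above hosts that share only common terms.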
Advantages of the invention will be set forth in part in the description which follows and in part will be obvious from the description or may be learned by practice of the invention. The objects and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims and equivalents.

