Network repository service for efficient web crawling

Data processing: database and file management or data structures – Database design – Data structure types

Details

U.S. classifications: C707S793000, C709S201000, C709S203000, C709S223000, C709S224000, C709S225000, C709S226000

Type: Reexamination Certificate

Status: active

Patent number: 06418453

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to searching and gathering information on computer networks. More specifically, it relates to improved techniques for gathering large amounts of information from a large number of resources on a network, e.g., web crawling.
BACKGROUND OF THE INVENTION
The world wide web (or simply, “the web”) has enjoyed explosive growth in recent years, and now contains enormous amounts of information. This information is not centrally stored, but is distributed throughout millions of web servers. Moreover, the information is not static, but is constantly changing as web servers update, add, delete, or otherwise modify the information they make available to the network.
Popular web search engines allow users to quickly search the dynamic, distributed information on the web. Because searching the web directly would take an enormous amount of time, these search engines search a centralized index that summarizes the information stored on the web. An essential component of this approach to web searching is the task of gathering information from web servers (“web crawling”) and creating the searchable index from the gathered information.
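For illustration only, the centralized index described above can be pictured as a simple inverted index mapping words to the pages that contain them. The patent does not specify an index structure; the sketch below, including all function and variable names, is a hypothetical minimal example.

```python
# Hypothetical minimal inverted index: maps each word to the set of URLs
# whose gathered text contains it. Illustrative only; the patent does not
# detail the index structure used by search engines.
from collections import defaultdict

def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Build a word -> {urls} mapping from crawled page text."""
    index: dict[str, set[str]] = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

pages = {
    "http://a.example/1": "web crawling gathers pages",
    "http://b.example/2": "the index summarizes gathered pages",
}
index = build_index(pages)
print(index["pages"])  # both URLs: searching hits the index, not the live web
```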
Conventional web crawling typically involves a systematic exploration of the web to discover and gather information. Because the amount of information on the entire web is so large, web crawling consumes a proportionately large amount of time and network bandwidth, and places a large burden on both the crawlers and the servers. Moreover, because information on the web is constantly and unpredictably changing, the entire web crawling procedure is periodically repeated in order to keep the index information current. If this recrawling is not performed frequently enough, the web index will contain a large amount of obsolete information for some web sites (“undercrawled sites”) whose content changes often. On the other hand, if recrawling is performed too frequently, valuable computational and network resources are wasted because a large portion of information has not changed at many web sites (“overcrawled sites”).
Conventional web crawling techniques also have the problem that they often do not discover all the information actually available on the web. Their normal strategy for discovering new information is to examine the hyperlinks within known documents. Some information, however, may not have direct hyperlinks from other documents, or may only have direct hyperlinks from other undiscovered documents. As a result, this information is not discovered, gathered, or included into the index used by the search engine.
The present inventors are not aware of any existing techniques by others that effectively address these problems. U.S. Pat. No. 5,860,071 and AT&T Labs Tech. Report #97.23.1 discuss the AT&T Internet Difference Engine (AIDE). The primary purpose of AIDE is to track changes to web documents and display the update information to a user in a personalized manner. This and similar techniques are directed to the problem experienced by users who browse large collections of changing web documents and want to be automatically notified when information of interest to them has changed. They are not directed to the problems associated with web crawling, and do not teach any solution to those problems.
SUMMARY OF THE INVENTION
To address the above problems with the current state of the art, the present inventors have developed a network repository service for efficient web crawling. The repository service supplements the functions of a web server to enable an increase in the efficiency of web crawling. In particular, the repository service: (a) automatically maintains a file modification list that contains the names of files on the server that have been modified (i.e., added, deleted, or otherwise changed), together with the date and time of each modification; and (b) provides a requesting crawler with the file modification list (or the portion of the list corresponding to a time period specified by the crawler). The repository service may also (c) limit or restrict the access privileges of crawlers that do not request the file modification list, thereby protecting the server from overcrawling. Because a crawler can request the file modification list and avoid unnecessarily recrawling files that have not been modified since its last visit, considerable time, network bandwidth, server processing resources, and crawler processing resources are saved. Using the file modification list, the crawler can remove all prior references to deleted files and efficiently recrawl only those files that have been added or changed since its last visit to the web server.
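By way of illustration, the sketch below models the three functions (a)-(c) of the repository service. The patent specifies no wire format, so the /modlist endpoint, the "since" query parameter, the JSON encoding, and the sample entries are all hypothetical choices, not the patented implementation.

```python
# Hypothetical sketch of the repository service, standard library only.
# Endpoint name, query parameter, and JSON encoding are assumptions.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# (a) File modification list: one entry per change, with file name,
# kind of change, and modification timestamp (illustrative sample data).
MOD_LIST = [
    {"file": "/docs/index.html", "change": "modified", "time": 1700000000},
    {"file": "/docs/new-page.html", "change": "added", "time": 1700090000},
    {"file": "/docs/old-page.html", "change": "deleted", "time": 1700180000},
]

class RepositoryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/modlist":
            # (c) Crawlers that do not request the modification list could be
            # throttled or refused; a bare 403 stands in for that policy here.
            self.send_error(403, "crawlers must use /modlist")
            return
        # (b) Return only the entries newer than the crawler's last visit,
        # i.e., the portion of the list for the requested time period.
        since = int(parse_qs(url.query).get("since", ["0"])[0])
        body = json.dumps([e for e in MOD_LIST if e["time"] > since]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), RepositoryHandler).serve_forever()
```

In this reading, the modification list is an append-only log filtered by timestamp, which keeps the per-request work proportional to the number of recent changes rather than to the size of the site.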
The present technique solves the problems associated with both overcrawling and undercrawling. Because crawlers that request the file modification list will not unnecessarily recrawl unmodified files, they will no longer overcrawl web servers whose data is infrequently modified. Crawlers that do not request the file modification list, on the other hand, will have their access limited or restricted, preventing them from overcrawling the web server. The problem of undercrawling is solved by the present technique by virtue of the increased efficiency in crawling. Because all unnecessary crawling is eliminated, resources are made available for more frequent crawling of information that actually is changing, as well as for other uses. Consequently, any index produced from the information gathered by the crawler will have more current information. The present technique also has the advantage that it informs the crawlers of all new web content. As a result, web crawlers will not miss documents that are not linked to known documents, and the information gathered by the crawler will be more complete.
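A matching crawler-side sketch, under the same hypothetical endpoint and field names as above: it prunes index entries for deleted files and reports which added or modified files are worth re-fetching, leaving the actual page fetches to the server's ordinary document handler, which the sketch above omits.

```python
# Hypothetical crawler-side counterpart to the repository-service sketch.
import json
from urllib.request import urlopen

def plan_recrawl(server: str, last_visit: int, index: dict) -> list[str]:
    """Drop deleted files from the index; list files worth re-fetching."""
    with urlopen(f"{server}/modlist?since={last_visit}") as resp:
        changes = json.load(resp)
    to_fetch = []
    for entry in changes:
        if entry["change"] == "deleted":
            # Remove all prior references to the deleted file.
            index.pop(entry["file"], None)
        else:
            # Only files added or modified since the last visit are recrawled;
            # unmodified files are skipped entirely.
            to_fetch.append(entry["file"])
    return to_fetch

index: dict = {}
print(plan_recrawl("http://localhost:8000", last_visit=0, index=index))
```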


REFERENCES:
patent: 5845290 (1998-12-01), Yoshii
patent: 5860071 (1999-01-01), Ball et al.
patent: 5890152 (1999-03-01), Rapaport et al.
patent: 6038610 (2000-03-01), Belfiore et al.
patent: 6073135 (2000-06-01), Broder et al.
patent: 6182085 (2001-01-01), Eichstaedt et al.
patent: 6249795 (2001-06-01), Douglis
patent: 6263364 (2001-07-01), Najork et al.
patent: 6269370 (2001-07-01), Kirsch
patent: 6292894 (2001-09-01), Chipman et al.
patent: 6295529 (2001-09-01), Corston-Oliver et al.
Douglis, F. et al., "The AT&T Internet Difference Engine: Tracking and Viewing Changes on the Web," AT&T Labs-Research Technical Report #97.23.1, Apr. 14, 1997.
