Collaborative team crawling: Large scale information...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate


Details

Classifications: C707S793000, C345S440000, C709S241000, C709S201000

Type: Reexamination Certificate

Status: active

Patent number: 06182085

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a scalable method for collaborative web crawling and information processing. More particularly, the invention concerns a distributed collection of web crawlers used to cover a large portion of cyberspace where the crawlers share the overall cyberspace crawling and collaborate to maximally utilize computing resources.
2. Description of the Related Art
Cyberspace is a popular way for people and industries to rapidly gather information. However, because of the immense amount of information available in cyberspace, automatic information gathering, screening, and delivering systems have become a necessity.
One such system is the Grand Central Station (GCS) system being developed at the IBM Almaden Research Center in San Jose, Calif. This system combines numerous aspects of information discovery and dissemination into a single, convenient system. GCS performs many functions by providing an infrastructure that supports the discovery and tracking of information in a digital domain such as cyberspace, and disseminates these discoveries to those who have an interest.
One of the key components of virtually all information discovery system infrastructures accessing cyberspace (i.e. the Internet) is a Gatherer that systematically gathers data sources (crawls) and transforms or summarizes them into a single, uniform, metadata format. This format generally reflects the format found in the system used by the person requesting the information. Webcasting technology referred to as an “Internet push” is used to match the summarized information with users' profiles and re-channel each piece of information to those who need it.
To assist in gathering information, cyberspace data located at a particular site being reviewed is logically arranged into a graph or tree, commonly referred to as a directed graph. The Gatherer traverses this web-graph looking for desired information. Because of the sheer volume of data available, the graph reviewed might be very large in size. For example, a directed graph representing one million pieces of potentially interesting information would be enormous in size and complexity. A large graph would require a considerable amount of time for the Gatherer to process the information.
To make a Gatherer more efficient, a system that allows partitioning of a web-space directed graph is needed. Preferably, the system would also allow “team-crawling,” where web-space information could be gathered using multiple processors assigned to crawl parts of the same space. However, for such partitioning to work, the problems encountered in automatically partitioning the cyberspace for load balancing among gathering processors need to be overcome. This is a different and much more challenging problem than that addressed in traditional graph partitioning studies, which deal with very large scale integrated (VLSI) circuit design and parallel scientific computing.
For example, one difficulty comes from the fact that a web-space directed graph, used to model the information at the site, is usually not discoverable before the crawling occurs. This is because web sites are dynamic, that is, they are always changing, having information added and deleted up to the point the crawling actually takes place. This constant changing of the information—and therefore the directed graph used to model the information—prevents directly applying the previously mentioned graph partitioning methods that are designed for static (non-changing) graphs. This lack of full knowledge of a web-graph construct before a web space is partitioned also requires the amount of load and the number of hyperlinks across a partition to be changeable at any stage of collaborative crawling, and hence dynamic re-partitioning and load re-balancing would also be necessary.
Another problem that would need to be overcome is the addressing problem that arises in attempting to partition a web-space. For example, given a uniform resource locator (URL), a commonly used designator for the location of a piece of information (object), a quick decision needs to be made as to which partition it belongs in. Depending upon the partition, it would then be sent to a designated processor for crawling and processing. Further, because the web-graph is dynamic, a problem can arise in simply organizing a partition.
SUMMARY OF THE INVENTION
Broadly, the present invention concerns a method using multiple processors for collaborative web crawling and information processing. More particularly, the invention concerns a distributed collection of web crawlers, each of which is used to cover a logical partition created within a cyberspace, where the crawlers share the overall cyberspace crawling and collaborate to maximize the use of available computing resources.
In one embodiment, the invention may be implemented to provide a method to dynamically partition a cyberspace and load balance crawling of the space across one or more Gatherers used by the web-crawling system. One such version uses the hierarchical structure of URL names to define an intermediate structure, called a superpage, that serves as the basic unit for top-level partitioning; a superpage is a collection of URLs that share some initial sub-sequence of their URL names.
For example, a URL such as “cs.cmu.edu/groups/parallel/parallel.html” (a specification of a path in a tree) may be viewed as a sequence of tokens (edu, cmu, cs, groups, parallel, parallel.html). In this embodiment, a superpage whose initial tokens form “cs.cmu.edu/groups/parallel” can be formed, and another superpage whose initial tokens form “math.cmu.edu” may be formed. In one embodiment, this set of superpages may be formed dynamically during the crawling to accommodate any new information. By partitioning and re-partitioning superpages, the method dynamically balances the processing load incurred in information gathering.
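As an illustration of this token-prefix view (a sketch only; the function and variable names below are illustrative and not taken from the patent), the following Python fragment tokenizes a URL, reversing the host so that the most general token comes first, and assigns the URL to the longest matching superpage prefix:

```python
# Illustrative sketch of superpage assignment by URL-token prefix.
# Assumes the scheme ("http://") has already been stripped from URLs.

def tokenize(url):
    """Split a URL into tokens, reversing the host so that the most
    general token (e.g. 'edu') comes first, followed by path segments."""
    host, _, path = url.partition("/")
    tokens = list(reversed(host.split(".")))
    if path:
        tokens += path.split("/")
    return tuple(tokens)

def assign_superpage(url, superpages):
    """Return the longest superpage prefix covering the URL, or None."""
    tokens = tokenize(url)
    best = None
    for prefix in superpages:
        if tokens[:len(prefix)] == prefix and (best is None or len(prefix) > len(best)):
            best = prefix
    return best

superpages = {tokenize("cs.cmu.edu/groups/parallel"),
              tokenize("math.cmu.edu")}
print(assign_superpage("cs.cmu.edu/groups/parallel/parallel.html", superpages))
# → ('edu', 'cmu', 'cs', 'groups', 'parallel')
```

A URL with no covering superpage (`None` here) would trigger the dynamic creation of a new superpage in the scheme described above.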
The methods of the current invention implement an explicitly structured web-graph in which every URL belongs to a superpage. Each superpage may contain pages that are reasonably local to each other, both in physical addresses and in hyperlink connections. Projecting this web-graph onto the set of dynamically created superpages yields a smaller, “coarsened” image of the original web-graph, where this coarsened image approximates the larger web-graph.
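A minimal sketch of this coarsening, assuming the page-level hyperlinks and the page-to-superpage map are already known (names are illustrative, not from the patent):

```python
from collections import defaultdict

def coarsen(page_edges, page_to_superpage):
    """Collapse page-level hyperlink edges into superpage-level edges;
    each coarse edge is weighted by the number of hyperlinks it absorbs."""
    coarse = defaultdict(int)
    for src, dst in page_edges:
        coarse[(page_to_superpage[src], page_to_superpage[dst])] += 1
    return dict(coarse)

page_to_sp = {"a1": "A", "a2": "A", "b1": "B"}
edges = [("a1", "a2"), ("a1", "b1"), ("a2", "b1")]
print(coarsen(edges, page_to_sp))
# → {('A', 'A'): 1, ('A', 'B'): 2}
```

The edge weights give exactly the cross-superpage hyperlink counts that the partitioner would try to keep small.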
In this invention, superpages may be automatically recognized and generated. Necessary information is obtained and maintained to measure the “volumes” of superpages as well as any pattern of connections among superpages. In another embodiment, an access rate for each processor to these superpages is maintained to determine if different processors have different access rates to each superpage. For example, a crawler located at IBM-Almaden in the United States may have a longer access time to a server located in Japan than a crawler located in Japan.
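One way to maintain such per-processor access statistics (a sketch only; the patent does not prescribe a particular estimator, and the smoothing factor here is arbitrary) is an exponential moving average of observed access times:

```python
class AccessStats:
    """Per-processor exponential moving average of access times to each
    superpage; a lower average indicates faster access from this processor."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha   # smoothing factor, chosen arbitrarily
        self.avg = {}        # superpage -> smoothed access time (seconds)

    def record(self, superpage, seconds):
        prev = self.avg.get(superpage)
        self.avg[superpage] = seconds if prev is None else (
            self.alpha * seconds + (1 - self.alpha) * prev)
```

Comparing these averages across processors would identify, for instance, that a crawler in Japan reaches a Japanese server faster than one at IBM-Almaden.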
Once a set of superpages is formed and processor statistical data is obtained, the method employs a partitioning method to automatically generate favored partitions for the superpages and map them among the available processors. The current invention can be used when all information crawlers/gatherers are located on a single, tightly coupled high-performance system, as well as when they are scattered in any distributed environment.
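The patent does not spell out the partitioning heuristic; as one plausible sketch, a greedy longest-processing-time assignment places each superpage (heaviest first) on the currently least-loaded processor:

```python
import heapq

def partition_superpages(volumes, n_processors):
    """Greedily assign superpages to processors: largest volume first,
    each going to the processor with the smallest accumulated load."""
    heap = [(0, p) for p in range(n_processors)]   # (load, processor)
    heapq.heapify(heap)
    assignment = {}
    for sp, vol in sorted(volumes.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        assignment[sp] = p
        heapq.heappush(heap, (load + vol, p))
    return assignment
```

Re-running this assignment as measured volumes change gives the dynamic re-partitioning described earlier; an access-rate-aware variant could weight each volume by the per-processor access times.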
The method allows the processors to communicate with each other to coordinate and handle hyperlinks that cross a partition. A communication buffer, referred to in this application as “TSpaces,” is used to support fast inter-process communication. Logically, TSpaces can be viewed as a large “white-board,” or global communication buffer, accessible by all processors. This function is described, for example, in P. Wyckoff et al., “T Spaces,” IBM Systems Journal, August 1998 (to appear).
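The real TSpaces system is full tuple-space middleware; as a minimal stand-in for the white-board idea, a thread-safe per-processor mailbox lets any crawler post a cross-partition hyperlink for the owning processor to pick up later (all names here are illustrative, not the TSpaces API):

```python
import queue
import threading

class Whiteboard:
    """Minimal stand-in for a TSpaces-style global buffer: any crawler may
    post a URL for the processor that owns its partition; that processor
    later takes pending URLs off its board."""

    def __init__(self):
        self._boards = {}
        self._lock = threading.Lock()

    def _board(self, processor_id):
        with self._lock:
            return self._boards.setdefault(processor_id, queue.Queue())

    def post(self, processor_id, url):
        self._board(processor_id).put(url)

    def take(self, processor_id):
        try:
            return self._board(processor_id).get_nowait()
        except queue.Empty:
            return None
```

When a crawler encounters a hyperlink whose superpage is mapped to another processor, it posts the URL instead of crawling it, keeping each partition's work local.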
In another embodiment, the invention may be implemented to provide a digital signal processing system used to implement the method aspects of the present invention. In one embodiment, this system may include
