System and method for efficient representation of data set...

Electrical computers and digital processing systems: multicomput – Computer network managing

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C709S218000, C709S215000, C709S245000, C709S216000, C707S793000, C707S793000, C707S793000

Reexamination Certificate

active

06301614

ABSTRACT:

The present invention relates to a system and method for representation of document addresses in a web crawler and, more particularly, to a method for efficiently representing the addresses of downloaded documents even when memory space is relatively small.
BACKGROUND OF THE INVENTION
Documents on interconnected computer networks are typically stored on numerous host computers that are connected over the networks. For example, so-called “web pages” may be stored on the global computer network known as the Internet, which includes the world wide web. Web pages can also be stored on Intranets, which are typically private networks maintained by corporations, government entities, and other groups. Each web page, whether on the world wide web or an Intranet, has a distinct address called its uniform resource locator (URL), which at least in part identifies the location or host computer of the web page. Many of the documents on Intranets and the world wide web are written in standard document description languages (e.g., HTML, XML). Theses languages allow an author of a document to create hypertext links to other documents. Hypertext links allow a reader of a web page to quickly move to another web page by clicking on the links. These links are typically highlighted in the original web page. A web page containing hypertext links to other web pages generally refers to those pages by their URL's. Links in a web page may refer to web pages that are stored in the same or different host computers.
A web crawler is a program that automatically finds and downloads documents from host computers in an Intranet or the world wide web. When a web crawler is given a set of starting URL's, the web crawler downloads the corresponding documents, then the web crawler extracts any URL's contained in those downloaded documents. Before the web crawler downloads the documents associated with the newly discovered URL's, the web crawler needs to find out whether these documents have already been downloaded. If the documents associated with the newly discovered URL's have not been downloaded, the web crawler downloads the documents and extracts any URL's contained in them. This process repeats indefinitely or until a predetermined stop condition occurs.
Typically, to find out whether the documents associated with a set of discovered URL's have already been downloaded, the web crawler checks a directory of downloaded document addresses. The directory stores the URL's of the downloaded documents, or representations of the URL's. The set of downloaded document addresses could potentially contain addresses of every document on the world wide web. As of 1999 there were approximately 500 million web pages on the world wide web and the number is continuously growing. Even Intranets can store millions of web pages. Thus, web crawlers need efficient data structures to keep track of downloaded documents and any discovered addresses of documents to be downloaded. Such data structures are needed to facilitate fast data checking and to avoid downloading a document multiple times.
One example of a known prior art method designed to facilitate fast data checking and to avoid downloading a document multiple times is the method implemented by the Scooter web crawler used by Alta Visa. In the Scooter web crawler, the set of downloaded document addresses is represented by a set of corresponding fingerprints. Each fingerprint in the set of fingerprints is a fixed-size numerical checksum, calculated directly from its corresponding URL.
For fast data access, the Scooter web crawler stores the set of fingerprints entirely in main memory. Due to the volume of documents on the world wide web, Scooter requires an extremely large main memory for storage of the directory of known web pages. The present invention provides more efficient document address representation and storage methods that avoid certain of the disadvantages and inefficiencies in the prior art.
SUMMARY OF THE INVENTION
The present invention allows an efficient representation of a set of downloaded document addresses using a bounded main memory and an unbounded disk file. This invention also provides efficient address lookup operations.
When a URL is found by the web crawler in a downloaded document, that URL is converted into a fixed size numerical representation based at least in part on the host component of the corresponding URL. The URL's numerical representation is systematically compared to a structured set of stored numerical representations (converted from downloaded document addresses) in multiple memory caches and a disk file. If the new numerical representation is not found in the set of stored numerical representations, the URL's numerical representation is added to the set and its corresponding document is scheduled for downloading.
Main memory usage is user configurable and most of the fixed-size numerical representations of URL's are stored on a disk file. While most of the fixed-size numerical representations of URL's are stored on the disk file, data look-up remains fast because an in-memory cache is used to store the numerical representations of recently looked-up URL's, another in-memory cache is used to store recently added numerical representations, and an index for the disk file is used to reduce the number of disk reads performed by the operating system.
The present application is applicable to both Internet and Intranet web crawlers.


REFERENCES:
patent: 5864852 (1999-01-01), Luotonen
patent: 5898836 (1999-04-01), Freivald et al.
patent: 5974455 (1999-10-01), Monier
patent: 6094649 (2000-07-01), Bowen et al.

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

System and method for efficient representation of data set... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with System and method for efficient representation of data set..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for efficient representation of data set... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2587407

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.