Method and system for incremental web crawling

Data processing: database and file management or data structures – Database design – Data structure types


Details

Patent number: 06631369
Type: Reexamination Certificate (active)
U.S. Classification: C707S793000


TECHNICAL FIELD
The present invention relates generally to the fields of computerized publishing and knowledge management, and more particularly to Web crawler applications used, e.g., by Internet search engines. The invention, however, is not limited to use in a Web crawler. On the contrary, the invention could be used in a mail server, directory service, or any system requiring indexing or one-way replication of a document store.
BACKGROUND OF THE INVENTION
There has recently been a tremendous growth in the number of computers connected to the Internet. A client computer connected to the Internet can download digital information from server computers. Client application software typically accepts commands from a user and obtains data and services by sending requests to server applications running on the server computers. A number of protocols are used to exchange commands and data between computers connected to the Internet. The protocols include the File Transfer Protocol (FTP), the Hyper Text Transfer Protocol (HTTP), the Simple Mail Transfer Protocol (SMTP), and the Gopher document protocol.
The HTTP protocol is used to access data on the World Wide Web, often referred to as “the Web.” The Web is an information service on the Internet providing documents and links between documents. It is made up of numerous Web sites located around the world that maintain and distribute electronic documents. A Web site may use one or more Web server computers that store and distribute documents in a number of formats, including the Hyper Text Markup Language (HTML). An HTML document contains text and metadata (commands providing formatting information), as well as embedded links that reference other data or documents. The referenced documents may represent text, graphics, or video.
A Web browser is a client application or, preferably, an integrated operating system utility that communicates with server computers via FTP, HTTP and Gopher protocols. Web browsers receive electronic documents from the network and present them to a user.
An intranet is a local area network containing Web servers and client computers operating in a manner similar to the World Wide Web described above. Typically, all of the computers on an intranet are contained within a company or organization.
The term “search engine” is often used generically to describe both true search engines and directories, although they are not the same. Search engines typically create their listings automatically by “crawling” the Web. A directory, on the other hand, depends on humans for its listings, i.e., a person submits a short description for an entire site or editors write a description for sites they review. The present invention is particularly suited (although not necessarily limited) for use in a search engine of the type that gathers information automatically, i.e., by “crawling” the Web.
Search engines typically include a “crawler” (also called a “spider” or “bot”) that visits a Web page, reads it, and then follows links to other pages within the site. The crawler returns to the site on a regular basis to look for changes. Everything the crawler finds goes into an index, which is another part of the search engine. The index may be viewed as a file or container holding a copy of every Web page that the crawler finds. The primary purpose of the index is to provide a way to quickly look up a document URL based on words specified in a query. If a Web page changes, then the index is updated with new information. The search engine software, which is yet another part of the search engine, is a program that sifts through the pages recorded in the index to find documents fulfilling a search query submitted by a user. The search engine software will typically rank the matches in accordance with their relevance.
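The index described above is conventionally realized as an inverted index: a mapping from each word to the set of document URLs containing it, so that a query can be answered by intersecting the sets for its words. The following is a minimal sketch of that idea; the class name, its methods, and the whitespace tokenization are illustrative assumptions, and ranking is omitted entirely.

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: word -> set of URLs containing that word."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_document(self, url, text):
        # Naive tokenization; a real engine would normalize and stem terms.
        for word in text.lower().split():
            self._postings[word].add(url)

    def search(self, query):
        """Return URLs of documents containing every word in the query."""
        sets = [self._postings.get(w, set()) for w in query.lower().split()]
        return set.intersection(*sets) if sets else set()

# Usage:
idx = InvertedIndex()
idx.add_document("http://example.com/a", "incremental web crawling")
idx.add_document("http://example.com/b", "web servers and clients")
print(idx.search("web crawling"))  # {'http://example.com/a'}
```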
Given a set of start addresses and restriction rules, a crawler can retrieve every document reachable by recursively following links from the start-address documents, limiting the URL space by filtering out URLs that do not pass the specified crawl restriction rules. The primary application of the crawler is to build an index of a set of documents, so that the index can be searched by end users who want to locate documents matching certain search criteria.
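A minimal sketch of that crawl loop follows. The helpers `fetch(url)` (returns a document's text) and `extract_links(text)` (returns the URLs it references) are hypothetical, and the restriction rules are modeled as a simple predicate over URLs.

```python
from collections import deque

def crawl(seed_urls, passes_rules, fetch, extract_links):
    """Visit every document reachable from the seeds whose URL passes
    the restriction rules; return a {url: document_text} mapping."""
    visited = {}
    frontier = deque(u for u in seed_urls if passes_rules(u))
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue  # a URL may be queued more than once
        text = fetch(url)
        visited[url] = text
        for link in extract_links(text):
            if passes_rules(link) and link not in visited:
                frontier.append(link)
    return visited
```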
A crawler can retrieve documents from different stores. Although the primary store is the Web, a crawler can retrieve documents from a mail store, a database, or anything else that has textual content (textual content matters only when a document is processed for indexing; the crawler itself is not concerned about what type of document is being crawled).
Crawls typically are performed periodically to update the indexes with changed documents. Crawlers usually have no knowledge of the document store specifics. The only thing they can rely on is the last modified timestamp of the document, which is standard for most document stores, including HTTP servers, file servers, mail servers and databases. A problem with this approach is that, to ascertain the increment of the document set, the crawler must ask the corresponding server for each document whether the document's timestamp has changed. Since the percentage of documents that are unchanged between crawls is typically very high, it would be beneficial to minimize the number of requests the crawler makes to the document server to obtain the “increment” of the document set relative to the set of documents received during the previous crawl (i.e., to obtain new, modified and deleted documents). The present invention achieves this goal.
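The per-document timestamp check described above can be illustrated with an HTTP conditional request, which is how it would typically look against a Web store. This sketch is not from the patent itself; the `fetch_and_index` callback is a hypothetical stand-in for the crawler's processing step. Note that even when nothing has changed, the check still costs one round trip per document, which is precisely the overhead the invention aims to eliminate.

```python
import urllib.request
import urllib.error

def refresh(url, last_modified, fetch_and_index):
    """Re-fetch and re-index a document only if the server reports a change."""
    req = urllib.request.Request(url)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            fetch_and_index(url, resp.read())  # 200: document changed
            return resp.headers.get("Last-Modified", last_modified)
    except urllib.error.HTTPError as err:
        if err.code == 304:                    # 304: unchanged, skip
            return last_modified
        raise
```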
Further background information about Web crawlers is provided below, and may also be found in U.S. pending patent application Ser. No. 09/105,758, filed Jun. 26, 1998, "Method of Web Crawling Utilizing Crawl Numbers," and U.S. patent application Ser. No. 09/107,227, filed Jun. 30, 1998, now U.S. Pat. No. 6,483,794, "Synchronizing Crawler With Notification Source."
SUMMARY OF THE INVENTION
This invention provides an improved mechanism for maintaining a document store in a manner that facilitates an efficient determination of whether and how the document store has been “incremented” or modified from a prior state. For example, the invention could be used in a Web crawler application, mail server, directory service, or any system requiring indexing or one-way replication of a document store. The invention is particularly directed to a method and system for identifying documents in a document store that have changed, are new, or have been deleted.
The present invention utilizes a document store's ability to provide extra properties for each document and folder. Such extra properties include, e.g., the local commit time (LCT), the maximum local commit time (MLCT), and the deleted documents count (DDC). The crawler keeps track of local commit times per document URL. For folders, the crawler keeps the greater of the LCT and MLCT, as well as the DDC. It also keeps track of which URLs correspond to folders rather than documents, and records, for each URL, whether the document was produced by a store that supports these extended properties (LCT, MLCT, and DDC).

In an exemplary application of the present invention, a Web crawler creates an index of documents in a document store on a computer network, which may be an intranet, a LAN, or the Internet. In an initial crawl, the crawler creates a first full index for the document store. The first full crawl is based on a set of predefined "seed" URLs and crawl restrictions, and involves recursively retrieving each folder and document directly or indirectly linked to the seed URLs. In the process of creating the first full index, the crawler creates a History Table containing a list of URLs for each folder and document found in the first full crawl. The History Table also includes an LCT for each document and a DDC and LCT or MLCT for each folder. Flags are als
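The following is a minimal sketch of how the History Table properties could drive an incremental crawl, under assumptions drawn from the summary above: a folder whose MLCT and DDC both match the recorded values can be skipped wholesale, since none of its documents were added, modified, or deleted. The store API (`get_folder_props`, `list_folder`), the `reindex` callback, and the `HistoryEntry` layout are illustrative assumptions, not the patent's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class HistoryEntry:
    lct: str = ""            # document: local commit time (LCT)
    mlct: str = ""           # folder: maximum LCT of contained documents
    ddc: int = 0             # folder: deleted documents count
    is_folder: bool = False

def incremental_crawl(folder_url, history, store, reindex):
    """Revisit a folder only if its MLCT or DDC changed since the last crawl."""
    mlct, ddc = store.get_folder_props(folder_url)
    prev = history.get(folder_url)
    if prev is not None and prev.mlct == mlct and prev.ddc == ddc:
        return  # no additions, modifications, or deletions: skip the folder
    for child_url, lct, is_folder in store.list_folder(folder_url):
        if is_folder:
            incremental_crawl(child_url, history, store, reindex)
        elif child_url not in history or history[child_url].lct != lct:
            reindex(child_url)  # new or modified document
            history[child_url] = HistoryEntry(lct=lct)
    history[folder_url] = HistoryEntry(mlct=mlct, ddc=ddc, is_folder=True)
    # Detecting deleted documents (by comparing previously recorded URLs
    # against the current listing) is omitted from this sketch.
```

The payoff of the folder-level check is that, when most folders are unchanged between crawls, the crawler issues one request per folder instead of one per document.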
