Sending to a central indexing site meta data or signatures...

Electrical computers and digital processing systems: multicomput – Distributed data processing – Processing agent

Details

U.S. classifications: C709S217000, C709S223000, C707S793000
Type: Reexamination Certificate
Status: active
Patent number: 06516337

ABSTRACT:

TECHNICAL FIELD
The present invention relates generally to data distributed within a network, and more particularly to a method and system for generating and updating an index or catalog of object references for data distributed within a network such as the Internet.
BACKGROUND OF THE INVENTION
In the last several years, the Internet has experienced exponential growth in the number of Web sites and corresponding Web pages contained on the Internet. Countless individuals and corporations have established Web sites to market products, promote their firms, provide information on a specific topic, or merely provide access to the family's latest photographs for friends and relatives. This increase in Web sites and the corresponding information has placed vast amounts of information at the fingertips of millions of people throughout the world.
As a result of the rapid growth in Web sites on the Internet, it has become increasingly difficult to locate pertinent information in the sea of information available on the Internet. As will be understood by those skilled in the art, a search engine, such as Inktomi, Excite, Lycos, Infoseek, or FAST, is typically utilized to locate information on the Internet.
FIG. 1 illustrates a conventional search engine 10 including a router 12 that transmits and receives message packets between the Internet and a Web crawler server 14, index server 16, and Web server 18. As understood by those skilled in the art, a Web crawler or spider is a program that roams the Internet, accessing known Web pages, following the links in those pages, and parsing each Web page that is visited to thereby generate index information about each page. The index information from the spider is periodically transferred to the index server 16 to update the central index stored on the index server. The spider returns to each site on a regular basis, such as every several months, and once again visits Web pages at the site and follows links to other pages within the site to find new Web pages for indexing.
The index information generated by the spider is transferred to the index server 16 to update a catalog or central index stored on the index server. The central index is like a giant database containing information about every Web page the spider finds. Each time the spider visits a Web page, the central index is updated so that the central index contains accurate information about each Web page.
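By way of a rough illustration only (this is not the patent's implementation, and a production crawler is far more elaborate), the following Python sketch shows the crawl-parse-index loop described above: fetch a page, record which words appear on it in an in-memory "central index," and follow the page's links to find further pages.

```python
# Minimal illustrative spider: fetch pages, follow links, build a word -> URL index.
import re
from collections import defaultdict, deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed_url, max_pages=10):
    central_index = defaultdict(set)   # word -> set of URLs containing that word
    queue, seen, fetched = deque([seed_url]), {seed_url}, 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except (OSError, ValueError):
            continue                    # unreachable or malformed URL: skip it
        fetched += 1
        # "Parse" the page crudely: index its words and collect its links.
        for word in re.findall(r"[a-z0-9]+", html.lower()):
            central_index[word].add(url)
        for link in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return central_index
```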
The Web server 18 includes search software that processes search requests applied to the search engine 10. More specifically, the search software searches the millions of records contained in the central index in response to a search query transferred from a user's browser over the Internet and through the router 12 to the Web server 18. The search software finds matches to the search query and may rank them in terms of relevance according to predefined ranking algorithms, as will be understood by those skilled in the art.
As the number of Web sites increases at an exponential rate, it becomes increasingly difficult for the conventional search engine 10 to maintain an up-to-date central index. This is true because it takes time for the spider to access each Web page, so as the number of Web pages increases it accordingly takes the spider more time to index the Internet. In other words, as more Web pages are added, the spider must visit these new Web pages and add them to the central index. While the spider is busy indexing these new Web pages, it cannot revisit old Web pages and update portions of the central index corresponding to these pages. Thus, portions of the central index become dated, and this problem is only exacerbated by the rapid addition of web sites on the Internet.
The method of indexing utilized in the conventional search engine 10 has inherent shortcomings in addition to the inability to keep the central index current as the Internet grows. For example, the spider only indexes known Web sites. Typically, the spider starts with a historical list of sites, such as a server list, and follows the list of the most popular sites to find more pages to add to the central index. Thus, unless your Web site is contained in the historical list or is linked to a site in the historical list, your site will not be indexed. While most search engines accept submissions of sites for indexing, even upon such a submission the site may not be indexed in a timely manner, if at all. Another shortcoming of the conventional search engine 10 is the necessity to lock records in the central index stored on the index server 16 when these records are being updated, thus making the records inaccessible to search queries being processed by the search program while the records are locked.
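The locking shortcoming can be illustrated with a minimal sketch, assuming a single mutual-exclusion lock guards the index records: while an update holds the lock, concurrent search queries must wait until the update completes.

```python
# Sketch of the record-locking problem: updates and queries contend for the
# same lock, so queries block while records are being updated.
import threading

index_lock = threading.Lock()
central_index = {}                        # term -> set of URLs

def update_records(new_entries):
    with index_lock:                      # the writer locks the records...
        central_index.update(new_entries)

def query(term):
    with index_lock:                      # ...so readers wait until the update finishes
        return central_index.get(term, set())
```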
Another inherent shortcoming of the method of indexing utilized in the conventional search engine 10 is that only Standard Generalized Markup Language (SGML) information is utilized in generating the central index. In other words, the spider accesses or renders a respective Web page and parses only the SGML information in that Web page in generating the corresponding portion of the central index. As will be understood by those skilled in the art, due to the format of an SGML Web page, certain types of information may not be placed in the SGML document. For example, conceptual information such as the intended audience's demographics and geographic information may not be placed in an assigned tag in the SGML document. One skilled in the art will appreciate that such information would be extremely helpful in generating a more accurate index. For example, a person might want to search in a specific geographical area, or within a certain industry. By way of example, assume a person is searching for a red barn manufacturer in a specific geographic area. Because SGML pages have no standard tags for identifying industry type or geographical area, the spider on the server 14 in the conventional search engine 10 does not have such information to utilize in generating the central index. As a result, the conventional search engine 10 would typically list not only manufacturers but would also list the location of picturesque red barns in New England that are of no interest to the searcher.
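The gap can be made concrete with a small, purely hypothetical sketch: the meta tag names "geographic-area" and "industry" below are invented, because no such standard tags exist, which is exactly the shortcoming described. A parser looking for them in an ordinary page comes back empty-handed.

```python
# Hypothetical illustration: an ordinary markup page carries no standard tag
# for geography or industry, so an indexer looking for such metadata finds nothing.
from html.parser import HTMLParser

class MetaScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # "geographic-area" and "industry" are invented tag names used only
        # to illustrate the missing metadata.
        if tag == "meta" and attrs.get("name") in ("geographic-area", "industry"):
            self.meta[attrs["name"]] = attrs.get("content")

page = '<html><head><title>Red Barn Builders</title></head><body>We build barns.</body></html>'
scanner = MetaScanner()
scanner.feed(page)
print(scanner.meta)   # {} -- nothing for the indexer to use
```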
There are four methods currently used to update centrally stored data or a central database from remotely stored data: 1) all of the remotely stored data can be copied over the network to the central location; 2) only those files or objects that have changed are copied to the central location; 3) a transaction log can be kept at the remote location, transmitted to the central location, and used to update the central location's copy of the data or database; and 4) a differential can be created by comparing the remotely stored historic copy with the current remotely stored copy; this differential can then be sent to the central location and incorporated into the centrally stored historic copy of the data to create a copy of the current remotely stored copy. All of these methods rely on duplicating the remote data when in many cases the only thing needed is a reference or a link to the remote data.
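A minimal sketch of method (4), not drawn from the patent itself: the remote site computes a differential between its historic and current copies, ships only that differential, and the central site applies it to its stored historic copy to reproduce the current copy.

```python
# Differential update: compute edits remotely, apply them centrally.
import difflib

def make_differential(historic, current):
    # List of (operation, start, end, replacement-text) edits against the historic copy.
    matcher = difflib.SequenceMatcher(None, historic, current)
    return [(op, i1, i2, current[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal"]

def apply_differential(historic, diff):
    # Replay edits right-to-left so earlier indices stay valid.
    result = list(historic)
    for op, i1, i2, replacement in sorted(diff, key=lambda edit: edit[1], reverse=True):
        result[i1:i2] = replacement
    return "".join(result)

historic = "barn catalog v1: red barns"
current = "barn catalog v2: red and white barns"
assert apply_differential(historic, make_differential(historic, current)) == current
```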
Some Internet search engines, such as Infoseek, have proposed a distributed search engine approach to assist their spidering programs in finding and indexing new web pages. Infoseek has proposed that each web site on the Internet create a local file named “robots1.txt” containing a list of all files on the web site that have been modified within the last twenty-four hours. A spidering program would then download this file and from the file determine which pages on the web site should be accessed and reindexed. Files that have not been modified will not be indexed, saving bandwidth on the Internet otherwise consumed by the spidering program and thus increasing the efficiency of the spidering program. Additional local files could also be created, indicating files that had changed in the last seven days or thirty days or containing a list of all files on the site that are indexable. Under th
