Reexamination Certificate
1997-12-19
2001-07-31
Homere, Jean R. (Department: 2177)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000
active
06269362
ABSTRACT:
FIELD OF THE INVENTION
This invention relates generally to clipping services, and more particularly to automatically monitoring electronically stored documents using queries.
BACKGROUND OF THE INVENTION
For many organizations and institutions, it is common to use a clipping service to monitor topics of interest in conventional print media. For example, companies often employ a clipping service to monitor what the print media is publishing about a company or its products.
More recently, clipping services have started to monitor electronic media as well. In a simple semi-automated monitoring system, queries that define what is to be monitored are periodically submitted to one or more Web search engines. In order to get a good “recall,” the queries may be constructed to retrieve as many relevant pages as possible.
One widely used electronic publishing medium is the Internet's World-Wide-Web (the “Web”). A service called eWatch offers to monitor documents on the Web; see “http://www.ewatch.com.” The eWatch service claims to monitor some 40,000 public bulletin boards and preselected Web sites for some four hundred of the world's largest corporations. There, a key first step is to identify which sites are relevant to a particular client. Because Web pages at the selected sites are retrieved on a daily basis to check whether anything has changed, this could become quite expensive when the number of monitored sites is large.
Dartmouth University offers a Web clipping service called the Informant at “http://informant.dartmouth.edu/.” This free service only monitors the top ten relevant pages for a particular query plus any Web pages at a preselected set (a maximum of 35 pages per user) of Universal Resource Locators (URLs). The service computes a hash value for each current page being monitored, and compares the hash value with the hash value of a previous version of the page. If the hash values are different, the content of the Web page has probably changed. The service is limited in the number of pages that are monitored, and even trivial changes to a Web page will change the hash value so that the Web page is flagged as “interesting.”
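The hash comparison described above can be illustrated with a minimal Python sketch. It is not the Informant's actual implementation: the choice of SHA-256, the function names `page_hash` and `changed_pages`, and the example URL are all assumptions made for illustration. The sketch also shows the limitation noted above, namely that a trivial edit changes the hash just as a substantial one does.

```python
import hashlib


def page_hash(content: str) -> str:
    """Return a hex digest of the page content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def changed_pages(previous_hashes: dict[str, str],
                  current_pages: dict[str, str]) -> list[str]:
    """Return URLs whose current hash differs from the stored hash.

    previous_hashes maps URL -> hash of the earlier version.
    current_pages maps URL -> current page content.
    A URL with no stored hash is also reported as changed.
    """
    flagged = []
    for url, content in current_pages.items():
        if previous_hashes.get(url) != page_hash(content):
            flagged.append(url)
    return flagged


# Hypothetical example: one extra space is enough to flag the page.
old = {"http://example.com/a": page_hash("Product news for today.")}
new = {"http://example.com/a": "Product news for today. "}
flagged = changed_pages(old, new)
```

Because any byte-level difference produces a different digest, this approach cannot distinguish a trivial change from a substantial one.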
In general, monitoring pre-selected sites is relatively easy; however, monitoring the entire Web, or even a large portion of the Web, is a much more difficult problem. The number of Web sites is easily counted in the millions, with a large proportion of those sites having pages that change on a frequent basis. Active Web “publishers” may change pages on a daily basis, in many cases trivially.
Therefore, the output from the search engine can be quite large. Because humans will eventually have to read and analyze the output, it is desirable to mechanically filter the output as much as possible. In particular, it is necessary to eliminate pages that have not changed or have not substantially changed since the last retrieval.
SUMMARY OF THE INVENTION
Provided is a computerized method for monitoring the content of documents. A set of documents is stored in memories of server computers. The server computers can be connected to each other by a network such as the Internet.
Entries are generated in a search engine for each document of the set. The search engine is also connected to the Internet. The entries are in the form of a full word index of the set of documents. The search engine also maintains a first abstract for each document that is indexed. The abstract is highly dependent on the content of each document. For example, the abstract is in the form of a sketch or a feature vector.
Periodically, a query is submitted to the search engine. The query locates a result set of documents that satisfy the query. A second abstract is generated for each document member of the result set. The first and second abstracts are compared to identify documents that have changed between the time the set of documents was indexed and the time the result set is generated.
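The comparison of first and second abstracts can be illustrated with a short Python sketch. The text above says only that an abstract may be a sketch or a feature vector; the specific construction used here (word shingles reduced to a fixed-size min-hash sketch), the 0.9 threshold, and all function names are assumptions chosen for illustration, not the method claimed in the patent.

```python
import hashlib


def shingles(text: str, w: int = 3) -> set[tuple[str, ...]]:
    """Return the set of w-word shingles of the text (w is an assumed parameter)."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def _hash(seed: int, shingle: tuple[str, ...]) -> int:
    """Deterministic 64-bit hash of a shingle under a given seed."""
    data = f"{seed}:{' '.join(shingle)}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def sketch(text: str, size: int = 64, w: int = 3) -> list[int]:
    """Abstract of a document: the minimum hash of its shingles per seed."""
    sh = shingles(text, w)
    if not sh:
        return []
    return [min(_hash(seed, s) for s in sh) for seed in range(size)]


def resemblance(a: list[int], b: list[int]) -> float:
    """Fraction of sketch positions that agree (an estimate of similarity)."""
    if a == b:
        return 1.0
    if not a or not b or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def changed_documents(first_abstracts: dict[str, list[int]],
                      result_set: dict[str, str],
                      threshold: float = 0.9) -> list[str]:
    """Return URLs in the result set whose second abstract differs
    substantially from the stored first abstract, or that have none."""
    changed = []
    for url, text in result_set.items():
        first = first_abstracts.get(url)
        if first is None:
            changed.append(url)
            continue
        second = sketch(text, size=len(first))
        if resemblance(first, second) < threshold:
            changed.append(url)
    return changed


# Hypothetical usage: abstracts stored at indexing time, compared at query time.
indexed = {"u1": sketch("the quick brown fox jumps over the lazy dog"),
           "u2": sketch("quarterly results were announced this morning")}
results = {"u1": "the quick brown fox jumps over the lazy dog",
           "u2": "an entirely different article about unrelated product recalls"}
changed = changed_documents(indexed, results)
```

Unlike a single hash of the whole page, a similarity score of this kind lets a threshold separate trivial edits from substantial ones; the right threshold would depend on the documents being monitored.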
Broder Andrei Zary
Glassman Steven Charles
Manasse Mark Steven
Alta Vista Company
Fenwick & West LLP
Homere Jean R.
Robinson Greta L