Duplicate document detection in a web crawler system

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000

active

07627613

ABSTRACT:
Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included in or excluded from the new set of documents based on a query-independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
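
The flow described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the patented method: the abstract does not say how "same content" is determined, which query-independent metric is used, or what the predefined conditions are. The sketch assumes an exact-match SHA-256 content fingerprint, a numeric score supplied by the caller as the query-independent metric, a hypothetical cap (max_size) on how many duplicates a set retains, and "highest score wins" as the representative rule. The names DupSet and DupServer are invented for the example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DupSet:
    """Documents sharing one content fingerprint (hypothetical structure)."""
    scores: Dict[str, float] = field(default_factory=dict)  # url -> metric
    representative: Optional[str] = None


class DupServer:
    """Tracks duplicate sets as newly crawled documents arrive."""

    def __init__(self, max_size: int = 3) -> None:
        # Assumed condition: each set retains at most max_size documents.
        self.max_size = max_size
        self.sets: Dict[str, DupSet] = {}  # fingerprint -> DupSet

    @staticmethod
    def fingerprint(content: str) -> str:
        # Assumption: "same content" means byte-identical text.
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def add(self, url: str, content: str, score: float) -> str:
        """Merge a newly crawled document; return the set's representative."""
        fp = self.fingerprint(content)
        # Identify the existing set with the same content, if any.
        dup_set = self.sets.setdefault(fp, DupSet())
        # Merge the new document into that set.
        dup_set.scores[url] = score
        # Include or exclude documents by the query-independent score:
        # keep the top max_size, breaking ties by URL for determinism.
        ranked = sorted(dup_set.scores, key=lambda u: (-dup_set.scores[u], u))
        kept = ranked[: self.max_size]
        dup_set.scores = {u: dup_set.scores[u] for u in kept}
        # Assumed condition: the highest-scoring document represents the set.
        dup_set.representative = kept[0]
        return dup_set.representative


if __name__ == "__main__":
    server = DupServer(max_size=2)
    server.add("http://a.example/page", "same text", 0.2)
    server.add("http://b.example/page", "same text", 0.9)
    print(server.add("http://c.example/page", "same text", 0.5))
```

With max_size=2, the third call leaves only the b.example and c.example entries in the set, and b.example remains the representative because it has the highest score. A production crawler would more likely use near-duplicate fingerprints and persistent storage; both are outside what the abstract states.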

REFERENCES:
patent: 6119124 (2000-09-01), Broder et al.
patent: 6278992 (2001-08-01), Curtis et al.
patent: 6292880 (2001-09-01), Mattis et al.
patent: 6547829 (2003-04-01), Meyerzon et al.
patent: 6631369 (2003-10-01), Meyerzon et al.
patent: 6675159 (2004-01-01), Lin et al.
patent: 6687696 (2004-02-01), Hofmann et al.
patent: 6711568 (2004-03-01), Bharat et al.
patent: 6847967 (2005-01-01), Takano
patent: 6947930 (2005-09-01), Anick et al.
patent: 6976207 (2005-12-01), Rujan et al.
patent: 6978419 (2005-12-01), Kantrowitz
patent: 7080073 (2006-07-01), Jiang et al.
patent: 2002/0038350 (2002-03-01), Lambert et al.
patent: 2002/0103809 (2002-08-01), Starzl et al.
patent: 2002/0138509 (2002-09-01), Burrows et al.
patent: 2003/0014399 (2003-01-01), Hansen et al.
patent: 2003/0130994 (2003-07-01), Singh et al.
patent: 2003/0195883 (2003-10-01), Mojsilovic et al.
patent: 2004/0210575 (2004-10-01), Bean et al.
patent: 2005/0027685 (2005-02-01), Kamvar et al.
“Web search services”, Wang et al., University of Science and Technology, Hong Kong, Issue Date: 2002, Series/Report No. Computer Science Technical Report, HKUST-CS02-26.
“Structure of the Internet?”, Tsoi, Faculty of Informatics Papers, University of Wollongong, 2001.
“Extending SDARTS: extracting metadata from web databases and interfacing with the open archives initiative”, by Ipeirotis et al, Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries, Portland, Oregon, pp. 162-170, 2002.
“Evaluating document clustering for interactive information retrieval”, by Leuski, Proceedings of the tenth international conference on Information and knowledge management, Atlanta, Georgia, pp. 33-40, 2001, ISBN:1-58113-436-3.
Kelly, T. and Mogul, J., “Aliasing on the World Wide Web: Prevalence and Performance Implications,” Proceedings of the 11th International World Wide Web Conference, May 2002.
Smith, B. et al., “Exploiting result equivalence in caching dynamic web content,” USENIX Symposium on Internet Technology and Systems, Boulder, Colorado, USA, Oct. 1999. USENIX Association.
Henzinger, M. et al., “Challenges in Web Search Engines,” Internet Mathematics, vol. 1, No. 1: 115-126, 2002.
Brin, S. et al., “Copy detection mechanisms for digital documents,” Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 398-409, 1995.
Broder, A.Z., “On resemblance and containment of documents,” Proceedings of Compression and Complexity of Sequences, IEEE Computer Society, pp. 21-29, 1997.
Cho, J., et al., “Finding replicated web collections,” Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 355-366, 2000.
Kleinberg, J., “Authoritative sources in a hyperlinked environment,” Journal of the ACM, vol. 46, No. 5, Sep. 1999, pp. 604-632.
Bharat, K. and Broder, A., “Mirror, mirror on the Web: A Study of host pairs with replicated content,” Proceedings of the 8th WWW Conf., May 1999.
Bharat, K. et al., “A comparison of techniques to find mirrored hosts on the WWW,” Proceedings Workshop on Organizing Web Space at 4th ACM Conference on Digital Libraries, Aug. 1999.
Shivakumar, N. and Garcia-Molina, H., “Finding near-replicas of documents on the web,” in World Wide Web and Databases, International Workshop WebDB'98, Valencia, Spain, pp. 204-212, Mar. 1998.
