Reexamination Certificate
2003-07-03
2009-12-01
Vo, Tim T. (Department: 2168)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000
active
07627613
ABSTRACT:
Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included and excluded from the new set of documents based on a query independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
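The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the exact-match SHA-256 fingerprint, the numeric `score` standing in for the query independent metric, the `max_set_size` cap, and the "highest score wins" rule for choosing the representative are all assumptions made for illustration, as are the names `DuplicateDetector` and `content_fingerprint`.

```python
import hashlib
from typing import Dict


def content_fingerprint(content: str) -> str:
    """Hypothetical fingerprint: documents with identical content share it."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class DuplicateDetector:
    """Tracks sets of same-content documents, keyed by content fingerprint."""

    def __init__(self, max_set_size: int = 3) -> None:
        # fingerprint -> {url: query-independent score}
        self.sets: Dict[str, Dict[str, float]] = {}
        # Assumed cap on how many duplicates are retained per set.
        self.max_set_size = max_set_size

    def add(self, url: str, content: str, score: float) -> str:
        """Merge a newly crawled document into its duplicate set.

        Returns the URL of the representative document for the merged set.
        """
        fp = content_fingerprint(content)

        # Identify the existing set (if any) sharing this content and merge
        # the new document into it.
        members = dict(self.sets.get(fp, {}))
        members[url] = score

        # Include or exclude documents based on the query-independent score:
        # keep the top-scoring ones (ties broken by URL for determinism).
        ranked = sorted(members.items(), key=lambda kv: (-kv[1], kv[0]))
        kept = ranked[: self.max_set_size]
        self.sets[fp] = dict(kept)

        # Assumed "predefined condition": the highest-scoring document
        # represents the set.
        return kept[0][0]


if __name__ == "__main__":
    detector = DuplicateDetector(max_set_size=2)
    print(detector.add("http://a.example/page", "same text", 0.2))
    print(detector.add("http://b.example/page", "same text", 0.9))
    print(detector.add("http://c.example/page", "same text", 0.5))
```

In this sketch the third call drops the lowest-scoring URL from the set because of the assumed size cap, while the representative stays the highest-scoring document. A real crawler would likely use different conditions and persistent storage.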
REFERENCES:
patent: 6119124 (2000-09-01), Broder et al.
patent: 6278992 (2001-08-01), Curtis et al.
patent: 6292880 (2001-09-01), Mattis et al.
patent: 6547829 (2003-04-01), Meyerzon et al.
patent: 6631369 (2003-10-01), Meyerzon et al.
patent: 6675159 (2004-01-01), Lin et al.
patent: 6687696 (2004-02-01), Hofmann et al.
patent: 6711568 (2004-03-01), Bharat et al.
patent: 6847967 (2005-01-01), Takano
patent: 6947930 (2005-09-01), Anick et al.
patent: 6976207 (2005-12-01), Rujan et al.
patent: 6978419 (2005-12-01), Kantrowitz
patent: 7080073 (2006-07-01), Jiang et al.
patent: 2002/0038350 (2002-03-01), Lambert et al.
patent: 2002/0103809 (2002-08-01), Starzl et al.
patent: 2002/0138509 (2002-09-01), Burrows et al.
patent: 2003/0014399 (2003-01-01), Hansen et al.
patent: 2003/0130994 (2003-07-01), Singh et al.
patent: 2003/0195883 (2003-10-01), Mojsilovic et al.
patent: 2004/0210575 (2004-10-01), Bean et al.
patent: 2005/0027685 (2005-02-01), Kamvar et al.
“Web search services”, Wang et al., University of Science and Technology, Hong Kong, Issue Date: 2002, Series/Report No. Computer Science Technical Report, HKUST-CS02-26.
“Structure of the Internet?”, Tsoi, Faculty of Informatics Papers, University of Wollongong, 2001.
“Extending SDARTS: extracting metadata from web databases and interfacing with the open archives initiative”, by Ipeirotis et al, Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries table of contents, Portland, Oregon, pp. 162-170, 2002.
“Evaluating document clustering for interactive information retrieval”, by Leuski, Proceedings of the tenth international conference on Information and knowledge management, Atlanta, Georgia, pp. 33-40, 2001, ISBN:1-58113-436-3.
Kelly, T. and Mogul, J., “Aliasing on the World Wide Web: Prevalence and Performance Implications,” Proceedings of the 11th International World Wide Web Conference, May 2002.
Smith, B. et al., “Exploiting result equivalence in caching dynamic web content,” USENIX Symposium on Internet Technology and Systems, Boulder, Colorado, USA, Oct. 1999. USENIX Association.
Henzinger, M. et al., “Challenges in Web Search Engines,” Internet Mathematics, vol. 1, No. 1: 115-126, 2002.
Brin, S. et al., “Copy detection mechanisms for digital documents,” Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 398-409, 1995.
Broder, A.Z., “On resemblance and containment of documents,” Proceedings of Compression and Complexity of Sequences, IEEE Computer Society, pp. 21-29, 1997.
Cho, J., et al., “Finding replicated web collections,” Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 355-366, 2000.
Kleinberg, J., “Authoritative sources in a hyperlinked environment,” Journal of the ACM, vol. 46, No. 5, Sep. 1999, pp. 604-632.
Bharat, K. and Broder, A., “Mirror, mirror on the Web: A Study of host pairs with replicated content,” Proceedings of the 8th WWW Conf., May 1999.
Bharat, K. et al., “A comparison of techniques to find mirrored hosts on the WWW,” Proceedings Workshop on Organizing Web Space at 4th ACM Conference on Digital Libraries, Aug. 1999.
Shivakumar, N. and Garcia-Molina, H., “Finding near-replicas of documents on the web,” in World Wide Web and Databases, International Workshop WebDB'98, Valencia, Spain, pp. 204-212, Mar. 1998.
Dean Jeffrey A.
Dulitz Daniel
Ghemawat Sanjay
Verstak Alexandre A.
Google Inc.
Morgan & Lewis & Bockius, LLP
Morrison Jay A
Vo Tim T.