Representative document selection for sets of duplicate...

Data processing: database and file management or data structures – Database and file access – Preparing data for information retrieval

Reexamination Certificate

Details

C707S737000, C707S758000

Reexamination Certificate

active

07984054

ABSTRACT:
Duplicate documents are detected in a web crawler system. When a newly crawled document is received, the set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the identified set of documents is merged into information identifying a new set of documents. Duplicate documents are included in or excluded from the new set of documents based on a query-independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
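The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented method: the abstract does not say how shared content is detected, which query-independent metric is used, or what the predefined conditions are. The sketch assumes an exact-content SHA-256 fingerprint, a single numeric score per document, a cap on set size (`max_members`), and "highest score wins" as the selection condition. The names `DuplicateIndex`, `DuplicateSet`, and `add` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DuplicateSet:
    # URL -> query-independent score for each document kept in the set.
    members: Dict[str, float] = field(default_factory=dict)
    representative: Optional[str] = None


class DuplicateIndex:
    """Hypothetical index of duplicate sets, keyed by content fingerprint."""

    def __init__(self, max_members: int = 3):
        # Assumption: a fixed cap decides which duplicates are excluded.
        self.max_members = max_members
        self.sets: Dict[str, DuplicateSet] = {}

    @staticmethod
    def fingerprint(content: str) -> str:
        # Assumption: "same content" means byte-identical text.
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def add(self, url: str, content: str, score: float) -> DuplicateSet:
        key = self.fingerprint(content)
        # Identify the existing set sharing this content, if any.
        existing = self.sets.get(key, DuplicateSet())
        # Merge the new document's identity into the set.
        merged = dict(existing.members)
        merged[url] = score
        # Include or exclude members by score (ties broken by URL).
        ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
        kept = ranked[: self.max_members]
        # Assumed condition: the highest-scoring member represents the set.
        new_set = DuplicateSet(members=dict(kept), representative=kept[0][0])
        self.sets[key] = new_set
        return new_set


index = DuplicateIndex(max_members=2)
index.add("http://a.example/x", "same body", 0.2)
index.add("http://b.example/x", "same body", 0.9)
result = index.add("http://c.example/x", "same body", 0.5)
print(result.representative, sorted(result.members))
```

With a cap of two, the lowest-scoring duplicate is dropped when the third arrives, and the representative is re-evaluated on every merge rather than fixed at first sight.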

REFERENCES:
patent: 6119124 (2000-09-01), Broder et al.
patent: 6278992 (2001-08-01), Curtis et al.
patent: 6292880 (2001-09-01), Mattis et al.
patent: 6547829 (2003-04-01), Meyerzon et al.
patent: 6631369 (2003-10-01), Meyerzon et al.
patent: 6675159 (2004-01-01), Lin et al.
patent: 6687696 (2004-02-01), Hofmann et al.
patent: 6711568 (2004-03-01), Bharat et al.
patent: 6847967 (2005-01-01), Takano
patent: 6947930 (2005-09-01), Anick et al.
patent: 6976207 (2005-12-01), Rujan et al.
patent: 6978419 (2005-12-01), Kantrowitz
patent: 7080073 (2006-07-01), Jiang et al.
patent: 2002/0038350 (2002-03-01), Lambert et al.
patent: 2002/0103809 (2002-08-01), Starzl et al.
patent: 2002/0138509 (2002-09-01), Burrows et al.
patent: 2003/0014399 (2003-01-01), Hansen et al.
patent: 2003/0130994 (2003-07-01), Singh et al.
patent: 2003/0195883 (2003-10-01), Mojsilovic et al.
patent: 2004/0210575 (2004-10-01), Bean et al.
patent: 2005/0027685 (2005-02-01), Kamvar et al.
S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proc. of the 8th ACM SIGKDD Intl. Conference on Knowledge Discovery and Data Mining, pp. 269-278, Edmonton, Canada, Jul. 2002.
A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe. Collection Statistics for Fast Duplicate Document Detection, ACM Transactions on Information Systems. vol. 20, No. 2, pp. 171-191, Apr. 2002.
Bharat, K., et al., “Mirror, mirror on the Web: A Study of host pairs with replicated content,” Proceedings of the 8th WWW Conf., May 1999.
Bharat, K. et al., “A comparison of techniques to find mirrored hosts on the WWW,” Proceedings Workshop on Organizing Web Space at 4th ACM Conference on Digital Libraries, Aug. 1999.
Brin, S. et al., “Copy detection mechanisms for digital documents,” Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 398-409, 1995.
Broder, A.Z., “On resemblance and containment of documents,” Proceedings of Compression and Complexity of Sequences, IEEE Computer Society, pp. 21-29, 1997.
Cho, J., et al., “Finding replicated web collections,” Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 355-366, 2000.
Henzinger, M. et al., “Challenges in Web Search Engines,” Internet Mathematics, vol. 1, No. 1: 115-126, 2002.
Ipeirotis et al., “Extending SDARTS: Extracting metadata from web databases and interfacing with the open archives initiative,” Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries Table of Contents, Portland, Oregon, 2002, pp. 162-170.
Kelly, T. et al., “Aliasing on the World Wide Web: Prevalence and Performance Implications,” Proceedings of the 11th International World Wide Web Conference, May 2002.
Kleinberg, J., “Authoritative sources in a hyperlinked environment,” Journal of the ACM, vol. 46, No. 5, Sep. 1999, pp. 604-632.
Leuski, “Evaluating Document Clustering for Interactive Information Retrieval,” Proceedings of the 10th Int'l Conference on Information and Knowledge, Atlanta, Georgia, 2001, pp. 33-40.
Shivakumar, N., et al., “Finding near-replicas of documents on the web,” in World Wide Web and Databases, International Workshop WebDB'98, Valencia, Spain, pp. 204-212, Mar. 1998.
Smith, B. et al., “Exploiting result equivalence in caching dynamic web content,” USENIX Symposium on Internet Technology and Systems, Boulder, Colorado, USA, Oct. 1999. USENIX Association.
Tsoi, “Structure of the Internet?” Faculty of Informatics Papers, Univ. of Wollongong, 2001.
Wang, et al., “Web Search Services,” Univ. of Science & Technology, Hong Kong, Issue Date: 2002, Series/Report No. Computer Science Technical Report, HKUST-CS02-26.
