Method for clustering closely resembling data objects

Data processing: database and file management or data structures – Database design – Data structure types

Patent

Details

707/3, 707/2, 707/5, G06F 17/30

Patent

active

061191248

ABSTRACT:
A computer-implemented method determines the resemblance of data objects such as Web pages. Each data object is partitioned into a sequence of tokens. The tokens are grouped into overlapping sets to form shingles. Each shingle is represented by a unique identification element encoded as a fingerprint. For each of a plurality of pseudo-random permutations of the set of all fingerprints, a minimum element is selected from the image of the document's set of fingerprints under that permutation; the selected minima form a sketch of the data object. The sketches characterize the resemblance of the data objects. The sketches can be further partitioned into a plurality of groups. Each group is fingerprinted to form a feature. Data objects that share more than a certain number of features are estimated to be nearly identical.
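The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation, and every name and parameter in it is an assumption made for illustration: whitespace tokenization, a shingle width of 4, 84 permutations, and groups of 14 sketch elements. A truncated BLAKE2b hash stands in for the fingerprinting scheme, and the maps x -> (a*x + b) mod P stand in for the pseudo-random permutations.

```python
import hashlib
import random

# Assumption: a Mersenne prime modulus. Fingerprints below are < 2**56 < P,
# so x -> (a*x + b) % P with a != 0 is a true permutation of 0..P-1.
P = (1 << 61) - 1

def shingles(text, w=4):
    """Split text into tokens and return the set of overlapping w-token shingles."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + w]) for i in range(max(len(tokens) - w + 1, 1))}

def fingerprint(data):
    """56-bit fingerprint of a string (BLAKE2b used here as a stand-in)."""
    digest = hashlib.blake2b(data.encode("utf-8"), digest_size=7).digest()
    return int.from_bytes(digest, "big")

def make_permutations(n, seed=0):
    """Draw n pseudo-random (a, b) pairs, each defining one permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(n)]

def sketch(text, perms, w=4):
    """For each permutation, keep the minimum permuted fingerprint."""
    fps = {fingerprint(s) for s in shingles(text, w)}
    return [min((a * f + b) % P for f in fps) for a, b in perms]

def estimate_resemblance(s1, s2):
    """Fraction of sketch positions on which two documents agree."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)

def features(sk, group_size=14):
    """Partition a sketch into groups and fingerprint each group."""
    return [fingerprint(",".join(map(str, sk[i:i + group_size])))
            for i in range(0, len(sk), group_size)]

def nearly_identical(f1, f2, threshold=1):
    """True if the documents share more than `threshold` features."""
    return sum(x == y for x, y in zip(f1, f2)) > threshold

if __name__ == "__main__":
    perms = make_permutations(84)
    a = "the quick brown fox jumps over the lazy dog near the quiet river bank"
    b = "the quick brown fox jumps over the lazy dog near the noisy river bank"
    sa, sb = sketch(a, perms), sketch(b, perms)
    print(estimate_resemblance(sa, sb))
    print(nearly_identical(features(sa), features(sb)))
```

Because a feature matches only when every sketch element in its group matches, comparing features is a much stricter test than comparing sketches, which is why it suits detecting near-identical rather than merely similar objects.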

REFERENCES:
patent: 5488725 (1996-01-01), Turtle
patent: 5675819 (1997-10-01), Schuetze
patent: 5724571 (1998-03-01), Woods
patent: 5819258 (1998-10-01), Vaithyanathan
patent: 5857179 (1999-01-01), Vaithyanathan
patent: 5909677 (1999-06-01), Broder
patent: 5937084 (1999-08-01), Crabtree
Brin et al.; Copy Detection Mechanisms for Digital Documents; Department of Computer Science; www.db.stanford.edu/~sergey/copy.html.
Broder; Some Applications of Rabin's fingerprinting method; Methods in Communications, Security, and Computer Science; pp. 1-10; 1993.
Carter et al.; Universal Classes of Hash Functions; Journal of Computer and System Science; vol. 18; pp. 143-154; 1979.
Heintze et al.; Scalable Document Fingerprinting (Extended Abstract) found @ www.cs.cmu.edu/afs/cs/user
ch/www/koala/main.htm on Sep. 1997.
Karp et al.; The Bit Vector Intersection Problem; Proceedings 36th Annual Symposium of Computer Science, IEEE Computer Society Press, Oct. 23-25, 1995; pp. 621-634.
Shivakumar et al.; Building a Scalable and Accurate Copy Detection Mechanism; Proceedings of 1st ACM Conference on Digital Libraries (DL'96), 1996.
Shivakumar et al.; SCAM: A Copy Detection Mechanism for Digital Documents; Proceedings of 2nd International Conference in Theory and Practice of Digital Libraries; 1995.
