Method for determining the resemblance of...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000

active

06230155

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to the field of comparing data files residing on one or more computer systems, and more particularly to the field of determining the resemblance of documents.
BACKGROUND OF THE INVENTION
As is known in the art, computer users create and store data files as documents in computer systems. As is also known, these same computer users, for a variety of reasons, are often interested in determining the similarity of two documents.
One approach, for example, is to record samples of each document, and to declare documents to be similar if they have many samples in common. The samples could be sequences of a fixed number of any convenient units, such as English words. Such a method requires storing samples whose total size is proportional to the length of the documents.
Another approach to this problem is based on single word “chunks.” Such a method employs a registration server that maintains registered documents against which new documents can be checked for overlap. The method detects copies by comparing the word occurrence frequencies of the new document against those of registered documents.
What is needed is a method to determine whether two documents have the same content except for modifications such as formatting, minor corrections, web-master signature, logo, etc., using small sketches of the document, rather than the full text.
SUMMARY OF THE INVENTION
In accordance with the present invention, a method of determining the resemblance of a plurality of documents stored on a computer network includes loading a first document into a random access memory (RAM), loading a second document into the RAM, reducing the first document into a first sequence of tokens, reducing the second document into a second sequence of tokens, converting the first sequence of tokens to a first (multi)set of shingles, converting the second sequence of tokens to a second (multi)set of shingles, determining a first fixed size sketch of the first (multi)set of shingles, determining a second fixed size sketch of the second (multi)set of shingles, and comparing the first sketch and the second sketch. With such a method, the resemblance of two documents is computed using a sketch of each document. The sketches may be computed fairly fast, and given two sketches, the resemblance of the corresponding documents can be computed in time linear in the size of the sketches.
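The steps summarized above (tokens, shingles, fixed size sketch, comparison) can be illustrated with a minimal Python sketch. The summary does not say how a sketch is derived from the shingles, so this example assumes one common choice: keep the smallest fingerprints of the shingle set. The function names, the shingle width `w=4`, the sketch size of 100, and the use of BLAKE2b as the fingerprint function are all illustrative assumptions, not details taken from the patent text. It also treats shingles as a plain set rather than a multiset.

```python
import hashlib
import re


def tokenize(text):
    """Reduce a document to a sequence of lowercase word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(tokens, w=4):
    """Return the set of contiguous w-token subsequences (shingles)."""
    if len(tokens) < w:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def fingerprint(shingle):
    """Map a shingle to a 64-bit integer; BLAKE2b is a stand-in fingerprint."""
    data = " ".join(shingle).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(text, w=4, size=100):
    """Fixed size sketch: at most `size` smallest fingerprints, sorted ascending."""
    prints = {fingerprint(s) for s in shingles(tokenize(text), w)}
    return sorted(prints)[:size]


def resemblance(sketch_a, sketch_b, size=100):
    """Estimate resemblance by merging two sorted sketches in a single pass.

    Walks the `size` smallest values of the union of both sketches and
    returns the fraction of those values present in both.
    """
    i = j = seen = common = 0
    while seen < size and (i < len(sketch_a) or j < len(sketch_b)):
        a = sketch_a[i] if i < len(sketch_a) else None
        b = sketch_b[j] if j < len(sketch_b) else None
        if a is not None and a == b:
            common += 1
            i += 1
            j += 1
        elif b is None or (a is not None and a < b):
            i += 1
        else:
            j += 1
        seen += 1
    # Two empty sketches give nothing to compare; report 0.0 by convention.
    return common / seen if seen else 0.0


original = "The quick brown fox jumps over the lazy dog"
reformatted = "the QUICK brown fox, jumps over the lazy dog!"
extended = original + " and runs away"

print(resemblance(sketch(original), sketch(reformatted)))
print(resemblance(sketch(original), sketch(extended)))
```

Because tokenization discards case and punctuation, the reformatted copy produces the same sketch as the original, which matches the stated goal of ignoring formatting changes. The merge loop visits each sketch entry at most once, which is where the linear comparison time comes from.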


