System and method for near and exact de-duplication of...

Data processing: database and file management or data structures – Database and file access – Preparing data for information retrieval

Reexamination Certificate

Details

Classification: C707S692000, C707S664000
Type: Reexamination Certificate
Status: active
Patent number: 07930306

ABSTRACT:
A system, method, and computer program product for identifying near- and exact-duplicate documents in a document collection, including, for each document in the collection: reading textual content from the document; filtering the textual content based on user settings; determining the N most frequent words from the filtered textual content of the document; performing a quorum search of the N most frequent words in the document with a threshold M; and sorting results from the quorum search based on relevancy. Based on the values of N and M, near- and exact-duplicate documents are identified in the document collection.
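The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Several details are assumptions made only for illustration: the "user settings" are modeled as a stop-word list and a minimum word length, the quorum search is modeled as "a document matches if it contains at least M of the N query words", and relevancy is modeled as the number of matched words. The names `filter_text`, `top_n_words`, `quorum_search`, and `find_duplicates` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-in for the "user settings" that drive filtering.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}


def filter_text(text, stop_words=STOP_WORDS, min_len=3):
    """Lowercase, tokenize, and drop stop words and very short tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) >= min_len and w not in stop_words]


def top_n_words(words, n):
    """Return the n most frequent words; ties are broken alphabetically."""
    counts = Counter(words)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def quorum_search(query_words, vocab_by_doc, m):
    """Return (doc_id, matched) for documents containing at least m query words.

    Results are sorted by the assumed relevancy score (matched word count),
    highest first, then by doc_id for a stable order.
    """
    hits = []
    for doc_id, vocab in vocab_by_doc.items():
        matched = sum(1 for w in query_words if w in vocab)
        if matched >= m:
            hits.append((doc_id, matched))
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits


def find_duplicates(docs, n, m):
    """For each document, list the other documents that pass the quorum test."""
    filtered = {doc_id: filter_text(text) for doc_id, text in docs.items()}
    vocab_by_doc = {doc_id: set(words) for doc_id, words in filtered.items()}
    result = {}
    for doc_id, words in filtered.items():
        query = top_n_words(words, n)
        hits = quorum_search(query, vocab_by_doc, m)
        # A document always matches itself, so leave it out of its own list.
        result[doc_id] = [(other, score) for other, score in hits if other != doc_id]
    return result


docs = {
    "a": "Quarterly revenue report: revenue grew while costs fell. Revenue outlook stable.",
    "b": "Quarterly revenue report: revenue grew while costs fell. Revenue outlook is stable!",
    "c": "Meeting notes about the office picnic and parking arrangements.",
}
result = find_duplicates(docs, n=5, m=5)
```

In this sketch, lowering M relative to N loosens the match, so documents that share only some of the most frequent words are also reported. How the patent itself maps particular values of N and M to "near" versus "exact" duplicates is not stated in the abstract.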

REFERENCES:
patent: 5600835 (1997-02-01), Garland et al.
patent: 6137911 (2000-10-01), Zhilyaev
patent: 6230155 (2001-05-01), Broder et al.
patent: 6658423 (2003-12-01), Pugh et al.
patent: 7660819 (2010-02-01), Frieder et al.
patent: 7734627 (2010-06-01), Tong
patent: 7779002 (2010-08-01), Gomes et al.
patent: 2001/0047365 (2001-11-01), Yonaitis
patent: 2008/0044016 (2008-02-01), Henzinger
patent: 2008/0263026 (2008-10-01), Sasturkar et al.
Cooper, James W., et al., “Detecting Similar Documents Using Salient Terms” Nov. 4-9, 2002, ACM, p. 1-7.
Koberstein, Jonathan, et al., “Using Word Clusters to Detect Similar Web Documents” 2006, Springer-Verlag, p. 215-228.
Hammouda, Khaled M., “Web Mining: Clustering Web Documents A Preliminary Review” Feb. 26, 2001, University of Waterloo, p. 1-13.
Wong, Wai-chiu, et al., “Incremental Document Clustering for Web Page Classification” Jul. 1, 2000, The Chinese University of Hong Kong, p. 0-20.
The copyright deposit No. TX 6-320-844 “1-ZYIMAGE. 5.0.” ZyLAB Technologies BV, Amsterdam, NL, 1983-2005. 20 pages.
The copyright deposit No. Txu-534-683 “Zy4 search module main module”, Information Dimensions, Incorporated, Oct. 3, 1991.
