Decreasing the fragility of duplicate document detecting...

Electrical computers and digital processing systems: support – Multiple computer communication using cryptography – Particular communication authentication technique

Reexamination Certificate

Details

C713S180000, C380S059000, C715S752000, C726S022000, C726S026000

active

07624274

ABSTRACT:
In a signature-based duplicate detection system, multiple different lexicons are used to generate a signature for a document that comprises multiple sub-signatures. The signature of an e-mail or other document may be defined as the set of sub-signatures generated based on the multiple different lexicons. When a collection of sub-signatures is used as a document's signature, two documents may be considered duplicates when a sub-signature generated based on a particular lexicon in the collection for the first document matches the sub-signature generated based on the same lexicon in the collection for the second document.
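The matching rule in the abstract can be illustrated with a minimal Python sketch. This is not the patented method. It assumes that each sub-signature is a SHA-1 hash of the sorted, unique document tokens that also appear in a given lexicon, a filtering scheme in the spirit of the collection-statistics approach in the cited Chowdhury et al. reference. The function names, the tokenizer, and the two lexicons are hypothetical and chosen only for illustration.

```python
import hashlib
import re


def tokenize(text):
    """Lowercase the text and return its set of unique word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def sub_signature(tokens, lexicon):
    """Hash the sorted document tokens that also appear in the lexicon.

    Returns None when no token survives the filter, so that two unrelated
    documents do not match merely because both filtered down to nothing.
    (Assumed scheme; the abstract does not specify how a sub-signature
    is computed.)
    """
    kept = sorted(tokens & lexicon)
    if not kept:
        return None
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def signature(text, lexicons):
    """Return one sub-signature per lexicon, in lexicon order."""
    tokens = tokenize(text)
    return [sub_signature(tokens, lex) for lex in lexicons]


def is_duplicate(sig_a, sig_b):
    """True if any same-lexicon pair of sub-signatures matches."""
    return any(a is not None and a == b for a, b in zip(sig_a, sig_b))


# Hypothetical lexicons, chosen only for illustration.
LEXICONS = [
    frozenset({"cheap", "pills", "online", "order", "discount"}),
    frozenset({"cheap", "pills", "order", "today"}),
]

original = "Order cheap pills online"
variant = "Order cheap pills online with a discount!!"
unrelated = "Meeting notes for Tuesday"

sig_original = signature(original, LEXICONS)
sig_variant = signature(variant, LEXICONS)
sig_unrelated = signature(unrelated, LEXICONS)

print(is_duplicate(sig_original, sig_variant))    # matched via the second lexicon
print(is_duplicate(sig_original, sig_unrelated))  # no lexicon yields a match
```

In this sketch the added word "discount" changes the sub-signature under the first lexicon but not under the second, so the variant is still flagged as a duplicate. A single-lexicon signature would have missed it, which is the fragility the title refers to.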

REFERENCES:
patent: 5619709 (1997-04-01), Caid
patent: 6621930 (2003-09-01), Smadja
patent: 6658423 (2003-12-01), Pugh et al.
patent: 2003/0221166 (2003-11-01), Farahat
patent: 2006/0294077 (2006-12-01), Bluhm
Application filed Dec. 21, 2004 (U.S. Appl. No. 11/016,930).
Office Action dated May 24, 2007 (U.S. Appl. No. 11/016,930).
Application filed Dec. 21, 2004 (U.S. Appl. No. 11/016,959).
Office Action dated May 31, 2007 (U.S. Appl. No. 11/016,959).
Androutsopoulos et al., An Evaluation of Naive Bayesian Anti-Spam Filtering, Proceedings of the Workshop on Machine Learning in the New Information Age: 11th European Conference on Machine Learning (ECML 2000), G. Potamias, V. Moustakis, and M. van Someren, eds., 2000, pp. 9-17.
Bilenko et al., Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases, Tech. Rep. AI 02-296, Artificial Intelligence Lab, University of Texas at Austin, 2002, pp. 1-19.
Breiman, Bagging Predictors, Machine Learning, 24 (1996), pp. 123-140.
Brin et al., Copy Detection Mechanisms for Digital Documents, Proceedings of SIGMOD, 1995, pp. 398-409.
Broder, On the Resemblance and Containment of Documents, SEQS: Sequences '97, 1998, pp. 21-29.
Broder et al., Syntactic Clustering of the Web, Computer Networks and ISDN Systems 29, 1997, pp. 1157-1166.
Buckley et al., The Smart/Empire Tipster IR System, Proceedings—Tipster Text Program Phase III, 2000, pp. 107-121.
Chowdhury et al., Collection Statistics for Fast Duplicate Document Detection, ACM Transactions on Information Systems, 20 (2002), pp. 171-191.
Cooper et al., A Novel Method for Detecting Similar Documents, Proceedings of the 35th Hawaii International Conference on System Sciences, 2002.
Graham-Cumming, The Spammers' Compendium, Proceedings of the Spam Conference, Jan. 17, 2003, pp. 1-17.
Gionis et al., Similarity Search in High Dimensions via Hashing, Proceedings of the 25th International Conference on Very Large Databases (VLDB), 1999, pp. 518-529.
Fetterly et al., On the Evolution of Clusters of Near-Duplicate Web Pages, Proceedings of the First Latin American Web Congress, 2003, pp. 37-45.
Fawcett, “In vivo” Spam Filtering: a Challenge Problem for KDD, SIGKDD Explorations, vol. 5, Issue 2, (2003), pp. 140-148.
Drucker et al., Support Vector Machines for Spam Categorization, IEEE Transactions on Neural Networks, vol. 10, No. 5, Sep. 1999, pp. 1048-1054.
Hall, A Countermeasure to Duplicate-Detecting Anti-Spam Techniques, AT&T Labs Technical Report 99.9.1, AT&T Corp., 1999, pp. 1-26.
Haveliwala et al., Scalable Techniques for Clustering the Web, Proceedings of WebDB 2000, 2000.
Heintze, Scalable Document Fingerprinting, The USENIX Association, Proceedings of the Second USENIX Workshop on Electronic Commerce, Nov. 1996, pp. 191-200.
Hernandez et al., The Merge/Purge Problem for Large Databases, Proceedings of the SIGMOD Conference, 1995, pp. 127-138.
Hoad et al., Methods for Identifying Versioned and Plagiarised Documents, Journal of the American Society for Information Science and Technology, 2002, pp. 203-215.
Ilyinsky et al., An Efficient Method to Detect Duplicates of Web Documents With the Use of Inverted Index, Proceedings of the Eleventh International World Wide Web Conference, 2002.
Kleinberg, Bursty and Hierarchical Structure in Streams, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), 2002, pp. 1-25.
Kolcz et al., Data Duplication: An Imbalance Problem?, Proceedings of the ICML '2003 Workshop on Learning from Imbalanced Datasets (II), 2003.
Kolcz et al., SVM-Based Filtering of E-Mail Spam With Content-Specific Misclassification Costs, Proceedings of the Workshop on Text Mining (TextDM'2001), 2001, pp. 1-14.
Kwok, A New Method of Weighting Query Terms for Ad-Hoc Retrieval, Computer Science Department, Queens College, City University of New York, Flushing NY.
McCallum et al., Efficient Clustering of High-Dimensional Data Sets With Application to Reference Matching, Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), 2000.
Robertson et al., Okapi at TREC-7: Automatic Ad Hoc, Filtering, VLC and Interactive, Proceedings of the 7th Text Retrieval Conference, 1998, pp. 253-264.
Sahami et al., A Bayesian Approach to Filtering Junk E-Mail, Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, 1998.
Salton et al., A Vector-Space Model for Information Retrieval, Communications of the ACM, vol. 18, No. 11, Nov. 1975, pp. 613-620.
Sanderson et al., Duplicate Detection in the Reuters Collection, Tech. Rep. TR-1997-5, Department of Computing Science, University of Glasgow, 1997, pp. 11.
Shivakumar et al., Finding Near-Replicas of Documents on the Web, WEBDB: International Workshop on the World Wide Web and Databases, WebDB, LNCS, 1999.
Singhal et al., Pivoted Document Length Normalization, Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996.
Winkler et al., The State of Record Linkage and Current Research Problems, Tech. Rep., Statistical Research Division, U.S. Bureau of Census, Washington, DC, 1999.
Androutsopoulos et al., Learning to Filter Unsolicited Commercial E-Mail, Technical Report 2004/2, NCSR Demokritos, 2004, pp. 1-52.
Baker et al., Distributional Clustering of Words for Text Classification, Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, 1998, pp. 96-103.
Carreras et al., Boosting Trees for Anti-Spam Email Filtering, Proceedings of RANLP-01, 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, BG, 2001.
Slonim et al., The Power of Word Clusters for Text Classification, 23rd European Colloquium on Information Retrieval Research, 2001, pp. 1-12.
Yerazunis, Sparse Binary Polynomial Hashing and the CRM114 Discriminator, MIT Spam Conference, 2003.
Zhou et al., Approximate Object Location and Spam Filtering on Peer-to-Peer Systems, Proceedings of ACM/IFIP/USENIX International Middleware Conference (Middleware 2003), 2003, pp. 1-20.
