Reliability of duplicate document detection algorithms

Data processing: database and file management or data structures – Data integrity – Using checksum

Reexamination Certificate

: 2011-07-19
: 2011-07-19
Examiner: Mofiz, Apu M (Department: 2161)
Classification: Data processing: database and file management or data structures – Data integrity – Using checksum
Classification codes: C707S692000, C707S741000, C707S742000, C707S747000, C707S727000, C707S728000, C707S729000, C707S730000
Type: Reexamination Certificate
Status: active
Patent number: 07984029
ABSTRACT:
In a single-signature duplicate document system, a secondary set of attributes is used in addition to a primary set of attributes so as to improve the precision of the system. When the projection of a document onto the primary set of attributes is below a threshold, then a secondary set of attributes is used to supplement the primary lexicon so that the projection is above the threshold.
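The fallback described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that "attributes" are word tokens, that the "projection" is the set of document tokens found in a lexicon, that the threshold is a simple term count, and that the signature is a SHA-1 hash of the sorted projection. The names `signature`, `tokenize`, `min_terms`, and both example lexicons are hypothetical.

```python
import hashlib
import re


def tokenize(text):
    """Lowercase the text and return its set of unique alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def signature(text, primary, secondary, min_terms=3):
    """Return one hex signature for text, or None if too few terms survive.

    primary, secondary: sets of lexicon terms (hypothetical example lexicons).
    min_terms: assumed threshold on the number of projected terms.
    """
    tokens = tokenize(text)
    projection = tokens & primary
    if len(projection) < min_terms:
        # Projection onto the primary lexicon is too small, so supplement
        # it with the terms that match the secondary lexicon.
        projection |= tokens & secondary
    if len(projection) < min_terms:
        # Still too little evidence to form a reliable signature.
        return None
    # Sorting makes the signature independent of word order.
    canonical = " ".join(sorted(projection))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


# Hypothetical lexicons and documents, for illustration only.
primary = {"mortgage", "refinance", "rates", "loan"}
secondary = {"free", "click", "offer", "today"}

doc_a = "Click today for a FREE mortgage offer!"
doc_b = "free offer: click for mortgage, today"
doc_c = "Refinance your mortgage loan at low rates"

sig_a = signature(doc_a, primary, secondary)
sig_b = signature(doc_b, primary, secondary)
sig_c = signature(doc_c, primary, secondary)
print(sig_a == sig_b)  # the two reworded copies share a signature
print(sig_a == sig_c)  # the unrelated document does not
```

In this sketch `doc_a` and `doc_b` match only one primary term each, so the secondary lexicon is consulted and both reduce to the same five terms. `doc_c` already matches four primary terms, so its signature is computed from the primary lexicon alone.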

REFERENCES:
patent: 5619709 (1997-04-01), Caid
patent: 6621930 (2003-09-01), Smadja
patent: 6658423 (2003-12-01), Pugh et al.
patent: 2003/0221166 (2003-11-01), Farahat
patent: 2005/0060643 (2005-03-01), Glass et al.
patent: 2006/0294077 (2006-12-01), Bluhm et al.
Conrad et al., Online Duplicate Document Detection: Signature Reliability in a Dynamic Retrieval Environment, ACM, 2003.
Application filed Dec. 21, 2004 (U.S. Appl. No. 11/016,928).
Application filed Dec. 21, 2004 (U.S. Appl. No. 11/016,930).
Office Action dated May 24, 2007 (U.S. Appl. No. 11/016,930).
Office Action dated Sep. 2, 2008 (U.S. Appl. No. 11/016,930).
Androutsopoulos et al., An Evaluation of Naive Bayesian Anti-Spam Filtering, Proceedings of the Workshop on Machine Learning in the New Information Age: 11th European Conference on Machine Learning (ECML 2000), G. Potamias, V. Moustakis, and M. van Someren, eds., 2000, pp. 9-17.
Bilenko et al., Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases, Tech. Rep. AI 02-296, Artificial Intelligence Lab, University of Texas at Austin, 2002, pp. 1-19.
Breiman, Bagging Predictors, Machine Learning, 24 (1996), pp. 123-140.
Brin et al., Detection Mechanisms for Digital Documents, Proceedings of SIGMOD, 1995, pp. 398-409.
Broder, On the Resemblance and Containment of Documents, SEQS: Sequences '97, 1998, pp. 21-29.
Broder et al., Syntactic Clustering of the Web, Computer Networks and ISDN Systems 29, 1997, pp. 1157-1166.
Buckley et al., The Smart/Empire Tipster IR System, Proceedings—Tipster Text Program Phase III, 2000, pp. 107-121.
Chowdhury et al., Collection Statistics for Fast Duplicate Document Detection, ACM Transactions on Information Systems, 20 (2002), pp. 171-191.
Cooper et al., A Novel Method for Detecting Similar Documents, Proceedings of the 35th Hawaii International Conference on System Sciences, 2002.
Graham-Cumming, The Spammers' Compendium, Proceedings of the Spam Conference, Jan. 17, 2003, pp. 1-17.
Gionis et al., Similarity Search in High Dimensions Via Hashing, Proceedings of the 25th International Conference on Very Large Databases (VLDB), 1999, pp. 518-529.
Fetterly et al., On the Evolution of Clusters of Near-Duplicate Web Pages, Proceedings of the First Latin American Web Congress, 2003, pp. 37-45.
Fawcett, “In Vivo” Spam Filtering: A Challenge Problem for KDD, SIGKDD Explorations, vol. 5, Issue 2, (2003), pp. 140-148.
Drucker et al., Support Vector Machines for Spam Categorization, IEEE Transactions on Neural Networks, vol. 10, No. 5, Sep. 1999, pp. 1048-1054.
Hall, A Countermeasure to Duplicate-Detecting Anti-Spam Techniques, AT&T Labs Technical Report 99.9.1, AT&T Corp., 1999, pp. 1-26.
Haveliwala et al., Scalable Techniques for Clustering the Web, Proceedings of WebDB 2000, 2000.
Heintze, Scalable Document Fingerprinting, The USENIX Association, Proceedings of the Second USENIX Workshop on Electronic Commerce, Nov. 1996, pp. 191-200.
Hernandez et al., The Merge/Purge Problem for Large Databases, Proceedings of the SIGMOD Conference, 1995, pp. 127-138.
Hoad et al., Methods for Identifying Versioned and Plagiarised Documents, Journal of the American Society for Information Science and Technology, 2002, pp. 203-215.
Ilyinsky et al., An Efficient Method to Detect Duplicates of Web Documents With the Use of Inverted Index, Proceedings of the Eleventh International World Wide Web Conference, 2002.
Kleinberg, Bursty and Hierarchical Structure in Streams, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), 2002, pp. 1-25.
Kolcz et al., Data Duplication: An Imbalance Problem?, Proceedings of the ICML '2003 Workshop on Learning from Imbalanced Datasets (II), 2003.
Kolcz et al., SVM-Based Filtering of E-Mail Spam With Content-Specific Misclassification Costs, Proceedings of the Workshop on Text Mining (TextDM'2001), 2001, pp. 1-14.
Kwok, A New Method of Weighting Query Terms for Ad-Hoc Retrieval, Computer Science Department, Queens College, City University of New York, Flushing NY.
McCallum et al., Efficient Clustering of High-Dimensional Data Sets With Application to Reference Matching, Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), 2000.
Robertson et al., Okapi at TREC-7: Automatic Ad Hoc, Filtering, VLC and Interactive, Proceedings of the 7th Text Retrieval Conference, 1998, pp. 253-264.
Sahami et al., A Bayesian Approach to Filtering Junk E-Mail, Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, 1998.
Salton et al., A Vector-Space Model for Information Retrieval, Communications of the ACM, vol. 18, No. 11, Nov. 1975, pp. 613-620.
Sanderson et al., Duplicate Detection in the Reuters Collection, Tech. Rep. TR-1997-5, Department of Computing Science, University of Glasgow, 1997, pp. 11.
Shivakumar et al., Finding Near-Replicas of Documents on the Web, WEBDB: International Workshop on the World Wide Web and Databases, WebDB, LNCS, 1999.
Singhal et al., Pivoted Document Length Normalization, Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996.
Winkler et al., The State of Record Linkage and Current Research Problems, Tech. Rep., Statistical Research Division, U.S. Bureau of Census, Washington, DC, 1999.
Androutsopoulos et al., Learning to Filter Unsolicited Commercial E-Mail. Technical Report Feb. 2004, NCSR Demokritos, 2004, pp. 1-52.
Baker et al., Distributional Clustering of Words for Text Classification, Proceedings of SIGIR-98, 21st ACM international Conference on Research and Development in Information Retrieval, 1998, pp. 96-103.
Carreras et al., Boosting Trees for Anti-Spam Email Filtering, Proceedings of RANLP-01, 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, BG, 2001.
Slonim et al., The Power of Word Clusters for Text Classification, 23rd European Colloquium on Information Retrieval Research, 2001, pp. 1-12.
Yerazunis, Sparse Binary Polynomial Hashing and the CRM114 Discriminator, MIT Spam Conference, 2003.
Zhou et al., Approximate Object Location and Spam Filtering on Peer-To-Peer Systems, Proceedings of ACM/IFIP/USENIX International Middleware Conference (Middleware 2003), 2003, pp. 1-20.
Office Action dated Sep. 2, 2008 (U.S. Appl. No. 11/016,928).

Affiliated with

Alspector Joshua (Inventor)
Chowdhury Abdur R. (Inventor)
Kolcz Aleksander (Inventor)

Also associated with

AOL Inc. (Corporate Assignee)
Finnegan Henderson Farabow Garrett & Dunner L.L.P. (Law Firm)
Mofiz Apu M (Examiner)
Nguyen Cindy (Examiner)