Classifier tuning based on data similarities

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000, C707S793000, C709S205000, C709S206000, C709S207000

active

07089241

ABSTRACT:
A probabilistic classifier is used to classify data items in a data stream. The probabilistic classifier is trained, and an initial classification threshold is set, using unique training and evaluation data sets (i.e., data sets that do not contain duplicate data items). Unique data sets are used for training and in setting the initial classification threshold so as to prevent the classifier from being improperly biased as a result of similarity rates in the training and evaluation data sets that do not reflect similarity rates encountered during operation. During operation, information regarding the actual similarity rates of data items in the data stream is obtained and used to adjust the classification threshold such that misclassification costs are minimized given the actual similarity rates.
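
The threshold adjustment described in the abstract can be illustrated with a minimal sketch. The abstract does not give a formula, so everything here is an assumption made for illustration: the function name `pick_threshold`, the per-class duplication rates `dup_rate_pos` and `dup_rate_neg` (average number of copies of each positive or negative item seen in the stream), and the cost model that weights each error on a unique evaluation item by its class's duplication rate. With both rates set to 1.0 the function yields an initial threshold from the unique evaluation set; calling it again with rates observed during operation yields an adjusted threshold.

```python
def pick_threshold(scores, labels, cost_fp, cost_fn,
                   dup_rate_pos=1.0, dup_rate_neg=1.0):
    """Return (threshold, cost) minimizing weighted misclassification cost.

    scores: classifier outputs P(positive) for each unique evaluation item
    labels: 1 for positive (e.g. spam), 0 for negative
    dup_rate_*: assumed average multiplicity of items of each class in the stream
    An item is classified positive when its score >= threshold.
    """
    # Candidate thresholds: every observed score, plus "never classify positive".
    candidates = sorted(set(scores)) + [float("inf")]
    best_t, best_cost = None, float("inf")
    for t in candidates:
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        # Each error on a unique item is scaled by how often such items repeat.
        cost = cost_fp * dup_rate_neg * fp + cost_fn * dup_rate_pos * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost


# Toy evaluation set of unique items (hypothetical values).
scores = [0.1, 0.4, 0.6, 0.9]
labels = [0, 1, 0, 1]

# Initial threshold: no duplication assumed, false positives cost 5x more.
initial = pick_threshold(scores, labels, cost_fp=5, cost_fn=1)

# Adjusted threshold: positives observed to repeat about 10 times each.
adjusted = pick_threshold(scores, labels, cost_fp=5, cost_fn=1, dup_rate_pos=10.0)
```

In this toy case the threshold drops once positives are known to be heavily duplicated, because each missed positive now costs more in aggregate. A real system would estimate the rates from the live stream, for example with a duplicate-detection component, rather than pass them in as constants.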

