Data processing: artificial intelligence – Knowledge processing system – Knowledge representation and reasoning technique
Reexamination Certificate
2000-06-08
2003-02-11
Davis, George B. (Department: 2121)
C706S016000
active
06519580
ABSTRACT:
DESCRIPTION
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to supervised learning as applied to text categorization, and, more particularly, to a method for categorizing messages or documents containing text.
2. Background Description
The text categorization problem is to determine predefined categories for an incoming unlabeled message or document containing text based on information extracted from a training set of labeled messages or documents. Text categorization is an important practical problem for companies that wish to use computers to categorize incoming email, thereby either enabling an automatic machine response to the email or simply ensuring that the email reaches the correct human recipient. Beyond email, text items to be categorized may come from many sources, including the output of voice recognition software, collections of documents (e.g., news stories, patents, or case summaries), and the contents of web pages.
For the purposes of the following description, any data item containing text is referred to as a document, and the term herein is to be taken in this most general sense.
Previous text categorization methods have used decision trees, naive Bayes classifiers, nearest neighbor methods, neural nets, support vector machines and various kinds of symbolic rule induction.
The present invention relates to symbolic rule induction systems, so such systems will now be described at a general level that is known in the art. In such a system, data is represented as vectors in which the components are numerical values associated with certain features of the data. The system induces rules from the training data, and the generated rules can then be used to categorize arbitrary data that is similar to the training data. Each rule ultimately produced by such a system states that a condition, which is usually a conjunction of simpler conditions, implies membership in a particular category. The condition forms the antecedent of the rule, and the conclusion posited as true when the condition is satisfied is the consequent of the rule. Usually, a data item is represented as a vector of numerical components, with each component corresponding to a possible feature of the data, and the antecedent of a rule is a combination of tests to be done on various components. Under a scenario in which features are words that may appear in a document and the corresponding numerical values in vectors representing documents are word counts, an example of a rule is
share>3 & year<=1 & acquire>2→acq
which may be read as “if the word ‘share’ occurs more than three times in the document and the word ‘year’ occurs at most one time in the document and the word ‘acquire’ occurs more than twice in the document, then classify the document in the category ‘acq’.” Here the antecedent is
share>3 & year<=1 & acquire>2
and the consequent is acq. Alternatively, the rule above could be read as “if words equivalent to ‘share’ occur more than three times in the document and words equivalent to ‘year’ occur at most one time in the document and words equivalent to ‘acquire’ occur more than twice in the document, then classify the document in the category ‘acq’.” This latter reading of the rule reflects an assumption that stemming was done. Stemming is the replacement of words by corresponding canonical forms (or stems). Existing symbolic rule induction systems do not categorize documents accurately enough for many commercial applications, or their training time is excessive, or both.
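As an illustration of how a rule of the kind shown above can be applied to a document, the following minimal Python sketch evaluates the example rule against a vector of word counts. It is not the patented method: the function names, the tokenizer, and the tuple encoding of the antecedent are assumptions made for this example, and no stemming is performed, so the sketch corresponds to the first reading of the rule.

```python
from collections import Counter
import operator
import re

# Comparison operators used by the example rule (illustrative subset).
OPS = {">": operator.gt, "<=": operator.le}

# Hypothetical encoding of the rule: share>3 & year<=1 & acquire>2 -> acq
ANTECEDENT = [("share", ">", 3), ("year", "<=", 1), ("acquire", ">", 2)]
CONSEQUENT = "acq"


def word_counts(text):
    """Lowercase the text and count word occurrences (no stemming)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def rule_fires(counts, antecedent):
    """Return True when every test in the conjunction holds.

    A Counter yields 0 for words that never occur, so a test such as
    year<=1 is satisfied by a document that does not mention 'year'.
    """
    return all(OPS[op](counts[word], threshold)
               for word, op, threshold in antecedent)


def categorize(text):
    """Return the consequent if the antecedent is satisfied, else None."""
    return CONSEQUENT if rule_fires(word_counts(text), ANTECEDENT) else None
```

Because no stemming is applied here, an occurrence of "shares" is not counted toward "share"; under the second reading of the rule, the counts would be computed over stems instead.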
SUMMARY OF THE INVENTION
It is therefore an object of the present invention to provide a method to automatically categorize messages or documents containing text. The hitherto unsolved practical problem in the field of text categorization is to provide a general text categorization system that in turn provides superior performance in six different ways. These six aspects, which will be explained in more detail below, are:
1. precision,
2. recall,
3. provision for multiple categorization,
4. provision of confidence levels,
5. training speed, and
6. insight and control.
Previous systems fall short on one or more of these desired features. The present invention solves this problem by delivering high performance or providing required functionality in each way.
Precision and recall (1 and 2) are basic measures of the performance of a categorizer. Precision is the proportion of a text categorization system's decisions to place documents in specific categories that are correct. Recall is the proportion of the actual category assignments that are identified correctly by a text categorization system. Precision and recall are much more useful measures of performance in the area of text categorization than the error rate, which is commonly used in most other areas of machine learning. This is because, in text categorization, one typically has many small categories, and so one could obtain a categorizer with a low error rate by simply using a categorizer that placed no document in any category, but such a categorizer would have very little practical utility. Of course, there is a connection between a categorizer's error rate, on one hand, and a categorizer's recall and precision, on the other, because one cannot simultaneously have excellent recall and precision along with a poor error rate.
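The point about error rate can be made concrete with a small Python sketch. It assumes, for illustration only, that the error rate is measured over every (document, category) yes/no decision; the function name and the toy data set are hypothetical and do not come from the patent.

```python
def evaluate(predicted, actual, num_categories):
    """Compute precision, recall, and error rate.

    predicted, actual: lists of sets of category labels, one set per document.
    Error rate is taken over all document/category decisions (an assumption).
    Precision or recall is None when its denominator is zero.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(len(p & a) for p, a in pairs)   # correct assignments
    fp = sum(len(p - a) for p, a in pairs)   # wrong assignments
    fn = sum(len(a - p) for p, a in pairs)   # missed assignments
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    error_rate = (fp + fn) / (len(actual) * num_categories)
    return precision, recall, error_rate


# Toy data: 100 documents, 50 small categories, one true category each.
actual = [{i % 50} for i in range(100)]

# A categorizer that places no document in any category.
empty = [set() for _ in actual]
precision, recall, error_rate = evaluate(empty, actual, 50)
```

On this toy data the empty categorizer is wrong on only 100 of 5000 decisions, an error rate of 0.02, yet its recall is 0.0 and its precision is undefined because it makes no assignments at all.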
Multiple categorization (3) is the possibility for a single document to be assigned to more than one category. This is an essential kind of flexibility needed in many applications. However, a text categorization system that provides for multiple categorization is well-served by a method for assessing the significance of more than one category being assigned to a document. Such a method is the provision of confidence levels (4).
Confidence levels are quantified relative indicators of the level of confidence that may be placed in a categorizer's recommendations. Confidence levels are real numbers typically ranging from 0.0 to 1.0 inclusive, with 0.0 indicating lowest confidence and 1.0 indicating greatest confidence. Confidence levels are particularly important in practical applications of text categorization such as routing email or sending automatic responses to email. Applications of this method should make significant use of confidence levels in evaluating possible alternatives related to a categorizer's assignment of categories to a document. However, previous symbolic rule induction systems for text categorization have not provided confidence levels as part of the rules.
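One way an application might use confidence levels when several categories are assigned to a document is sketched below in Python. This is only an assumed usage pattern for an email application: the function names, the action labels, and the two threshold values are hypothetical and are not specified by the patent.

```python
def select_categories(confidences, threshold=0.5):
    """Keep categories at or above the threshold, highest confidence first.

    confidences: dict mapping category name to a level in [0.0, 1.0].
    """
    kept = [(cat, level) for cat, level in confidences.items()
            if level >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def decide_action(confidences, auto_threshold=0.9, route_threshold=0.5):
    """Choose an action for a message (thresholds are illustrative).

    Returns a pair (action, categories).
    """
    ranked = select_categories(confidences, route_threshold)
    if not ranked:
        # No category is trusted enough to act on.
        return ("manual_review", [])
    top_category, top_level = ranked[0]
    if top_level >= auto_threshold:
        # Very confident: send an automatic response for the top category.
        return ("auto_respond", [top_category])
    # Moderately confident: forward with all plausible categories attached.
    return ("route_to_human", [cat for cat, _ in ranked])
```

In this sketch, a message whose best category has confidence 0.95 would receive an automatic response, while a message with two categories at 0.7 and 0.6 would be routed to a person with both categories listed.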
Training speed (5) refers to the time it takes for a computer to generate a categorizer from training data.
Finally, insight and control (6) refers to the ability of people to understand and manually modify a text categorizer. This is extremely important in real commercial applications in which enterprises frequently have gaps in the coverage of their training data. Inability to compensate for a gap in data coverage could doom a text-categorization-dependent application, such as routing or automatically responding to email. Approaches used in the prior art for text categorization preclude manual intervention. One corollary to the desire for insight and control is that the justifications for a text categorization system's recommendations should be as simple as possible.
According to the invention, a method of solution fits in the general framework of supervised learning, in which a rule or rules for categorizing data are automatically constructed by a computer on the basis of training data that has been labeled with a predefined set of categories beforehand. More specifically, the method for rule induction involves the novel combination of:
1. inducing from the training data a decision tree for each category;
2. the automated construction from each decision tree of a simplified symbolic rule set that is logically equivalent overall to the decision tree, and which is to be used for categorization instead of the decision tree; and
3. the determination of a confidence lev
Johnson David E.
Oles Frank J.
Zhang Tong
Davis George B.
International Business Machines Corporation
Kaufman Stephen C.
Whitham Curtis & Christofferson, P.C.