Text categorizers based on regularizing adaptations of the...

Data processing: artificial intelligence – Machine learning


Details

US Classification: C707S793000
Type: Reexamination Certificate
Status: active
Patent Number: 06571225


BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to supervised learning as applied to text categorization and, more particularly, to a method for categorizing messages or documents containing text.
2. Background Description
The text categorization problem is to determine predefined categories for an incoming unlabeled message or document containing text based on information extracted from a training set of labeled messages or documents. Text categorization is an important practical problem for companies that wish to use computers to categorize incoming e-mail (electronic mail), thereby either enabling an automatic machine response to the e-mail or simply assuring that the e-mail reaches the correct human recipient. But, beyond e-mail, text items to be categorized may come from many sources, including the output of voice recognition software, collections of documents (e.g., news stories, patents, or case summaries), and the contents of web pages.
Any data item containing text is referred to as a document, and the term herein is meant to be taken in this most general sense.
The text categorization problem can be reduced to a set of binary classification problems, one for each category, where for each one wishes to determine a method for predicting in-class versus out-of-class membership. Such supervised learning problems have been widely studied in the past. Recently, many methods developed for classification problems have been applied to text categorization. For example, Chidanand Apte, Fred Damerau, and Sholom M. Weiss in "Automated learning of decision rules for text categorization", ACM Transactions on Information Systems, 12:233-251 (1994), applied an inductive rule learning algorithm, SWAP1, to the text categorization problem. Yiming Yang and Christopher G. Chute in "An example-based mapping method for text categorization and retrieval", ACM Transactions on Information Systems, 12:252-277 (1994), proposed a linear least square fitting algorithm to train linear classifiers. Yiming Yang also compared a number of statistical methods for text categorization in "An evaluation of statistical approaches to text categorization", Information Retrieval Journal, 1:66-99 (1999). The best performances previously reported in the literature are from weighted resampled decision trees (i.e., boosting) in "Maximizing text-mining performance" by S. M. Weiss, C. Apte, F. Damerau, D. E. Johnson, F. J. Oles, T. Goetz, and T. Hampp, IEEE Intelligent Systems, 14:63-69 (1999), and from support vector machines in "Inductive learning algorithms and representations for text categorization" by S. Dumais, J. Platt, D. Heckerman, and M. Sahami, Technical Report, Microsoft Research (1998). However, training the classifier in these approaches is much slower than the method we will be presenting here.
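To make the per-category reduction described above concrete, here is a minimal sketch. It is illustrative only, not prescribed by the patent; the category names, documents, and the dictionary-of-problems structure are assumptions chosen for the example.

```python
# Hypothetical sketch of reducing multi-category text categorization
# to one binary (in-class vs. out-of-class) problem per category.

def reduce_to_binary_problems(documents, labels, categories):
    """For each category, relabel the training set as +1 (in-class)
    or -1 (out-of-class), yielding one binary problem per category."""
    problems = {}
    for category in categories:
        binary_labels = [1 if category in doc_labels else -1
                         for doc_labels in labels]
        problems[category] = (documents, binary_labels)
    return problems

# Example: two documents, each carrying one or more category labels.
docs = ["interest rates rise", "wheat harvest forecast"]
labels = [{"earnings", "money-fx"}, {"grain"}]
cats = ["earnings", "money-fx", "grain"]
for cat, (x, y) in reduce_to_binary_problems(docs, labels, cats).items():
    print(cat, y)   # e.g. earnings [1, -1]
```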
Common to all these approaches is the use of a numeric vector to represent a document. This can be done in many ways. Because of the vast numbers of different words that may appear in text, generally one gets sparse vectors of very high dimensionality as document representations. Thus, text categorization necessitates using techniques of supervised learning that are well suited to high dimensional data.
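One common such representation, shown in the sketch below purely for concreteness (it is an assumption, not the patent's specified scheme), is a bag-of-words encoding that stores each document as a dictionary of its nonzero coordinates, the natural structure for sparse, high-dimensional vectors:

```python
# Minimal bag-of-words sketch: each document becomes a sparse vector of
# word counts over a vocabulary, stored as {index: count} since almost
# all coordinates are zero in high dimensions.

def build_vocabulary(documents):
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_sparse_vector(doc, vocab):
    vec = {}
    for word in doc.lower().split():
        if word in vocab:
            idx = vocab[word]
            vec[idx] = vec.get(idx, 0) + 1
    return vec

docs = ["the cat sat", "the dog barked at the cat"]
vocab = build_vocabulary(docs)          # dimensionality = len(vocab)
print(to_sparse_vector(docs[1], vocab)) # {0: 2, 3: 1, 4: 1, 5: 1, 1: 1}
```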
Formally, a two-class pattern recognition problem is to determine a label y ∈ {−1, 1} associated with a vector x of input variables. A useful method for solving this problem is by using linear discriminant functions, which consist of linear combinations of the input variables. Various techniques have been proposed for determining the weight values for linear discriminant classifiers from a training set of labeled data (x_1, y_1), ..., (x_n, y_n). Here, and throughout this document, n is the number of items in a training set. Specifically, desired are a weight vector w and a threshold t such that w^T x < t if its label y = −1 and w^T x ≥ t if its label y = 1. Here, the notation w^T x means the product, using matrix multiplication, of the transpose of the column vector w and the column vector x, which is the same as the inner product of w and x, which is the same as the dot product of w and x. Thus, the hyperplane consisting of all x such that w^T x = t would approximately separate the in-class vectors from the out-of-class vectors.
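A minimal sketch of this decision rule follows; the weight vector and threshold values are arbitrary illustrations, not learned parameters from any method in the patent.

```python
# Sketch of the decision rule of a linear discriminant classifier:
# predict in-class (y = 1) when w^T x >= t, out-of-class (y = -1) otherwise.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def predict(w, t, x):
    return 1 if dot(w, x) >= t else -1

w = [0.5, -1.0, 2.0]   # illustrative weight vector
t = 0.25               # illustrative threshold
print(predict(w, t, [1.0, 0.0, 0.0]))  # dot = 0.5  >= 0.25 -> 1
print(predict(w, t, [0.0, 1.0, 0.0]))  # dot = -1.0 <  0.25 -> -1
```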
The problem just described may readily be converted into one in which the threshold t is taken to be zero. One does this by embedding all the data into a space with one more dimension, and then translating the original space by some chosen nonzero distance A from its original position. Normally, one takes A=1. Hence, in this conversion, each vector (z_1, ..., z_m) is traded in for (z_1, ..., z_m, A). For each hyperplane in the original space, there is a unique hyperplane in the larger space that passes through the origin of the larger space. Instead of searching for an m-dimensional weight vector together with a threshold t, one can therefore search for an (m+1)-dimensional weight vector with an anticipated threshold of zero.
Under the assumption that the vectors of input variables have been suitably transformed so that we may take t=0, the training error rate for a linear classifier with weight vector w is given by
$$\frac{1}{n} \sum_{i=1}^{n} f\left(w^T x_i\, y_i\right), \qquad (1)$$
where f is the step function
$$f(x) = \begin{cases} 1 & \text{if } x \le 0, \\ 0 & \text{if } x > 0. \end{cases} \qquad (2)$$
A number of approaches to solving categorization problems by finding linear discriminant functions have been advanced over the years. In the early statistical literature, the weights were obtained by using linear discriminant analysis, which makes the assumption that each class has a Gaussian distribution (see, for example, B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press (1996), chapter 3). Similar to linear discriminant analysis, an approach widely used in the neural net community is the least square fitting algorithm. Least square fitting has been applied to text categorization problems as described by Yiming Yang et al., supra. Without any assumption on the distribution, a linear separator can be obtained by using the perceptron algorithm that minimizes training error, as described by M. L. Minsky and S. A. Papert in Perceptrons, Expanded Ed., MIT Press, Cambridge, Mass. (1990).
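For illustration, here is a minimal sketch of the classical perceptron update on threshold-folded vectors. This is the prior-art algorithm cited above, not the patent's own method, and the data shown are arbitrary.

```python
# Minimal sketch of the classical perceptron algorithm on augmented
# vectors (threshold already folded in, so we test the sign of w^T x).

def perceptron(xs, ys, epochs=10):
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            if sum(wi * xi for wi, xi in zip(w, x)) * y <= 0:
                # misclassified: nudge w toward y * x
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# Tiny linearly separable example; last coordinate is the constant A = 1.
xs = [[2.0, 1.0], [1.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]]
ys = [1, 1, -1, -1]
print(perceptron(xs, ys))  # a separating weight vector: [2.0, 1.0]
```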
In an attempt to deal with the problem of overfitting the training data, some newer techniques have a theoretical basis in the analysis of the generalization error of classification methods that aim to minimize the training error. This analysis often involves a concept called VC dimension, which was originally discovered by V. N. Vapnik and A. J. Chervonenkis in "On the uniform convergence of relative frequencies of events to their probabilities", Theory of Probability and Its Applications, 16:264-280 (1971), and, independently, by N. Sauer in "On the density of families of sets", Journal of Combinatorial Theory (Series A), 13:145-147 (1972). However, in general, the VC dimension is proportional to the dimension d of the underlying variable x. Vapnik later realized that by restricting the magnitude of the weights, it is possible to achieve generalization performance which is independent of d (V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York (1995)). The idea of restricting the magnitude of the weights has been known in the neural net community, and was analyzed by P. L. Bartlett in "The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network", IEEE Transactions on Information Theory, 44(2):525-536 (1998).
It is also known that the generalization performance of a linear classifier trained to minimize the training error is determined by its capacity, which can be measured by the concept of covering number, originally studied by A. N. Kolmogorov and V. M. Tihomirov in "ε-entropy and ε-capacity of sets in functional spaces", American Mathematical Society Translations (Series 2), 17:277-364 (1961).
