Data processing: artificial intelligence – Neural network – Learning method
Reexamination Certificate
1999-11-23
2004-04-27
Starks, Jr., Wilbert L. (Department: 2121)
C706S012000
active
06728690
FIELD OF THE INVENTION
This invention relates generally to the field of classifiers and in particular to a trainer for classifiers employing maximum margin back-propagation with probability outputs.
COPYRIGHT NOTICE/PERMISSION
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in the drawing hereto: Copyright©1999, Microsoft Corporation, All Rights Reserved.
BACKGROUND
The problem of determining how to classify things is well known. Humans have an innate ability to classify things quickly; many intelligence tests are based on this ability. For a human, it is quite simple to determine whether a coin is a penny, a nickel, or some other denomination. We can tell whether an email is junk mail or spam, which we do not want to read, or whether it is an email from a friend or customer that we need to read quickly. However, many humans do not have a comprehensive knowledge of how to determine whether the word “affect” or “effect” should be used when writing a sentence. There have been many attempts to program computers to classify things. The math involved can get quite complex. It was difficult early on to get a computer to classify coins, but neural networks, which were trained with many sets of input, could do a fairly good job. The training for such simple classifiers was fairly straightforward due to the simplicity of the classifications.
However, classifying things as junk email or spam is much more difficult, and involves the use of complex mathematical functions. The rest of this background section describes some prior attempts at training classifiers. They involve the selection and modification of parameters for functions that compute indications of whether or not a particular input is likely to be in each of multiple categories. They involve the use of training sets, which are fed into classifiers whose raw output is processed to modify the parameters. After several passes through the training set, the parameters converge on what is thought to be a good set of parameters that, when used in the classifier on input not in the training set, produces good raw output representative of the proper categories.
This raw output, however, might not be directly related to the percentage likelihood that a particular category is the right one. This has led to the use of a further function, called a probability transducer, that takes the raw output and produces a probability for each category for the given input. There are many problems that remain with such systems. It is difficult to produce one that provides good performance and does not overfit to the training data. Overfitting to the training data can produce wrong results unless a very large amount of training data is provided. This, however, severely impacts performance.
A standard back-propagation architecture is shown in prior art FIG. 1. Functions used in this description are indicated by numbers in parentheses for ease of reference. Back-propagation is well-known in the art. It is described fully in section 4.8 of the book “Neural Networks for Pattern Recognition” by Christopher M. Bishop, published by Oxford University Press in 1995. The purpose of a back-propagation architecture is to train a classifier 105 (and associated probability transducer 125) to produce probabilities of category membership given an input. The classifier 105 is trained by example. A training set comprising training inputs 100 is provided to the classifier 105. The classifier 105 has parameters 110, which are like adjustable coefficients for functions used to calculate raw outputs 120 indicative of class membership. The parameters 110 are adjusted so that the final probability outputs 135 are close to a set of desired outputs 145 for the training set. Given enough training inputs 100 and associated desired outputs 145, and a sensible choice of classifier 105, the overall system will generalize correctly to inputs that are not identical to those found in the training inputs 100. In other words, it will produce a high probability that the input belongs in the correct category.
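To make the architecture of FIG. 1 concrete, the following is a minimal, self-contained Python sketch of the training loop described above. It assumes, purely for illustration, a one-output linear classifier, a logistic probability transducer, and a squared-error metric trained by gradient descent; none of these names or choices come from the patent itself.

```python
import math

def train(training_inputs, desired_outputs, lr=0.5, epochs=200):
    """Toy back-propagation loop in the spirit of FIG. 1: a one-output
    linear classifier (105) feeding a logistic probability transducer (125),
    with parameters (110) adjusted so the probability output (135)
    approaches the desired output (145)."""
    n = len(training_inputs[0])
    w, b = [0.0] * n, 0.0                                   # parameters 110
    for _ in range(epochs):
        for x, t in zip(training_inputs, desired_outputs):
            y = sum(wj * xj for wj, xj in zip(w, x)) - b    # raw output 120
            p = 1.0 / (1.0 + math.exp(-y))                  # probability output 135
            # Gradient of the (illustrative) squared-error metric 140,
            # E = (p - t)**2 / 2, back-propagated through the transducer.
            dE_dy = (p - t) * p * (1.0 - p)
            w = [wj - lr * dE_dy * xj for wj, xj in zip(w, x)]
            b = b + lr * dE_dy                              # since dy/db = -1
    return w, b

# Example: learn the logical OR of two binary inputs.
w, b = train([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 1, 1, 1])
```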
The functional form of classifier 105 can take many forms. The output 120 of the classifier 105 can be a single value, if the classifier is designed to make a single category probability. Alternatively, the output 120 of the classifier 105 can be a plurality of values, if the classifier is designed to produce probabilities for multiple categories. For purposes of discussion, the output 120 of a classifier 105 is denoted by a vector y_i, regardless of whether output 120 is a single output or a plurality of outputs. Similarly, one example of training input is denoted by a vector x_j, regardless of the true dimensionality of the training input.
Back-propagation is applicable to any classifier 105 that has a differentiable transfer function that is parameterized by one or more parameters 110. One typical functional form of classifier 105 is linear:

    y_i = \sum_j w_{ij} x_j - b_i        (1)

where the w_ij and b_i are the parameters 110, i indexes the categories, j indexes the components of the input, and there are no internal state variables 115.
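Written out in code, equation (1) is simply a matrix-vector product with a bias. The short Python sketch below (the names are illustrative, not taken from the patent) computes the raw outputs 120 for a multi-category linear classifier.

```python
def linear_classifier(x, w, b):
    """Equation (1): y_i = sum_j w[i][j] * x[j] - b[i].
    x is the input vector; w (weight matrix) and b (biases) together
    are the parameters 110; the result is the raw output 120."""
    return [sum(wij * xj for wij, xj in zip(wi, x)) - bi
            for wi, bi in zip(w, b)]

# Example: two categories, three input components.
y = linear_classifier([1.0, 0.5, -2.0],
                      [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]],
                      [0.1, -0.1])
```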
Another typical functional form of classifier 105 is a linear superposition of basis functions:

    y_i = \sum_k w_{ik} \, \phi_k(\vec{x}, \vec{\theta}_k) - b_i        (2)

where the w_ik, b_i, and θ_k are the parameters 110, and the φ_k are the internal state variables 115. The parameterized functions φ_k can take on many different forms, such as Gaussians, logistic functions, or multi-layer perceptrons, as is well-known in the art (see chapters 4 and 5 of the Bishop book). Other possible forms (e.g., convolutional neural networks) are known in the art.
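Equation (2) can be instantiated in many ways; the sketch below assumes Gaussian basis functions with fixed centers θ_k and a shared width, one plausible choice among the forms listed above.

```python
import math

def gaussian_basis(x, center, width=1.0):
    """One basis function phi_k(x, theta_k): a Gaussian centered at theta_k."""
    sq_dist = sum((xj - cj) ** 2 for xj, cj in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def basis_function_classifier(x, w, b, centers, width=1.0):
    """Equation (2): y_i = sum_k w[i][k] * phi_k(x, theta_k) - b[i].
    The phi values play the role of the internal state variables 115."""
    phi = [gaussian_basis(x, c, width) for c in centers]
    return [sum(wik * pk for wik, pk in zip(wi, phi)) - bi
            for wi, bi in zip(w, b)]
```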
The desired output 120 of a classifier 105 is a probability of a category 135. These probability outputs are denoted as p_i, again regardless of whether there is one probability value or a plurality of probability values. Typically, a probability transducer 125 is applied to the output 120 of the classifier 105 to convert it into a probability. When there is only one output or when the outputs are not mutually exclusive categories, a sigmoidal (or logit) function is typically used:

    p_i = \frac{1}{1 + e^{A_i y_i + B_i}}        (3)

where A_i is typically fixed to −1, while B_i is typically fixed to 0. The parameters A_i and B_i are the fixed parameters 130. When a logit probability transducer is coupled to a linear classifier, the overall system performs a classical logistic regression. When the classifier 105 is deciding between a plurality of mutually exclusive categories, a softmax or winner-take-all function is often used:
    p_i = \frac{e^{A_i y_i + B_i}}{\sum_j e^{A_j y_j + B_j}}        (4)

where A_i is typically fixed to +1 and B_i is typically fixed to 0. Again, the parameters A_i and B_i are the fixed parameters 130.
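Both probability transducers follow directly from equations (3) and (4). The sketch below uses the typical fixed values of A_i and B_i given above; it is an illustration, not code from the patent.

```python
import math

def logit_transducer(y, A=-1.0, B=0.0):
    """Equation (3): p_i = 1 / (1 + exp(A*y_i + B)).
    With A = -1 and B = 0 this is the familiar sigmoid, applied to each
    raw output independently (non-mutually-exclusive categories)."""
    return [1.0 / (1.0 + math.exp(A * yi + B)) for yi in y]

def softmax_transducer(y, A=1.0, B=0.0):
    """Equation (4): p_i = exp(A*y_i + B) / sum_j exp(A*y_j + B),
    used when the categories are mutually exclusive."""
    exps = [math.exp(A * yi + B) for yi in y]
    total = sum(exps)
    return [e / total for e in exps]
```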
The back-propagation architecture will attempt to adjust the parameters 110 of the classifier to cause the probability outputs 135 to be close to the desired outputs 145. The values of the desired outputs 145 are denoted as t_i. These values are typically (although not always) either 0 or 1, depending on whether a specific training input 100 belongs to the category or not.
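For instance, with three mutually exclusive categories, a training input belonging to the second category would have desired outputs t = (0, 1, 0). The helper below is merely one convenient way to construct such targets.

```python
def one_hot_targets(category_index, num_categories):
    """Desired outputs 145: 1 for the category a training input belongs to,
    0 for every other category."""
    return [1.0 if i == category_index else 0.0 for i in range(num_categories)]

t = one_hot_targets(1, 3)   # -> [0.0, 1.0, 0.0]
```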
The closeness of the probability output 135 to the desired output 145 is measured by an error metric 140. The error metric computes an error function for one particular training input, whose corresponding parameters and values are denoted with a superscript (n). The error metric computes a function E^{(n)}(p_i^{(n)}, t_i^{(n)}) which is at a minimum when p_i^{(n)} matches t_i^{(n)}. The o
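The excerpt breaks off before specifying a particular error function. One common choice with the stated property (a minimum exactly when p_i^(n) matches t_i^(n)) is the cross-entropy error, shown below purely as an illustration; it is not necessarily the metric the patent goes on to use.

```python
import math

def cross_entropy_error(p, t, eps=1e-12):
    """An illustrative error metric 140 for one training example:
    E = -sum_i [ t_i * log(p_i) + (1 - t_i) * log(1 - p_i) ],
    minimized when each p_i equals the corresponding t_i."""
    return -sum(ti * math.log(pi + eps) + (1.0 - ti) * math.log(1.0 - pi + eps)
                for pi, ti in zip(p, t))
```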
Meek Christopher A.
Platt John C.
Lyon & Harr LLP
Microsoft Corporation
Watson Mark A.