In Figure 2, you can see that it is
impossible to draw a single line
separating the true from the false
output classifications for this problem.
In order to solve the XOR function, we
need multiple layers of neurons, and
in order to teach multiple layers of
neurons, we need backpropagation.
Gradient Descent Learning
Before we delve into backpropagation, we are going to look
at two techniques which have
allowed neural networks to develop
finely-tuned error measurement:
continuous activation functions and
the delta rule. Together these
techniques allow for an intelligent
learning technique called gradient
descent learning.
With our perceptron, we used a
linear, hard-limiting activation function.
This means that we picked a strict
threshold and decided every output
above the threshold would output high
and every output below the threshold
would output low. With multi-layered
neural networks, we are going to
use a form of sigmoidal (s-shaped),
non-linear activation called the logistic
function:
activation = 1.0 / (1.0 + exp(-input sum))
In other words, each individual neuron's activation is 1 divided by (1 + the exponential of the negative sum of all its weighted inputs). For example, let's
look at a neuron with two input
weights: w0 with a value of 0.3 and w1
with a value of 0.65. If both weights are
connected to input values of 1,
the neuron activation will be computed
as follows:
input sum = (input0 * weight0) + (input1 * weight1)
input sum = (1 * 0.3) + (1 * 0.65) = 0.95
activation = 1.0 / (1.0 + exp(-0.95))
activation = 0.721
If both inputs were activated to
values of 0 instead of 1, the activation
equations would look like this:
input sum = (0 * 0.3) + (0 * 0.65) = 0
activation = 1.0 / (1.0 + exp(0)) = 0.5
As you can see, the logistic activation function squashes every input sum into the range 0 to 1, centered around 0.5.
Now that we have a continuous
activation function, we can think of
neural network error as a curve in
two-dimensional space and use the
delta rule to minimize error during
each learning iteration. We don't need to get too far into the mathematical details here; just imagine that the error curve is made up of points from every possible configuration of network weights, and that the gradient is the slope of the curve at any given point. With each learning iteration, we want to reduce our error by changing individual neuron weights in the most beneficial direction, and we can do this by using the delta rule.
Simply stated, the delta rule chooses the direction of traversal on the error curve which most rapidly reduces our error. The delta rule formula is:
change in weight = learning constant * (desired output - actual output) * f(x) * (1 - f(x))
where f(x) is the logistic activation function described above. The term f(x) * (1 - f(x)) is the derivative of the logistic function, so each weight step is scaled by the slope of the activation curve.
Feedforward
Now that we have an activation
function and a learning function, we
are ready to assemble our network. In
order to solve the XOR problem, we
need two inputs, one output, and four
hidden neurons plus one bias neuron
on both the input and hidden layers.
The network representation is shown
in Figure 3.
From the diagram, you can see
how network layers of nodes and
connections easily translate into arrays
of activation and weight values in our
C program.
DIFFERENT BITS
These networks are called
feedforward because activation flows
through in a forward direction from
inputs to hidden and finally output.
Feedforward networks can have any
number of inputs, outputs, and hidden
neurons, but for the XOR example, this
is all we need.
We calculate the output of the
network by feeding activation from
input to output. For example, if we
start with input 0 = 1 and input 1 = 0,
we would begin by calculating the
activations of each hidden neuron as
we did above, but this time we will
include bias neurons:
hidden neuron 0
input sum = (input 0 * weight 0) + (input 1 * weight 1) + (bias 0 * weight 8)
activation = 1.0 / (1.0 + exp(-input sum))
We compute our final output activation in the same way:
output activation
input sum = (hidden 0 * weight 0) + (hidden 1 * weight 1) + (hidden 2 * weight 2) + (hidden 3 * weight 3) + (bias 1 * weight 4)
activation = 1.0 / (1.0 + exp(-input sum))
Propagate Back
Once we know our output
activation, we can compare it to our
desired output and adjust the weights
of the network toward this output.
Continuing our XOR example, if we
input (1, 0) we would like an output
of 1. If we get an output of 0.43, we
need to adjust the individual weights
to make this happen. We don’t want
to adjust them too much though or
we will make it impossible to get
an output of 0 when we have an
input of (1,1), so we proceed to
tweak the weights ever so slightly
using the delta rule.
We begin by adjusting the weights
connected to the output neuron based
SERVO 09.2007 15