Foundations of Deep Learning (Hugo Larochelle, Twitter)
zij_FTbJHsk • 2016-09-27
That's good. All right. Cool. So, I was asked to give this presentation on the foundations of deep learning, which is mostly going over basic feed-forward neural networks, motivating deep learning a little bit, and touching on some of the more recent developments and some of the topics that you'll see across the next two days. As Andrew mentioned, I have just an hour, so I'm going to go fairly quickly on a lot of these things, which I think will mostly be fine if you're familiar enough with some machine learning and a little bit about neural nets. But if you'd like to go into some of the more specific details, you can go check out my online lectures on YouTube. They're taught by a much younger version of myself; just search for my name, Hugo Larochelle. I am not the guy doing a bunch of skateboarding; I'm the geek teaching about neural nets. So go check those out if you want more details. So, what I'll cover today is
the following. I'll start by laying out the notation for feed-forward neural networks, that is, models that take an input vector x, which might be an image or some text, and produce an output f(x). I'll describe forward propagation, the different types of units, and the types of functions we can represent with them. Then I'll talk about how we actually train neural nets, describing things like loss functions and backpropagation, which allows us to get a gradient for training with stochastic gradient descent, and I'll mention a few tricks of the trade: some of the things we do in practice to successfully train neural nets. I'll end by talking about some developments that are specifically useful in the context of deep learning, that is, neural networks with several hidden layers, and that came out after the beginning of deep learning, say in 2006: things like dropout, batch normalization, and, if I have some time, unsupervised pre-training. So let's get started and talk about, assuming we have some neural network, how it actually functions, how it makes predictions.
So let me lay down the notation. A multi-layer feed-forward neural network is a model that takes as input some vector x, which I'm representing here with a different node for each of the dimensions in my input vector, so each dimension is essentially a unit in that neural network. It eventually produces an output at its output layer. We'll focus mostly on classification, so you'll have multiple output units, and each unit corresponds to one of the potential classes into which we might want to classify our input. If we're identifying digits in handwritten character images, you'd have 10 digits, zero through nine, so you'd have 10 output units. To produce an output, the neural net goes through a series of hidden layers. Those are essentially the components that introduce the nonlinearity that allows us to capture and perform very sophisticated types of classification functions. If we have L hidden layers, the way we compute all the layers in our neural net is as follows. We first compute what I'm going to call a pre-activation, which I'll note a, and I'll index the layers by k. So a(k) is the pre-activation at layer k, and it is simply a linear transformation of the previous layer. I'll note h(k) as the activation at layer k, and by default I'll assume that layer zero is the input, h(0) = x. Using that notation, the pre-activation at layer k corresponds to taking the activation at the previous layer k-1, multiplying it by a matrix W(k), which holds the parameters of the layer (these essentially correspond to the connections between the units of adjacent layers), and adding a bias vector b(k), another parameter of the layer: a(k) = W(k) h(k-1) + b(k). That gives me the
pre-activation. Next, I get the hidden layer activation by applying an activation function, which introduces nonlinearity into the model. I'll call that function g, so h(k) = g(a(k)), and we'll go over a few common choices for the activation function. I do this from layer 1 to layer L. When it comes to the output layer, I also compute a pre-activation by performing a linear transformation, but then I usually apply a different activation function depending on the problem I'm trying to solve. Having said that, let's go through some of the choices for the activation function.
One common activation function you'll see is the sigmoid. It's just one divided by one plus the exponential of minus the pre-activation: 1 / (1 + exp(-a)). It takes the pre-activation, which can vary from minus infinity to plus infinity, and squashes it between zero and one; it's bounded below by zero and above by one. So it's a function that saturates if you have pre-activations of very large magnitude, positive or negative.

Another common choice is the hyperbolic tangent, or tanh, activation function, in this picture here. It also squashes everything, but instead of being between zero and one, it's between minus one and one.
And one that has become quite popular in neural nets is what's known as the rectified linear activation function; in papers you'll see "ReLU unit", which refers to the use of this activation function. This one is different from the others in that it's not bounded above, but it is bounded below: it outputs exactly zero if the pre-activation is negative.
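As a minimal sketch (assuming NumPy; these are standard definitions, not anything specific to the slides), the three activation functions can be written as:

```python
import numpy as np

def sigmoid(a):
    # squashes pre-activations into (0, 1); saturates for large |a|
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # squashes pre-activations into (-1, 1)
    return np.tanh(a)

def relu(a):
    # outputs exactly zero for negative pre-activations, identity otherwise
    return np.maximum(0.0, a)
```

All three apply element-wise, so they work on a whole pre-activation vector at once.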
Those are the choices of activation functions for the hidden layers. For the output layer, if we're performing classification, as I said, we'll have as many units as there are classes to which an input could belong. What we often do is interpret each unit's activation as the probability, according to the neural network, that the input belongs to the corresponding class, i.e., that its label y is the corresponding class c, where c is the index of that unit in the output layer. So we need an activation function that produces probabilities, a multinomial distribution over all the different classes. The activation function we use for that is known as the softmax activation function. It works as follows: you take your pre-activations and exponentiate them, which gives positive numbers, and then you divide each of the exponentiated pre-activations by the sum of all the exponentiated pre-activations. Because I'm normalizing this way, all the values in my output layer will sum to one, and they're positive because I took the exponential, so I can interpret them as a multinomial distribution over the choice of the C different classes. That's what I'll use as the activation function at the output layer.
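Putting the pieces together, here is a hedged sketch of the forward pass just described (assuming NumPy; `Ws` and `bs` are hypothetical lists holding the per-layer parameters W(k) and b(k), with the last entry belonging to the output layer):

```python
import numpy as np

def softmax(a):
    # exponentiate, then normalize so the outputs sum to one
    e = np.exp(a - a.max())        # subtracting the max is a standard stability trick
    return e / e.sum()

def forward(x, Ws, bs, g=np.tanh):
    # h(0) = x; hidden layers 1..L apply activation g to the pre-activation,
    # and the output layer applies softmax to get class probabilities
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = W @ h + b              # pre-activation a(k) = W(k) h(k-1) + b(k)
        h = g(a)                   # activation h(k) = g(a(k))
    return softmax(Ws[-1] @ h + bs[-1])
```

To classify, you would then take the argmax of the returned probability vector.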
Now, beyond the math, conceptually, and also in the way we'll often program neural networks, what we do is implement all these different operations, the linear transformations and the different types of activation functions, as objects. Each object takes arguments, which are essentially the other things being combined to produce the next value. For instance, we might have an object corresponding to the computation of a pre-activation, which takes as arguments the weight matrix and bias vector for that layer, plus some layer to transform, and which computes its value by applying the linear transformation. Then we might have objects corresponding to specific activation functions, like a sigmoid object, a tanh object, or a ReLU object. We just combine these objects together, chaining them into what ends up being a graph, which I'll refer to as a flow graph, representing the computation done when you do a forward pass in your neural network, up until you reach the output layer. I mention it now because the different software libraries presented over the weekend essentially exploit this representation of the computation in neural nets, and it will also be handy for computing gradients, which I'll talk about in a few minutes.
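As an illustrative sketch of this object view (the class and method names here are hypothetical, not taken from any particular library), chaining modules into a forward pass might look like:

```python
import numpy as np

class Linear:
    # computes the pre-activation a = W h + b; W and b are its parameters
    def __init__(self, W, b):
        self.W, self.b = W, b
    def fprop(self, h):
        return self.W @ h + self.b

class Sigmoid:
    # element-wise sigmoid activation applied to a pre-activation
    def fprop(self, a):
        return 1.0 / (1.0 + np.exp(-a))

def run_graph(x, modules):
    # walking the chained objects in order is the forward pass of the flow graph
    h = x
    for m in modules:
        h = m.fprop(h)
    return h
```

Each box in the flow graph is one such object; the graph structure is just the order in which the objects feed into one another.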
And so that's how we perform predictions in neural networks: we take an input and eventually reach an output layer that gives us a distribution over classes, if we're performing classification. If I want to actually classify, I just assign the class corresponding to the unit with the highest activation, which corresponds to classifying into the class with the highest probability according to the neural net.

But then you might ask: what kind of problems can we solve with neural networks? Or, more technically, what kind of functions can we represent, mapping from some input x into some arbitrary output? If you go look at my videos, I try to give more intuition as to why we have this result, but essentially, if we have a single-hidden-layer neural network, it has been shown that, with a linear output, we can approximate any continuous function arbitrarily well, as long as we have enough hidden units. That is, there is a value for the biases and the weights such that any continuous function can be represented as well as I want; I just need to add enough hidden units. This result applies if you use nonlinear activation functions like sigmoid and tanh. As I said, if you want a bit more intuition as to why that would be, you can go check out my videos. But that's a really nice result: it means that by focusing on this family of machine learning models, neural networks, I can potentially represent pretty much any kind of classification function. However, this result does not tell us how to actually find the weight and bias values that represent a given function; it doesn't tell us how to train a neural network. That's what we'll discuss next: how do we actually train a neural network, from a dataset, to perform good classification for a given problem?
What we'll typically do is use a very generic framework in machine learning known as empirical risk minimization, or structural risk minimization if you're using regularization. This framework essentially transforms the problem of learning into a problem of optimization. We first choose a loss function, which I'll note L. The loss function compares the output of my model, the output layer of my neural network, with the actual target. (I'm indexing with a superscript t here, as the index over the different examples in my training set.) So my loss function tells me whether this output is good or bad, given that the label is actually y. I'll also define a regularizer. Theta here is just the concatenation of all my biases and all of my weights in my neural net, so all the parameters of my neural network, and the regularizer essentially penalizes certain values of these weights. As I'll discuss more specifically later, for instance, you might want your weights not to be too far from zero; that's a frequent intuition we implement with a regularizer.

So the optimization problem we'll try to solve when learning is to minimize the average loss of my neural network over my training examples, summing over all capital T training examples, plus some hyperparameter lambda, known as the weight decay, times my regularizer. In other words, I'm going to try to make my loss on my training set as small as possible over all the training examples, while also satisfying my regularizer as much as possible. Learning then corresponds to trying to solve this optimization problem: finding this argmin over my weights and my biases. To do this, I can invoke some optimization procedure from the optimization community.
And the one algorithm that you'll see constantly in deep learning is stochastic gradient descent. This is the optimization algorithm we'll most often use for training neural networks. SGD, stochastic gradient descent, functions as follows. You first initialize all of your parameters, that is, find initial values for all the weight matrices and all the biases. Then, for a certain number of epochs (an epoch is a full pass over all my training examples), I draw each training example, a pair of input x and target y, and compute the gradient of my loss with respect to all of my parameters, all my weights and all my biases. That's what this notation means: nabla for the gradient of the loss function, indexed by the parameter with respect to which I want the gradient. So I compute the gradient of my loss function with respect to my parameters, plus lambda times the gradient of my regularizer, and that gives me a direction in which to move my parameters. Since the gradient tells me how to increase the loss, I want to go in the opposite direction and decrease it, so my direction is the opposite of the gradient; that's why there's a minus sign here. This delta is the direction in which I'll move my parameters, by taking a step. The step is just a step size alpha, often referred to as a learning rate, times my direction, which I add to the current values of my parameters, my biases and my weights. That gives me the new values for all of my parameters. And I iterate like that: going over all pairs (x, y), computing my gradient, taking a step of size alpha in the opposite direction, and doing that several times. Okay, so that's how stochastic gradient descent works, and it's essentially the learning procedure. Now, in this algorithm
there are a few things we need to specify to be able to implement and execute it: a choice of loss function; an efficient procedure for computing the gradient of the loss with respect to my parameters; a regularizer, if we want one; and a way of initializing my parameters. So next, I'll go through each of these four things we need to choose before actually being able to execute stochastic gradient descent.
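The SGD procedure above can be sketched generically as follows (hypothetical helper names; `grad_fn` stands in for the gradient computation by backpropagation, which the lecture covers next, and the regularizer here is assumed to be L2, so its gradient is twice the weights):

```python
import numpy as np

def sgd(params, grad_fn, data, alpha=0.01, lam=0.0, n_epochs=10):
    """Minimal stochastic gradient descent sketch.

    params:  list of NumPy arrays (all the weights and biases, i.e. theta)
    grad_fn: returns the list of loss gradients for one (x, y) pair
    alpha:   step size (learning rate); lam: weight decay
    """
    for epoch in range(n_epochs):              # an epoch is a full pass over the data
        for x, y in data:                      # draw each training example
            grads = grad_fn(params, x, y)
            for p, g in zip(params, grads):
                delta = -(g + 2.0 * lam * p)   # opposite of loss + regularizer gradient
                p += alpha * delta             # take a step of size alpha
    return params
```

In practice the examples would also be shuffled each epoch; that detail is omitted here.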
So first, the loss function. As I said, we interpret the output layer as assigning probabilities to each potential class to which input x could belong. In this case, something natural is to try to maximize the probability of the correct class, the actual class to which my example x(t) belongs; I'd like to increase the probability assigned to it by my neural network. And because we set up the problem as minimizing a loss, instead of maximizing the probability, what we'll actually do is minimize the negative log probability, that is, the negative log-likelihood of assigning x to the correct class y. So given my output layer and the true label y, my loss is minus the log of the probability of y according to my neural net: I take my output layer and index the unit corresponding to the correct class (that's why I'm indexing by y here). We take the log because it turns out to be more stable numerically, and we get nicer-looking gradients. In some software you'll see this referred to not as the negative log-likelihood or log probability but as the cross-entropy. That's because you can think of it as a sum over all possible classes where, for each class c, an indicator function checks whether this potential class is the target class (it is one if y equals c), multiplied by the log of the probability assigned to class c. This expression is the cross-entropy between the empirical distribution, which assigns probability one to the correct class and zero to all the other classes, and the actual distribution over classes that my neural net computes, which is f(x). That's just a technical detail; I only mention it because in certain libraries it's actually called the cross-entropy loss.
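As a small sketch (assuming NumPy), the negative log-likelihood of the correct class can be written both directly and as a cross-entropy against a one-hot target; the two forms give the same value:

```python
import numpy as np

def nll(f_x, y):
    # minus the log-probability the network assigns to the correct class y
    return -np.log(f_x[y])

def cross_entropy(f_x, y):
    # same quantity, written as a sum over classes with a one-hot indicator
    e_y = np.zeros_like(f_x)
    e_y[y] = 1.0                      # one-hot empirical distribution
    return -np.sum(e_y * np.log(f_x))
```

Only the term for the correct class survives the sum, which is why libraries can use either name for the same loss.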
So that's for the loss. Then we also need a procedure for computing the gradient of my loss with respect to all the parameters of my neural net, the biases and the weights. You can go look at my videos if you want the actual derivation of all of these expressions; I don't have time for that, and presumably a lot of you have seen these derivations (if you haven't, just go check out the videos). In any case, I'm going to go through what the algorithm is and highlight some of the key points that will come up later in understanding how backpropagation actually functions. The basic idea is that we compute gradients by exploiting the chain rule: we go from the top layer all the way to the bottom, computing gradients for layers closer and closer to the input as we go, and using the chain rule to reuse computations made at upper layers to compute the gradients at the layers below. We usually start by computing the gradient at the output layer, that is, the gradient of my loss with respect to my output layer; actually, it's more convenient to compute the gradient of the loss with respect to the pre-activation, and it turns out to be a very simple expression. That's why I have the gradient, with respect to the vector a(L+1), the pre-activation at the very last layer, of the loss function, which is -log f(x)_y.
And it turns out this gradient is super simple: it's -(e(y) - f(x)), where e(y) is the one-hot vector for class y. What this means is that e(y) is just a vector filled with a bunch of zeros and a one at the correct class; if y were the fourth class, it would be the vector with a one at the fourth dimension. We call it the one-hot vector: full of zeros, with a single one at the position corresponding to the correct class. So what the first part of the gradient is essentially saying is that I want to increase the pre-activation of the correct class, which will increase its probability. And I subtract the current probabilities assigned by my neural net to all the classes: that's f(x), my output layer, the current beliefs of the neural net about the probability of the input belonging to each class. So that part is essentially trying to decrease the probability of everything, specifically decreasing each class as much as the neural net currently believes the input belongs to it. If you think about the subtraction of these two things: for the correct class, it's one minus some number between 0 and 1 (a probability), so it's positive, and I'm going to increase the probability of the correct class. For everything else, it's zero minus a positive number, so it's negative, and I'm actually going to decrease the probability of everything else. So intuitively it makes sense; this gradient has the right behavior. And I'm going to
take that pre-activation gradient and propagate it from the top to the bottom, iterating from the last layer, the output layer L+1, all the way down to the first layer. As I go down, I compute the gradients with respect to my parameters, then compute the gradient for the pre-activation at the layer below, and iterate like that. At each iteration of that loop, I take the current gradient of the loss function with respect to the pre-activation at the current layer, and from it I can compute the gradient of the loss function with respect to my weight matrix. Not doing the derivation here, it's simply this pre-activation gradient vector (in my notation, all vectors are column vectors) multiplied by the transpose of the activations at the layer right below, layer k-1. Because I take the transpose, it's essentially an outer product between these two vectors, which gives a matrix of the same size as my weight matrix, so it all checks out. It turns out the gradient of the loss with respect to the bias is exactly the gradient of the loss with respect to the pre-activation, so that's very simple. That gives me my gradients for my parameters. And now I need to compute the gradient of the pre-activations at the layer below. First, I get the gradient of the loss function with respect to the activation at the layer below: that's just taking my pre-activation gradient vector and multiplying it by the transpose of my weight matrix. A super simple operation: just a linear transformation of my gradients at layer k to get my gradients of the activation at layer k-1. Then, to get the gradients of the pre-activation, so before the activation function, I take this gradient and apply the partial derivative of my nonlinear activation function: an element-wise product between the activation-gradient vector and the vector of partial derivatives of the activation function for each unit individually, put together into a vector. Now, the key things to notice. First, this backward pass, computing all the gradients through all these iterations, is actually fairly cheap; its complexity is essentially the same as a forward pass. All I'm doing is linear transformations, multiplying by matrices (in this case, the transpose of my weight matrix), plus this nonlinear operation where I multiply element-wise by the gradient of the activation function. The second thing to notice is this element-wise product: if any of those derivative terms for a unit is very close to zero, then the pre-activation gradient is going to be close to zero for the next layer.
And I highlight this point because it's something to think about a lot when you're training neural nets: whenever these partial derivatives come close to zero, the gradient will not propagate well to the next layer, which means you're not going to get a good gradient to update your parameters.
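Under the assumptions of the slides (softmax output, negative log-likelihood loss), the backward pass just described can be sketched as follows. This is a hypothetical function mirroring the lecture's notation, not a reference implementation; note that `g_prime` is assumed to take the activation h(k) itself, which works for sigmoid and tanh whose derivatives can be written in terms of their output:

```python
import numpy as np

def backprop(y, Ws, hs, f_x, g_prime):
    """Sketch of backpropagation for a softmax output with NLL loss.

    Ws:      list of weight matrices, bottom to top
    hs:      list of activations [h(0) = x, h(1), ..., h(L)]
    f_x:     the softmax output of the network
    g_prime: derivative of the hidden activation, as a function of h(k)
             (e.g. lambda h: 1.0 - h ** 2 for tanh)
    """
    grads_W, grads_b = [], []
    e_y = np.zeros_like(f_x)
    e_y[y] = 1.0                                     # one-hot target vector
    grad_a = -(e_y - f_x)                            # gradient at output pre-activation
    for k in reversed(range(len(Ws))):
        grads_W.insert(0, np.outer(grad_a, hs[k]))   # dL/dW: outer product with h below
        grads_b.insert(0, grad_a)                    # dL/db equals the pre-activation grad
        if k > 0:
            grad_h = Ws[k].T @ grad_a                # gradient of the activation below
            grad_a = grad_h * g_prime(hs[k])         # element-wise product with g'
    return grads_W, grads_b
```

The loop's cost per layer is one matrix multiply plus one element-wise product, matching the claim that the backward pass has the same complexity as the forward pass.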
Now, when does that happen? When will you see these terms being close to zero? That's going to be when the partial derivatives of these nonlinear activation functions are close to zero, or exactly zero. We can look at the partial derivative of, say, the sigmoid function. It turns out it's super easy to compute: it's just the sigmoid itself times one minus the sigmoid itself. That means that whenever the activation of a sigmoid unit is close to one or close to zero, I essentially get a partial derivative that's close to zero. You can kind of see it here: the slope is essentially flat at both ends, and the slope is the value of the partial derivative. In other words, if my pre-activations are very negative or very positive, so if my unit is very saturated, then gradients will have a hard time propagating to the next layer. That's the key insight here. Same thing for the tanh function: it turns out the partial derivative is also easy to compute, you just take the tanh value, square it, and subtract it from one. And indeed, if the activation is close to minus one or close to one, you can see that the slope is flat. So again, if the unit is saturating, gradients will have a hard time propagating to the next layers.
And for the ReLU, the rectified linear activation function, the gradient is even simpler: you just check whether the pre-activation is greater than zero. If it is, the partial derivative is one; if not, it's zero. So you're actually either going to multiply by one or by zero; you essentially get a binary mask when you're backpropagating through the ReLU. You can see it in the plot: the slope is flat on one side, and otherwise you have a linear function. So here the shrinking of the gradient towards zero is even harsher: it's exactly a multiplication by zero if the unit is saturated below.
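These derivatives are cheap precisely because each can be written in terms of quantities the forward pass already computed. A sketch, assuming NumPy, where `h` is the unit's activation and `a` its pre-activation:

```python
import numpy as np

def sigmoid_prime(h):
    # derivative via the activation itself: sigma'(a) = h * (1 - h)
    return h * (1.0 - h)

def tanh_prime(h):
    # tanh'(a) = 1 - tanh(a)^2 = 1 - h^2
    return 1.0 - h ** 2

def relu_prime(a):
    # binary mask: 1 where the pre-activation is positive, 0 elsewhere
    return (a > 0).astype(float)
```

Note how each derivative vanishes exactly in the saturated regimes described above.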
And beyond all the math, in terms of actually using this in practice: during the weekend you'll see three different libraries that essentially compute these gradients for you. You usually don't write down backprop yourself; you just use all of these modules you've implemented, and it turns out there's a way of automatically differentiating your loss function and getting gradients with respect to your parameters for free, in terms of programming effort. Conceptually, the way you do this (and you'll see the three libraries doing it in slightly different ways) is that you augment your flow graph by adding, at the very end, the computation of your loss function. Then each of these boxes, which are conceptually objects taking arguments and computing a value, is augmented with a method for backpropagation; you'll often see the expression "bprop" used for it. What this method does is take as input the gradient of the loss with respect to the object itself, and then propagate to its arguments (its parents in the flow graph, the things it uses to compute its own value), using the chain rule, the gradients of the loss with respect to them. So you start the process by initializing: the gradient of the loss with respect to itself is one, so you pass one to the bprop method of the loss node, and it propagates to its argument, using the chain rule, the gradient of the loss with respect to f(x). Then you call bprop on that object: it has the gradient of the loss with respect to itself, f(x), and from it computes the gradient of the loss with respect to its argument, the pre-activation below, reusing the computation it just received and updating it using what is essentially the Jacobian. Then the pre-activation node, which now knows the gradient of the loss with respect to itself, propagates to the weights, the biases, and the layer below, informing them of the gradient of the loss with respect to themselves. And you continue like this, essentially going through the flow graph, but in the opposite direction.
The basic Torch library essentially functions like this quite explicitly: you construct and chain these elements together, and when you perform backpropagation, you go through the chained elements in reverse order. Then you have libraries like torch-autograd, Theano, and TensorFlow, which you'll learn about, that do things in slightly more sophisticated ways; you'll hear about that later on.
Okay, so that's a discussion of how you actually compute gradients of the loss with respect to the parameters; that's another component we need in stochastic gradient descent. Next, we can choose a regularizer. One that's often used is L2 regularization: just the sum of the squares of all the weights, whose gradient is simply twice the weight, so it's a super simple gradient to compute. We usually don't regularize the biases. There's no particularly important reason for that; it's just that there are much fewer biases, so it seems less important. And this L2 regularization is often referred to as weight decay, so if you hear about weight decay, that often refers to L2 regularization.
and then finally uh and this is also a
very important point uh you have to
initialize the parameters before you
actually start doing backdrop and there
are a few tricky cases you need to make
sure that you uh don't fall into. So the
biases often we initialize them to zero.
There are certain exceptions but for the
most part we initialize them to zero.
But for the weights there are a few things we can't do. We can't initialize the weights to zero, especially if you have tanh activations. The reason, and I won't explain it here, but it's not a bad exercise to try to figure out why, is essentially that when you do your first pass you're going to get gradients for all your parameters that are zero. So you're going to be stuck at this zero initialization. So we can't do that. We also can't initialize all the weights to exactly the same value. If you think about it a little bit, what's going to happen is essentially that all the weights coming into a unit within a layer are going to have exactly the same gradients, which means they're going to be updated exactly the same way, which means they're going to stay the same the whole time. So it's as if you have multiple copies of the same unit. So you essentially have to break that initial symmetry that you would create if you initialized everything to the same value. What we end up doing most of the time is initializing the weights to some randomly generated value. There are a few recipes, but one of them is to initialize them from a uniform distribution between a lower and an upper bound. The recipe shown here is often used and has some theoretical grounding; it was derived specifically for the tanh. There's a paper by Xavier Glorot and Yoshua Bengio you can check out for some intuition as to how you should initialize the weights, but essentially they should initially be random, to break symmetry, and close to zero, so that initially the units are not already saturated. Because if the units are saturated, then no gradients are going to pass through them, and you're essentially going to get gradients very close to zero at the lower layers. So that's the main intuition: weights that are small and random.
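For illustration, that initialization could look like this. This is a minimal NumPy sketch of the tanh recipe from the Glorot and Bengio paper, not the exact code from the talk; the function name is mine:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_out, fan_in) weight matrix from U[-b, b] with
    b = sqrt(6 / (fan_in + fan_out)): the tanh recipe from the
    Glorot & Bengio paper mentioned above."""
    b = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-b, b, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W = glorot_uniform(784, 100, rng)  # e.g. a 784 -> 100 first layer
b_vec = np.zeros(100)              # biases initialized to zero
```

The weights come out small and random, which breaks symmetry without saturating the tanh units at the start.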
Okay, so those are all the pieces we need for running stochastic gradient descent. That allows us to take a training set, run a certain number of epochs, and have the neural net learn from that training set. Now there are other quantities in our neural network that we haven't specified how to choose. Those are the hyperparameters. Usually we're going to have a separate validation set; most people here are familiar with machine learning, so that's a typical procedure. And then we need to select things like: how many layers do I want? How many units per layer? What's the step size, the learning rate of my stochastic gradient descent procedure, that alpha number? What weight decay am I going to use? A standard thing in machine learning is to perform a grid search. That is, if I have two hyperparameters, I list out a bunch of values I want to try. For the number of hidden units maybe I want to try 100, 1,000, and 2,000, say, and for the learning rate maybe I want to try 0.01 and 0.001. A grid search would just try all combinations of these three values for the hidden units and these two values for the learning rate. That means that the more hyperparameters there are, the number of configurations you have to try out blows up and grows exponentially. So
another procedure that is now more and more common, and more practical, is to perform a form of random search. In this case, for each hyperparameter you determine a distribution of likely values you'd like to try. For the number of hidden units maybe I use a uniform distribution over all integers from 100 to 1,000, say, or maybe a log-uniform distribution; and for the learning rate maybe again a log-uniform distribution, but from 0.001 to 0.01, say. Then, to get values for my hyperparameters to do an experiment with and get a performance on my validation set, I just independently sample from these distributions for each hyperparameter to get a full configuration for my experiment. And because I have this way of getting one experiment, I do it independently for all of the experiments I will do. So in this case, if I know I have enough compute power to do 50 experiments, I just draw 50 independent samples from these distributions over hyperparameters, perform the 50 experiments, and take the best one. What's nice about it is that, unlike grid search, there are never any holes in the grid. You just specify how many experiments you do; if one of your jobs died, well, you just have one less, but there's no hole in your experiments. Another reason this approach is particularly useful is that in grid search you might have a specific value for one of the hyperparameters that just makes the experiment not work at all. Learning rates are a lot like this: if you have a learning rate that's too high, it's quite possible that the optimization will not converge. With a grid search, that means all the experiments using that specific value of the learning rate are going to be garbage; they're not going to be useful. You don't really get this sort of big waste of computation with random search, because most likely all the values of your hyperparameters are going to be unique, since they're sampled, say, from a uniform distribution over some range. So that actually works quite well, and it's quite recommended. There are also more advanced methods, like methods based on machine learning, Bayesian optimization, sometimes known as sequential model-based optimization, that I won't talk about but that work a bit better than random search. So another alternative, if you think you have an issue finding good hyperparameters, is to investigate some of these more advanced methods.
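The random-search procedure just described can be sketched as follows. This is a minimal sketch; the function names and the dummy validation score (standing in for a real training run) are mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_config(rng):
    """Draw one hyperparameter configuration: hidden units uniform
    over the integers [100, 1000], learning rate log-uniform over
    [0.001, 0.01]."""
    return {
        "n_hidden": int(rng.integers(100, 1001)),
        "learning_rate": float(10.0 ** rng.uniform(-3.0, -2.0)),
    }

# 50 independent experiments; in practice each one would train a net
# and report its validation error. A dummy score stands in for that.
configs = [sample_config(rng) for _ in range(50)]

def validation_error(cfg):  # placeholder for a real training run
    return (cfg["learning_rate"] - 0.005) ** 2

best = min(configs, key=validation_error)
```

If a job dies, you simply have 49 results instead of 50; there is no hole in a grid to worry about.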
Now, you do this for most of your hyperparameters, but for the number of epochs, the number of times you go through all of the examples in your training set, what we usually do is not grid search or random search; we use a thing known as early stopping. The idea here is that if I've trained a neural net for 10 epochs, well, training a neural net with all the other hyperparameters kept constant but one more epoch is easy: I just do one more epoch. So I shouldn't start over and do, say, 11 epochs from scratch. What we do instead is track the performance on the validation set as I do more and more epochs. What we will typically see is that the training error will go down, but the validation set error will go down and eventually go back up. The intuition here is that the gap between the performance on the training set and the performance on the validation set will tend to increase, and since the training curve usually cannot go below some bound, eventually the validation set error has to go up, or sometimes it won't necessarily go up but will sort of stay stable. So with early stopping, if we reach a point where the validation set performance hasn't improved for some number of iterations, which we refer to as the lookahead, we just stop, go back to the neural net that had the best performance overall on the validation set, and that's my neural network. So I now have a very cheap way of getting the number of iterations, or the number of epochs, over my training set.
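That lookahead rule can be sketched like this. The helper name and the simulated validation curve are mine, not from the talk:

```python
def early_stopping(next_validation_error, lookahead=5, max_epochs=1000):
    """Train one epoch at a time; stop once the validation error has
    not improved for `lookahead` epochs, and report the best epoch
    (where you would also snapshot the weights)."""
    best_err, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        err = next_validation_error()  # runs one more epoch of training
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= lookahead:
            break
    return best_epoch, best_err

# Simulated validation curve: goes down, then comes back up.
curve = iter([5.0, 4.0, 3.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
epoch, err = early_stopping(lambda: next(curve))
```

With this curve, the best model is the one after epoch 4, and training stops once five epochs have passed with no improvement.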
A few more tricks of the trade. It's always useful to normalize your data; it will often have the effect of speeding up training. This is for real-valued data; binary data we usually keep as it is. What I mean by normalizing is: for each dimension, subtract the average of that dimension in the training set, and then divide by the standard deviation of that dimension, again in my input space.
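That per-dimension standardization could look like this. A minimal NumPy sketch; the function name and the small guard against constant dimensions are mine:

```python
import numpy as np

def normalize(X_train, X_test):
    """Standardize each input dimension using training-set statistics:
    subtract the per-dimension mean, divide by the per-dimension
    standard deviation (with a tiny guard for constant dimensions)."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8
    return (X_train - mu) / sigma, (X_test - mu) / sigma

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_norm, _ = normalize(X, X)
```

Note that the test set is normalized with the training set's statistics, so no test information leaks into preprocessing.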
So this can speed up training. We also often use a decay on the learning rate. There are a few methods for doing this. One that's very simple is to start with a large learning rate and track the performance on the validation set. Once the validation set performance stops improving, you decrease your learning rate by some ratio, maybe divide it by two, and then you continue training for some time. Hopefully the validation set performance starts improving again, and at some point it stops improving, and then you stop, or you divide by two again. So that gives you, using the validation set, an adaptive way of changing your learning rate, and that can work better than having a very small learning rate and waiting for a longer time: you make very fast progress initially and then slower progress towards the end.
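That divide-by-two-on-plateau heuristic could be sketched as a small helper. This is a toy of my own; the talk does not prescribe exact code:

```python
def maybe_decay(lr, val_errors, patience=3, factor=0.5):
    """Halve the learning rate once the validation error has stopped
    improving over the last `patience` epochs."""
    if len(val_errors) > patience and \
       min(val_errors[-patience:]) >= min(val_errors[:-patience]):
        return lr * factor
    return lr

lr_stalled = maybe_decay(0.1, [5.0, 4.0, 3.0, 3.1, 3.2, 3.3])    # plateau
lr_improving = maybe_decay(0.1, [5.0, 4.0, 3.0, 2.9, 2.8, 2.7])  # still improving
```

The plateaued curve triggers a halving; the still-improving curve leaves the learning rate alone.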
Also, I've described so far an approach for training neural nets that is based on a single example at a time, but in practice we actually use what are called mini-batches. That is, we compute the loss function on a small subset of examples, say 64 or 128, take the average of the loss over all the examples in that mini-batch, and compute the gradient of this average loss on the mini-batch. The reason we do this is that it turns out you can very efficiently implement the forward pass over all of the 64 or 128 examples in the mini-batch in one pass: instead of doing matrix-vector multiplications when we compute the pre-activations, we do matrix-matrix multiplications, which are faster than doing multiple matrix-vector multiplications. So in your code there will often be this other hyperparameter, the number of examples in your mini-batch, which is mostly optimized for speed in terms of how quickly training will proceed. Another thing to improve optimization is using something like momentum. That is, instead of using as the descent direction the gradient of the loss function, I'm actually going to track a descent direction which I compute as the current gradient, for my current example or mini-batch, plus some fraction of the previous update direction. That fraction, beta, is now a hyperparameter you have to optimize. What this does is that if the update directions agree across multiple updates, then it will start picking up momentum and actually make bigger steps in those directions.
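The momentum update just described can be sketched on a toy one-dimensional problem. The function name is mine, and minimizing f(w) = w squared stands in for a real loss:

```python
def sgd_momentum_step(w, direction, grad, alpha=0.01, beta=0.9):
    """One update: the descent direction is the current (mini-batch)
    gradient plus beta times the previous descent direction."""
    direction = grad + beta * direction
    return w - alpha * direction, direction

# Minimize f(w) = w^2 (gradient 2w), starting from w = 5.0:
w, d = 5.0, 0.0
for _ in range(200):
    w, d = sgd_momentum_step(w, d, 2.0 * w)
```

Because consecutive gradients here all point the same way, the direction builds up and the iterate converges toward zero faster than plain gradient steps of the same size would.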
Then there are multiple even more advanced methods for having adaptive learning rates. I mention them here very quickly because you might see them in papers. There's a method known as Adagrad, where the learning rate is actually scaled for each dimension, so for each weight and each bias, by the square root of the cumulative sum of the squared gradients. What I track is: I take my gradient vector at each step, do an elementwise square of all the dimensions of my gradient vector, and accumulate that in some variable that I'm noting as gamma here. Then for my descent direction, I take the gradient and do an elementwise division by the square root of this cumulative sum of squared gradients. There's also RMSprop, which is essentially like Adagrad, but instead of a cumulative sum we do an exponential moving average: we take the previous value times some factor, plus one minus this factor times the current squared gradient. And then there's Adam, which is essentially a combination of RMSprop with momentum. It's more involved and I won't have time to describe it here, but it's another method that's often implemented in these different software packages and that people seem to use with a lot of success.
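The Adagrad and RMSprop scalings can be sketched elementwise as follows. A minimal NumPy sketch; the small epsilon guard in the denominator is a common practical addition, not something stated in the talk:

```python
import numpy as np

def adagrad_direction(grad, gamma, eps=1e-8):
    """Adagrad: accumulate the elementwise squared gradients in gamma,
    then divide the gradient by the square root of that cumulative sum."""
    gamma = gamma + grad ** 2
    return grad / (np.sqrt(gamma) + eps), gamma

def rmsprop_direction(grad, gamma, rho=0.9, eps=1e-8):
    """RMSprop: the same scaling, but gamma is an exponential moving
    average of the squared gradients instead of a cumulative sum."""
    gamma = rho * gamma + (1.0 - rho) * grad ** 2
    return grad / (np.sqrt(gamma) + eps), gamma

g = np.array([1.0, -2.0])
d_ada, gamma_ada = adagrad_direction(g, np.zeros(2))
d_rms, gamma_rms = rmsprop_direction(g, np.zeros(2))
```

On the first step, Adagrad's scaled direction has unit magnitude in every dimension, which is exactly the per-dimension rescaling effect described above.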
And finally, in terms of actually debugging your implementations: if you're lucky, you can build your neural network without difficulty using the current tools available in Torch or TensorFlow or Theano, but sometimes you actually have to implement the gradients for a new module, a new box in your flow graph, that isn't currently supported. If you do this, you should check that you've implemented your gradients correctly. One way of doing that is to compare the gradients computed by your code with a finite difference estimate. What you do is, for each parameter, you add some very small epsilon value, say 10^-6, and compute the output of your module; then you subtract the same thing but where you've subtracted the small quantity, and you divide by 2 epsilon. If epsilon converges to zero, you actually get the partial derivative, but if it's just small, it's going to be an approximation. Usually this finite difference estimate will be very close to a correct implementation of the real gradient. So you should definitely do that if you've implemented some of the gradients in your code yourself. And
then another useful thing to do is a very small experiment on a small data set before you run your full experiment on your complete data set. So use, say, 50 examples, just a random subset of 50 examples from your data set, and make sure that your code can overfit that data, can essentially classify it perfectly, given that it has enough capacity that you would think it should get it. If that's not the case, there are a few things you might want to investigate. Maybe your initialization is such that the units are already saturated initially, and so there's no actual optimization happening because the gradients on some of the weights are exactly zero; so you want to check your initialization. Maybe you're using a model you implemented gradients for, and the gradients are not properly implemented. Maybe you haven't normalized your input, which creates some instability, making it harder for stochastic gradient descent to work successfully. Maybe your learning rate is too large, and you should consider trying smaller learning rates; that's actually a pretty good way of getting some idea of the magnitude of the learning rate you should be using. And then, once you actually overfit on your small training set, you're ready to do a full experiment on a larger data set. That said, this is not a replacement for gradient checking. Backprop with stochastic gradient descent is a great algorithm that's very bug resistant: you will potentially see some learning happening even if some of your gradients are wrong or, say, exactly zero. If you're an engineer and you're implementing things, it's fun when code is somewhat bug resistant, but if you're actually doing science and trying to understand what's going on, that can be a complication. So do both gradient checking and a small experiment like that.
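The finite-difference check described a moment ago can be sketched as follows. A minimal NumPy sketch; the function names are mine, and a simple quadratic stands in for a real module:

```python
import numpy as np

def gradient_check(f, grad_f, theta, eps=1e-6):
    """Compare an analytic gradient grad_f against the centered finite
    difference (f(theta + eps*e_i) - f(theta - eps*e_i)) / (2*eps),
    returning the largest absolute discrepancy across parameters."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        approx[i] = (f(theta + e) - f(theta - e)) / (2.0 * eps)
    return float(np.max(np.abs(approx - grad_f(theta))))

# Check the (correct) gradient of f(w) = sum(w^2), which is 2w:
err = gradient_check(lambda w: np.sum(w ** 2),
                     lambda w: 2.0 * w,
                     np.array([0.3, -1.2, 2.0]))
```

A correct gradient implementation gives a discrepancy near floating-point noise; a buggy one typically stands out by many orders of magnitude.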
All right. And so for the last few minutes, I'll actually try to motivate what you'll be learning quite a bit about in the next two days, that is, the specific case of deep learning. I've already told you that if I have a neural net with enough hidden units, theoretically I can represent pretty much any function, any classification function. So why would I want multiple layers? There are a few motivations behind this. The first one is taken directly from our own brains. We know that in the visual cortex, the light that hits our retina eventually goes through several regions, eventually reaching an area known as V1, where you have units, or neurons, that are essentially tuned to simple forms like edges. It then goes on to V4, where the units are tuned for slightly more complex patterns, and then you reach AIT, where you actually have neurons that are specific to certain objects. And the idea here is that perhaps that's also what we want in an artificial vision system. If it's detecting faces, we'd like it to have a first layer that detects simple edges, then another layer that puts these edges together, detecting slightly more complex things like a nose or a mouth or eyes, and then eventually a layer that combines these more abstract units to get something even more abstract, like a complete face.
There's also some theoretical justification for using multiple layers. The early results were mostly based on studying boolean functions, functions whose input you can think of as a vector of just zeros and ones. You could show that if you had essentially a boolean neural network, or a boolean circuit, and you restricted the number of layers of that circuit, then there are certain boolean functions that, to be represented exactly, would need an exponential number of units in each of these layers. Whereas if you allowed yourself multiple layers, you could represent these functions more compactly. So that's another motivation: perhaps with more layers we can represent fairly complex functions in a more compact way.
And then there's the reason that they just work. We've seen in the past few years great success in speech recognition, where deep learning has essentially revolutionized the field, with everyone now using it, and the same thing for visual object recognition, where again deep learning is sort of the method of choice for identifying objects in images.
So then why are we doing this only recently? Why didn't we do deep learning way back when backprop was invented, which was essentially in the 1980s, and even before that? It turns out training deep neural networks is actually not that easy. There are a few hurdles one can be confronted with. I've already mentioned one of the issues, which is that some of the gradients might be vanishing as you go from the top layer to the bottom layer, because we keep multiplying by the derivative of the activation function. So that makes training hard: it could be that the lower layers have very small gradients and are barely moving and exploring the space of correct features to learn for a given problem. Sometimes that's the problem you find: you have a hard time just fitting your data, and you're essentially underfitting. Or it could be that with deeper neural nets, or bigger neural nets, we have more parameters, so perhaps sometimes we're actually overfitting. We're in a situation where the set of functions we can represent with the same neural net, represented by this gray area, does include the right function, but it's so large that, for a finite training set, the odds are that the one I find will be very different from the true classifying function, the real system I'd like to have. In that case I'm essentially overfitting, and that might also be the situation we're in. And unfortunately there are many situations where one problem or the other is observed, overfitting or underfitting. So in the field we have essentially developed tools for fighting both situations, and I'm going to rapidly touch on a few