Foundations and Challenges of Deep Learning (Yoshua Bengio)
11rsu_WwZTc • 2016-09-27
Thank you, Sammy. So I'll tell you about some very high-level stuff today, no new algorithm. Some of you already know about the book that Ian Goodfellow, Aaron Courville and I have written; it's now in presale from MIT Press. I think you can find it on Amazon or somewhere, and the actual shipping of the paper copies should be in December, hopefully in time for NIPS.
So we've already heard this story from several people here, at least from Andrew, I think, but it's good to ponder a little bit these ingredients that seem to be important for deep learning to succeed, and in general for machine learning to succeed at learning really complicated tasks, the kind where we want to reach human-level performance. If a machine is going to be intelligent, it's going to need to acquire a lot of information about the world, and the big success of machine learning for AI has been to show that we can provide that information through data, through examples. But really think about it: that machine will need to know a huge amount of information about the world around us. This is not how we're doing it now, because we're not able to train such big models, but it will come one day, and so we'll need models that are much bigger than the ones we currently have. Of course, that means machine learning algorithms that can represent complicated functions. That's one good thing about neural nets, but there are many other machine learning approaches that in principle allow you to represent very flexible functions, like classical nonparametric methods or SVMs; however, those are going to be missing point 4, and potentially point 5, depending on the method. Point 3, of course,
says you need enough computing power to train and use these big models, and point 5 just says that it's not enough to be able to train the model: you also have to be able to use it in a reasonably efficient way from a computational perspective. This is not always the case with some probabilistic models, where inference, in other words answering questions, having the computer actually do something, can be intractable, and then you need approximations, which may or may not be efficient. Now, the point I really want to talk about is the fourth one: how do we defeat the curse of dimensionality? In other words, if you don't assume much about the world, it's actually impossible to learn about it. So I'm going to tell you a bit about the assumptions behind a lot of deep learning algorithms, which make it possible for them to work as well as we have been seeing in practice in the last few years. (Something wrong, a Microsoft bug... okay.)
So how do we bypass the curse of dimensionality? The curse of dimensionality is about the exponentially large number of configurations of the variables in the space we want to model. The number of joint values that all of the variables we observe can take is going to be exponentially large in general, because of their compositional nature: if each pixel can take two values and you've got a million pixels, then you've got 2 to the 1,000,000 possible images. The only way to beat an exponential is to use another exponential, so we need to make our models compositional. We need to build our models in such a way that they can represent functions that look very complicated, yet these models need to have a reasonably small number of parameters, small in the sense of being small compared to the number of configurations of the variables. We can achieve that by composing little pieces together: composing layers together, composing units on the same layer together, and that's essentially what's happening with deep learning. So you actually have two kinds of composition.
There's the composition happening within a layer: this is the idea of distributed representations, which I'm going to try to explain a bit more. It's what you get when you learn embeddings for words or for images, representations in general. And then there's the idea of having multiple levels of representation, which is the notion of depth, and there another kind of composition takes place. The first one is a kind of parallel composition: I can choose the values of my different units separately, and together they represent an exponentially large number of possible configurations. In the second case there's a sequential composition, where I take the outputs of one level and combine them in new ways to build features for the next level, and so on. So the reason deep learning is working is that the world around us is better modeled by making these assumptions. It's not necessarily true that deep learning is going to work for any machine learning problem; in fact, if we consider the set of all possible distributions that we might have to learn from, deep learning is no better than any other method. That's basically what the no-free-lunch theorem says. It's because we are incredibly lucky to live in a world that can be described using composition that these algorithms are working so well. This is important to really understand.
So before I go a bit more into distributed representations, let me say a few words about non-distributed representations. Think about things like clustering, n-grams for language modeling, classical nearest neighbors, SVMs with Gaussian kernels, classical nonparametric models with local kernels, and decision trees. The way these algorithms really work is actually pretty straightforward if you cut the crap, hide the math, and try to understand what is going on: they look at the data in data space, they break that space into regions, and they use different free parameters for each of those regions to figure out what the right answer should be. It doesn't have to be supervised learning; even in unsupervised learning there is a right answer, which might be the density or something like that.
Okay. And you might think that's the only way of solving a problem: we consider all of the cases, we have an answer for each case, and maybe we interpolate between the cases we've seen. The problem with this is that somebody comes up with a new example which isn't in between two of the examples we've seen, something that requires us to extrapolate, something that's non-trivial generalization, and these algorithms just fail. They don't really have a recipe for saying something meaningful away from the training examples. There's another interesting thing to note here, which I would like you to keep in mind before I show the next slide (it's in red here): we can do a kind of simple counting to relate the number of free parameters that can be learned and the number of regions in data space that we can distinguish. Here we basically have a linear relationship between these two things: for each region I'm going to need at least something like a center for the region, and maybe, if I need to produce an output, an extra set of parameters to tell me what the answer should be in that area. So the number of parameters grows linearly with the number of regions that I'm going to be able to distinguish. The good news is I can have any kind of function: I can break up the space any way I want, and for each of those regions I can have any kind of output I need. For decision trees the regions come from splitting along axes, and so on; what I've drawn here is more like nearest neighbor.
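To make this counting argument concrete, here is a minimal sketch of my own (not from the slides): a histogram-style density estimator is a purely local method, needing one free parameter per region, and on a regular grid the number of regions explodes exponentially with the input dimension.

```python
# Hypothetical illustration: a grid/histogram estimator assigns one free
# parameter (e.g. a count) to each cell, so parameters grow linearly with
# the number of regions -- and the number of cells explodes with dimension.

def grid_cells(bins_per_axis, dim):
    """Number of regions (= free parameters) for a regular histogram estimator."""
    return bins_per_axis ** dim

for d in (1, 2, 10, 20):
    print(f"dim={d}: {grid_cells(10, d)} cells")
```

With 10 bins per axis, even 20 input dimensions already require 10^20 parameters (and a comparable number of examples), which is the curse being described.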
(Now another bug... I hope this works this time. I have another option. Sorry about this.)
Okay, so here's the point of view of distributed representations for solving the same general machine learning problem. We have a data space and we want to break it down, but we're going to break it down in a way that's not fully general. We're going to break it down in a way that makes assumptions about the data, but that is compositional and is going to allow us to be exponentially more efficient. How are we going to do this? In the picture on the right, what you see is a way to break the input space by the intersection of half-planes, and this is the kind of thing that happens at the first layer of a neural net. Imagine the input is 2-dimensional, so I can plot it here, and I have three binary hidden units, C1, C2, C3. Because they're binary, you can think of them as little binary classifiers, and because it's only a one-layer net, what each of them does is a linear classification. So those colored hyperplanes are the decision surfaces for each of them. Now, these three bits can take eight values, corresponding to whether each unit is on or off, and those different configurations of the bits correspond to seven regions here, because one of the eight configurations is not feasible. So you see that we're defining a number of regions corresponding to all of the possible intersections of the half-planes. Now we can play the game of how many regions we get for how many parameters, and what we see is that if we grow the number of features and also the number of inputs, we get an exponentially large number of regions, all of these intersections corresponding to different binary configurations, yet the number of parameters grows only linearly with the number of units. On top of that you could imagine a linear classifier; that's the one-hidden-layer neural net. So the number of parameters grows just linearly with the number of features, but the number of regions to which the network can give a different answer grows exponentially. So this is very cool.
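This region counting can be checked numerically. Here is a small toy of my own (not the talk's code): probe a 2-D input space with random hyperplanes, record the binary on/off code of each probe point, and count the distinct codes that actually occur. In the plane, n lines in general position create at most 1 + n + n(n-1)/2 regions, so for n = 3 at most 7 of the 8 bit patterns are feasible, exactly as in the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_regions(n_units, n_samples=200_000):
    """Count distinct sign patterns (regions) of random hyperplanes in 2-D."""
    W = rng.normal(size=(n_units, 2))            # one hyperplane per hidden unit
    b = rng.normal(size=n_units)
    x = rng.uniform(-5, 5, size=(n_samples, 2))  # dense probe of the plane
    patterns = (x @ W.T + b > 0)                 # binary code of each point
    return np.unique(patterns, axis=0).shape[0]

for n in (1, 2, 3, 10):
    # in 2-D, n lines create at most 1 + n + n*(n-1)//2 regions
    print(n, count_regions(n), 1 + n + n * (n - 1) // 2)
```

The number of parameters here is 3n (two weights and a bias per unit), growing linearly, while the attainable region count grows much faster, which is the parallel, distributed kind of composition.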
And the reason it's very cool is that it allows those neural nets to generalize: while we're learning about each of those features, we can generalize to regions we've never seen, because we've learned enough about each of those features separately. I'm going to give you an example of this in a couple of slides; actually, let's do it now. Think about those features. The input is an image of a person, and think of the features as things like a detector that says the person wears glasses, another unit that detects whether the person is female or male, another unit that detects whether the person is a child or not, and you can imagine hundreds or thousands of these things, of course. The good news is you could imagine learning about each of these feature detectors, these little classifiers, separately. In fact you could do better than that: you could share intermediate layers between the input and those features. But let's take even the worst case and imagine we train them separately, which is the case in the linear model I showed before, with a separate set of parameters for each of these detectors. If I have n features and each of them needs, say, order of k parameters, then I need order of nk parameters, and I need order of nk examples. One thing you should know from machine learning theory is that if you have order of p parameters, you need order of p examples to do a reasonable job of generalizing. You can get around that by regularizing and effectively having fewer degrees of freedom, but to keep things simple, you need about the same number of examples, or maybe ten or a hundred times more, as the number of really free parameters. So now the relationship between the number of regions I can represent and the number of examples I need is quite nice, because the number of regions is going to be 2 to the number of these binary features. A person could wear glasses or not, be female or male, be a child or not, and I could have a hundred of these things, and I could probably recognize reasonably well all of those 2 to the 100 configurations of people, even though I've obviously not seen all 2 to the 100 configurations. Why am I able to do that? Because the model can learn about each of these binary features kind of independently, in the sense that I don't need to see every possible configuration of the other features to learn about wearing glasses. I can learn about wearing glasses even though I've never seen somebody who was a female and a child and chubby and had yellow shoes; if I have seen enough examples of people wearing glasses, I can learn about wearing glasses in general. I don't need to see all of the configurations of the other features to learn about one feature. So this is really why this thing works: we're making assumptions about the data, namely that those features are meaningful by themselves, and you don't need to actually have data for each of the exponentially many regions in order to learn the proper way of detecting or discovering these intermediate features.
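The arithmetic of that argument is worth writing down explicitly. A back-of-the-envelope sketch (the numbers n = 100 and k = 50 are invented for illustration): parameters and examples grow like n times k, while distinguishable configurations grow like 2 to the n.

```python
# Toy counting from the talk's argument: n independently learnable binary
# features, each with about k parameters, need O(n*k) parameters and
# examples, yet distinguish 2**n configurations (glasses / gender / child...).

def params_needed(n_features, k_per_feature):
    return n_features * k_per_feature

def configurations(n_features):
    return 2 ** n_features

n, k = 100, 50
print(params_needed(n, k))   # linear in n: 5000 parameters (and examples)
print(configurations(n))     # exponential in n: 2**100 regions
```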
Let me add something here: there were some experiments recently showing that this kind of thing is really happening, that the features I was talking about not only exist, as I'm assuming, but the optimization methods, the training procedures, actually discover them. This is an experiment done in 2012 in Antonio Torralba's lab at MIT, where they trained a usual convnet to recognize places, so the outputs of the net are just the types of places: is this a beach scene, an office scene, a street scene, and so on. Then what they did is ask people to analyze the hidden units, to try to figure out what each hidden unit was doing, and they found a large proportion of units for which humans can find a pretty obvious interpretation of what those units like. They see a bunch of units that like people, or different kinds of people, or animals, buildings, ceilings, tables, lighting, and so on. It's as if those neural nets are discovering semantic features (semantic because people can actually give them names) as the intermediate features needed to reach the final goal of classifying scenes. And the reason they're generalizing is that now you can combine those features in an exponentially large number of ways. You could have a scene that has a table, a different kind of lighting, some people, maybe a pet, and you can say something meaningful about the combinations of these things, because the network is able to learn all of these features without having to see all of their possible configurations. So I don't know if my explanation makes sense to you, but now is the chance to ask me a question.
All clear? Usually it's not...

[Audience question about decision trees.] Yeah, with decision trees, to some extent. If the question is, can't we do the same thing with a set of decision trees: yes, in fact this is one of the reasons why forests, or bagged trees, work better than single trees. Forests or bagged trees are like one level deeper than single trees, but they still don't have as much of a distributed aspect as neural nets, and usually they're not trained jointly. I mean, boosted trees are, to some extent, in a greedy way. But yeah, any other question?

[Audience question about non-compositional cases.] Cases where... non-compositional? I don't understand the question. I mean, what do you mean, non-compositional? Compositional structure is everywhere around us. I don't think there are examples of neural nets that really work well where the data doesn't have some kind of compositional structure in it, but if you come up with an example, I'd like to hear about it.
Okay, yes. [Audience question about graphical models.] To think about this issue in graphical-model terms: it can be done, but you have to think not about feature detection, as I've been doing here, but about generating an image or something like that; then it's easier to think about. The same kinds of things happen if you think about how I could generate an image, if you think about the underlying factors: which objects there are, where they are, what their identity is, what their size is. These are all independent factors which you compose together in funny ways; if you were to write a graphics engine, you could see exactly what those ways are. And it's much, much easier to represent that joint distribution using this compositional structure than if you try to work directly in pixel space, which is normally what you would do with a classical nonparametric method, and it wouldn't work. If you look at our best deep generative models for images now, like GANs or VAEs, we're not there yet, but they're amazingly better than anything people could dream up just a few years ago in machine learning.
Okay, let me move on, because I have other things to talk about. This is all kind of hand-wavy, but some people have done some math around these ideas. For example, there's one result from two years ago at ICLR where we studied the single-layer case. We considered a network with rectifiers (ReLUs), and the network of course computes a piecewise-linear function, so one way to quantify the richness of the functions it can compute (I was talking about regions before, but you can do the same thing here) is to count how many pieces this network has in its input-to-output function. It turns out that this is exponential in the number of inputs; more precisely, it's on the order of the number of units to the power of the number of inputs. So for that sort of distributed representation, there's an exponential kicking in. We also studied the depth aspect. What you need to know about depth is that there's a lot of earlier theory saying that a single layer is sufficient to represent any function; however, that theory doesn't specify how many units you might need, and in fact you might need an exponentially large number of units. What several results show is that there are functions that can be represented very efficiently, with few units and so few parameters, if you allow the network to be deep enough. Out of all the functions that exist (again, it's a luckiness thing) there's a very, very small fraction which happen to be very easy to represent with a deep network, and if you try to represent these functions with a shallow network, you're screwed: you're going to need an exponential number of parameters, and so an exponential number of examples to learn them. But again, we're incredibly lucky that the functions we want to learn have this property. In a sense it's not surprising: we use this kind of compositionality and depth everywhere. When we write a computer program, we don't have a single main; we have functions that call functions. And we were able to show results similar to what I was telling you about for the single-layer case: as you increase depth in these deep ReLU networks, the number of pieces in the piecewise-linear function grows exponentially with depth. So it's already exponentially large with a single layer, but it gets exponentially even larger with a deeper net.
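The piece counting can be illustrated numerically. Below is a rough toy of my own (not the paper's construction): build a random ReLU network, restrict it to a 1-D input line, and count the distinct linear pieces of its input-to-output map by tracking changes in the joint ReLU on/off pattern. A single hidden layer of width n puts at most n breakpoints on the line (so at most n+1 pieces), while deeper stacks can fold the line and multiply the count.

```python
import numpy as np

rng = np.random.default_rng(1)

def num_linear_pieces(widths, n_probe=100_000):
    """Count linear pieces of a random ReLU net along a 1-D input line."""
    x = np.linspace(-10, 10, n_probe).reshape(-1, 1)
    h = x
    in_dim = 1
    pattern = np.zeros((n_probe, 0), dtype=bool)
    for w in widths:
        W = rng.normal(size=(in_dim, w))
        b = rng.normal(size=w)
        pre = h @ W + b
        pattern = np.hstack([pattern, pre > 0])  # which units are "on"
        h = np.maximum(pre, 0)
        in_dim = w
    # the function is linear wherever the joint on/off pattern is constant
    changes = np.any(pattern[1:] != pattern[:-1], axis=1).sum()
    return int(changes) + 1

print("1 hidden layer, width 8:", num_linear_pieces([8]))
print("3 hidden layers, width 8:", num_linear_pieces([8, 8, 8]))
```

The counts vary with the random weights, but the deep net typically shows many more pieces than the shallow one for the same total number of units, which is the flavor of the exponential-in-depth result.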
Okay, so that was the topic of representing functions: why deep architectures can be very powerful, if we're lucky, and we seem to be lucky. Another topic I want to mention that's very much in the foundations is: how is it that we're able to train these neural nets in the first place? In the 90s, many people decided to stop doing research on neural nets because there were theoretical results suggesting that there can be an exponentially large number of local minima in the training objective of a neural net. In other words, the function we want to minimize has many of these holes, and if we start at a random place, what's the chance we're going to find the best one, the one that corresponds to a good cost? That was one of the motivations for people flocking into a very large area of research in machine learning in the 90s and 2000s based on algorithms that rely on convex optimization to train, because of course if we can do convex optimization, we eliminate this problem: if the objective function is convex in the parameters, then we know there's a single global minimum.
So let me show you a picture to give you a sense of this. If you look at the top right: if you draw a random smooth function in 1D or 2D or 3D, like this one in 2D, you see that it's going to have many ups and downs; these are local minima. But the good news is that in high dimension it's a totally different story. The dimensions here are the parameters of the model, and the vertical axis is the cost we're trying to minimize. What happens in high dimension is that instead of having a huge number of local minima in our way when we're trying to optimize, what we encounter instead is a huge number of saddle points. A saddle point is like the thing on the bottom right, in 2D: you have two parameters, the y-axis is the cost you want to minimize, and at a saddle point there are directions where the objective function curves up and other directions where it curves down. So a saddle point has a minimum in some directions and a maximum in other directions. This is interesting because even though saddle points are places where you could get stuck in principle (if you're exactly at the saddle point, you don't move), if you move a little bit away from it, you will go down the saddle. What our work, and other work from NYU by Choromanska and collaborators of Yann LeCun, showed is that in very high dimension, not only is the issue more about saddle points than local minima, but the local minima are good. Let me try to explain what I mean by this.
First, actually, an experiment from the NYU group. They gradually changed the size of the neural net and looked at what look like local minima (though they could be saddle points): the lowest points they could obtain by training. What you're looking at is the distribution of errors they get from different initializations of their training. What happens is that when the network is small, like the pink curve on the right, there's a wide distribution of costs you can end up with, depending on where you start, and they're pretty high. If you increase the size of the network, all of the local minima you find concentrate around a particular cost, so you don't get any of the bad local minima that you would get with a small network; they're all pretty good. And if you increase the size of the network even more (this is a single-hidden-layer network, nothing very complicated), the phenomenon increases even more; in other words, they all converge to about the same cost.
So let me try to explain what's going on. Go back to the picture of the saddle point, but instead of being in 2D, imagine you are in a million dimensions; in fact people have billion-dimensional networks these days (I'm sure Andrew has even bigger ones, but I'm not sure). What happens in this very high-dimensional space of parameters is that, unless things are really bad for you, so if you imagine a little bit of randomness in the way the problem is set up, which seems to be the case, then in order to have a true local minimum you need the curvature to go up in all of the billion directions. If there is some probability of each of these events, that this particular direction curves up, and this one curves up, then the probability that all of them curve up becomes exponentially small. We tested that experimentally. What you see on the bottom left is a curve showing the training error as a function of what's called the index of the critical point, which is just the fraction of directions that are curving down; so 0% means it's a local minimum, 100% means it's a local maximum, and anything in between is a saddle point. What we find is that as training progresses, we go close to a bunch of saddle points, and none of them are local minima, otherwise we would be stuck; in fact we never encounter local minima until we reach the lowest cost we were able to get. In addition, there is theory suggesting that the local minima will actually be close in cost to the global minimum: they will be above it, concentrated in a little band above the global minimum, and the larger the dimension, the more this is true. To go back to my analogy: at some point, of course, you will get local minima; even though it's unlikely when you're in the middle, when you get close to the bottom you can't go lower, so the function has to rise up in all directions. So that's kind of good news, I think.
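The "all directions must curve up" argument can be sketched with random matrices. This is only a heuristic illustration (a generic random symmetric Hessian, not the precise model analyzed in the papers): estimate the probability that a random critical point is a true local minimum, i.e. that every eigenvalue of its Hessian is positive, and watch it collapse as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_positive_definite(dim, trials=2000):
    """Fraction of random symmetric 'Hessians' with all eigenvalues > 0."""
    hits = 0
    for _ in range(trials):
        a = rng.normal(size=(dim, dim))
        h = (a + a.T) / 2                    # random symmetric matrix
        if np.linalg.eigvalsh(h)[0] > 0:     # smallest eigenvalue positive?
            hits += 1
    return hits / trials

for d in (1, 2, 5, 10):
    print(d, frac_positive_definite(d))
```

In one dimension roughly half of the critical points are minima; by ten dimensions essentially none are, so almost every critical point is a saddle, which is the talk's point.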
In spite of this, I don't think the optimization problem of neural nets is solved. There are still many cases where we find ourselves stuck, and we still don't understand what the landscape looks like. There's a set of beautiful experiments by Ian Goodfellow that helps us visualize a bit of what's going on, but I think one of the open problems of optimization for neural nets is: what does the landscape actually look like? It's hard to visualize, of course, because it's very high-dimensional, but for example we don't know what those saddle points really look like. When we actually measure the gradient as we're approaching those saddle points, it's not close to zero, so we never get to actually flat places. This may be due to the fact that we're using SGD and it's kind of hovering above things. There might be conditioning issues: even if you are near a saddle point, you might be stuck even though it's not a local minimum, because in many directions, maybe 95% of them, it's still going up, and the other directions are hard to reach simply because there's a lot more curvature in some directions than in others; that's the traditional ill-conditioning problem. We don't know exactly what's making it hard to train some networks. Usually convnets are pretty easy to train, but when you go into things like machine translation, or even worse, reasoning tasks, with things like neural Turing machines, it gets really, really hard to train these things, and people have to use all kinds of tricks, like curriculum learning, which are essentially optimization tricks to make the optimization easier. So I don't want to tell you that the optimization problem of neural nets is easy and done and we don't need to worry about it, but it's much easier and less of a concern than what people thought in the 90s. Okay.
So machine learning, I mean deep learning, is moving out of pattern recognition and into more complicated tasks, for example including reasoning, and combining deep learning with reinforcement learning, planning, and things like that. You've heard about attention; that's one of the tools that is really, really useful for many of these tasks. We've come up with attention mechanisms not just as a way to focus on what's going on in the outside world (we usually think of attention as attention in the visual space) but as internal attention, in the space of the representations that have been built. That's what we do in machine translation, and it's been extremely successful, as Quoc said, so I'm not going to show you any of those pictures.
Now I'm getting more into the domain of challenges. A challenge I've been working on since I was a baby researcher, as a PhD student, is long-term dependencies in recurrent nets, and although we've made a lot of progress, this is still something we haven't completely cracked. It's connected to the optimization problem I told you about before, but it's a very particular kind of optimization problem. Some of the ideas we've used to try to make the propagation of information and gradients easier include using skip connections over time and using multiple time scales; there's some recent work in this direction from my lab and other groups. Even the attention mechanism itself can be thought of as a way to help deal with long-term dependencies. The way to see this is to think of the places on which we're putting attention as part of the state. So imagine you have a recurrent net with two kinds of state: it has the usual recurrent-net state, but it also has the content of the memory (Quoc told you about memory networks and neural Turing machines), and the full state really includes all of these things. Now the little recurrent net is able to read from or write to that memory. What happens is that there are memory elements which don't change over time: maybe they're written once, and the information stored there can stay for as much time as needed, as long as it's not overwritten. That means that if you consider the gradients back-propagated through those cells, they can go through pretty much unhampered, and there's no vanishing gradient problem. So this view of the problem of long-term dependencies, with memory, I think could be very useful. All right.
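The gradient argument above can be sketched in a few lines. This is a simplified scalar caricature of my own: in a plain recurrence h_t = w * h_{t-1}, the gradient of h_T with respect to h_0 is w to the power T, which vanishes when |w| < 1 (or explodes when |w| > 1); a memory slot that is written once and never overwritten carries a factor of 1 per step, so the gradient passes through unchanged.

```python
# Toy model of back-propagation through time for h_t = w * h_{t-1}:
# d h_T / d h_0 is the product of T identical Jacobian factors w.

def rnn_gradient(w, steps):
    """Gradient of h_T with respect to h_0 for the scalar recurrence."""
    g = 1.0
    for _ in range(steps):
        g *= w
    return g

print(rnn_gradient(0.9, 100))   # |w| < 1: the gradient vanishes toward 0
print(rnn_gradient(1.0, 100))   # memory-like carry (factor 1): preserved
```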
in the last part of my presentation I
want to tell you about what I think is
the biggest challenge ahead of us which
is unsupervised learning any question
about attention and memory before I move
on to and provides learning ok so why do
we care about unsupervised learning it's
not working well actually it's working a
lot better than it was but it's still
not something you find in industrial
products at least not in an obvious way
there are less obvious ways where
unsupervised learning is actually
already extremely successful so for
example when you train word embeddings
with word to Veck or any other model and
you use that to pre train like we did
our machine translation systems or other
kinds of NLP tasks you're you're
exploiting as provides learning even
when you train a language model that
you're going to stick in some other
thing or pre train something with that
you're also doing unsupervised learning
But I think the potential and the importance of unsupervised learning are usually underrated. So why do we care? First of all, the idea of unsupervised learning is that we can learn something from large quantities of unlabeled data that humans have not curated, and we have lots of that. Humans are very good at learning from unlabeled data. I have an example that I use often, and that makes it very clear: children can learn all kinds of things about the world even though no adult ever tells them anything about it until much later, when it's too late.
Physics, for instance. A two- or three-year-old understands physics: if she has a ball, she knows what's going to happen when she drops the ball; she knows how liquids behave; she knows all kinds of things about objects and ordinary Newtonian physics, even though she doesn't have explicit equations or a way to describe them with words. But she can predict what's going to happen next, right? And parents don't tell their children that force equals mass times acceleration. So this is purely unsupervised, and it's very powerful. We don't even have that right now: we don't have computers that can understand the kinds of physics that children can understand. So it looks like this is a skill that humans have, one that's very important for making sense of the world around us, but that we haven't really yet succeeded in putting into machines.
Let me tell you about other reasons, connected to this, why unsupervised learning could be useful. When you do supervised learning, essentially the way you train your system is that you focus on a particular task: here are the input variables, and here is an output variable that I would like you to predict given the input; you're learning P(Y|X). But if you're doing unsupervised learning, you're essentially learning about all the possible questions that could be asked about the data you observe. It's not that there's X1, X2, X3 and Y; everything is an X, and you can predict any of the X's given any of the other X's. If I give you a picture and I hide a part of it, you can guess what's missing. If I hide the caption, you can generate the caption given the image. If I hide the image and give you the caption, you can guess what the image would be, or draw it, or figure out from examples which one is the most appropriate. So you can answer any question about the data once you have captured the joint distribution between them, essentially. So that could be useful.
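A toy way to see "capture the joint distribution, then answer any question" is to fit a single multivariate Gaussian to some made-up correlated variables and use the standard conditional-Gaussian formula to predict whichever subset happens to be hidden. This is only a sketch; real data would need a far richer model of the joint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "world": three correlated variables; think of them all as X's.
n = 5000
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + 0.1 * rng.normal(size=n)
x3 = -x1 + x2 + 0.1 * rng.normal(size=n)
X = np.stack([x1, x2, x3], axis=1)

# "Unsupervised learning" here = estimating the joint distribution
# (a single multivariate Gaussian, for simplicity).
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

def predict_missing(observed, obs_idx, miss_idx):
    """Conditional mean of the missing variables given the observed ones."""
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(miss_idx, obs_idx)]
    return mu[miss_idx] + S_mo @ np.linalg.solve(S_oo, observed - mu[obs_idx])

# The same fitted joint answers *any* hide-and-predict question:
x2_hat = predict_missing(np.array([1.0, 1.0]), [0, 2], [1])  # x2 from x1, x3
x1_hat = predict_missing(np.array([2.0, 1.0]), [1, 2], [0])  # x1 from x2, x3
print("predicted x2 given x1=1, x3=1:", x2_hat)
print("predicted x1 given x2=2, x3=1:", x1_hat)
```

Nothing in the fitting stage committed to one particular input/output split; any variable can play the role of Y afterwards.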
Another practical thing unsupervised learning has been used for (in fact, this is how the whole deep learning thing started) is as a regularizer. Because in addition to telling our model that we want to predict Y given X, we're saying: find representations of X that both predict Y and somehow capture something about the distribution of X, the leading factors, the explanatory factors of X. This, again, is making an assumption about the data, so we can use it as a regularizer if the assumption is valid. Essentially, the assumption is that the factor Y we're trying to predict is one of the factors that explain X, and that by doing unsupervised learning to discover the factors that explain X, we're going to pick up Y among those factors, and so it's going to be much easier to do supervised learning. Of course, this is also the reason why transfer learning works: there are underlying factors that explain the inputs for a bunch of tasks, and maybe one subset of those factors is relevant for one task and another subset is relevant for another task, but if these factors overlap, then there's a potential for synergy by doing multi-task learning. So the reason multi-task learning works is the same reason unsupervised learning works: there are representations and factors that explain the data that can be useful for our supervised learning tasks of interest. The same representations could also be used for domain adaptation, for the same reason. The other thing that people
don't talk about as much with unsupervised learning, and which I think was part of the initial success we had with stacking autoencoders and RBMs, is that you can actually make the optimization problem of training deep nets easier. If you train a bunch of RBMs or a bunch of autoencoders (and I'm not saying this is the right way of doing it, but it captures some of the spirit of what unsupervised learning does), a lot of the learning can be done locally: you're trying to extract some information, you're trying to discover some dependencies, and that's a local thing. Once you have a slightly better representation, you can again tweak it to extract better, more independent factors, or something like that. So there's a sense in which the optimization problem might be easier if you have a very deep net.
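A minimal sketch of the greedy layer-wise idea (tied-weight sigmoid autoencoders with made-up sizes and learning rate; not how you would do it today): each layer is trained on a purely local reconstruction criterion, then its representation becomes the input of the next layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, hidden, steps=200, lr=0.05):
    """One tied-weight sigmoid autoencoder trained on X alone:
    a local, unsupervised criterion (reconstruct your own input)."""
    W = 0.1 * rng.normal(size=(X.shape[1], hidden))
    for _ in range(steps):
        H = 1.0 / (1.0 + np.exp(-X @ W))   # encode
        R = H @ W.T                        # decode (tied weights)
        E = R - X                          # reconstruction error
        dH = E @ W * H * (1.0 - H)         # backprop through the encoder
        gW = X.T @ dH + E.T @ H            # gradient w.r.t. the tied W
        W -= lr * gW / len(X)
    return W

X = rng.normal(size=(500, 20))

# Greedy layer-wise pretraining: each layer solves its own local problem,
# then passes its representation up to the next layer.
reps, layers = X, []
for hidden in (16, 8):
    W = train_autoencoder(reps, hidden)
    layers.append(W)
    reps = 1.0 / (1.0 + np.exp(-reps @ W))

print("layer shapes:", [W.shape for W in layers])
```

The resulting stack of weights would then typically be fine-tuned end to end on the supervised task; the point here is only that each layer's training signal is local.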
Another reason we should care about unsupervised learning, even if our ultimate goal is supervised learning, is that sometimes the output variables are complicated: they are compositional, they have a joint distribution. In machine translation, which we talked about, the output is a sentence, and a sentence is a group of words that have a complicated joint distribution given the input in the other language. It turns out that many of the things we discover by exploring unsupervised learning, which is essentially about capturing joint distributions, can often be used to deal with these structured output problems, where you have many outputs that form a compositional, complicated distribution. There's another reason why I think unsupervised learning is going to be really necessary for AI: model-based reinforcement learning. I think I have another slide just for this. Let's think about self-driving cars, a very popular topic these days. How did I learn that I shouldn't do certain things with the wheel that would kill me when I'm driving? Because I haven't experienced those states where I get killed, and I certainly haven't done it a thousand times in order to learn how to avoid it.
So supervised learning, or rather traditional reinforcement learning (policy learning, actor-critic, things like that) won't work, because I need to generalize to situations that I'm never going to encounter; if I did encounter them, I would die. These are dangerous states, and I need to generalize about them, but I can't have enough data for them. And I'm sure there are lots of machine learning applications where we would be in that situation. I remember, a couple of decades ago, I got some data from a nuclear plant; they wanted to predict when it was going to blow up, in order to avoid it. So I asked: how many examples of it blowing up do you have? Zero, right? So you see, sometimes it's hard to do supervised learning because the data you would like to have you simply can't have; it corresponds to situations that are very rare, or worse. So how can we possibly solve this problem? The only solution I can see is that we learn enough about the world to predict how things would unfold. When I'm driving, I have a kind of mental model of physics and of how cars behave, so I can figure out that if I turn right at this point I'm going to end up in the wall, and that's going to be very bad for me; and I don't need to actually experience it to know that it's bad, because I can make a mental simulation of what would happen. So I need a kind of generative model of how the world would unfold if I took such and such actions, and unsupervised learning is sort of the ideal tool for that. Of course it's going to be hard, because we're going to have to train models that capture a lot of aspects of the world in order to generalize properly in those situations, even though they never see any data from them.
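Here is a toy sketch of that argument: a hypothetical 1-D "car" whose dynamics happen to be linear (so a simple regression model can extrapolate; in general that extrapolation is exactly the hard part), a dataset containing only safe, gentle actions, and a learned model rolled forward in imagination to reject a dangerous action it has never experienced.

```python
import numpy as np

rng = np.random.default_rng(1)

# True (unknown) dynamics of a toy 1-D "car": position, velocity, steering a.
def step(pos, vel, a):
    vel = vel + a
    return pos + vel, vel

# Collect only *safe* experience: small random actions, far from the wall.
data = []
pos, vel = 0.0, 0.0
for _ in range(2000):
    a = rng.uniform(-0.1, 0.1)
    npos, nvel = step(pos, vel, a)
    data.append((pos, vel, a, npos, nvel))
    pos, vel = (0.0, 0.0) if abs(npos) > 3 else (npos, nvel)

# World-model learning: regress the next state on (state, action).
D = np.array(data)
inp, out = D[:, :3], D[:, 3:]
A, *_ = np.linalg.lstsq(inp, out, rcond=None)

def imagine(pos, vel, a, horizon=20):
    """Mental simulation: roll the learned model forward without acting."""
    for _ in range(horizon):
        pos, vel = np.array([pos, vel, a]) @ A
    return pos

WALL = 10.0
# Evaluate two candidate actions purely in imagination:
safe_final = imagine(0.0, 0.0, 0.0)
crash_final = imagine(0.0, 0.0, 1.0)  # hard turn, never seen in the data
print("imagined position, gentle action:", safe_final)
print("imagined position, hard action:  ", crash_final)
print("hard action predicted to hit wall:", crash_final > WALL)
```

The model never observed a crash, yet its mental simulation predicts one, so the dangerous action can be avoided without ever experiencing the dangerous state.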
So that's one reason why I think reinforcement learning needs to be worked on more. I have a little thing here: I think people who have been doing deep learning can collaborate with people who are doing reinforcement learning, and not just by providing a black box that they can use in their usual algorithms. I think there are things we do in supervised or unsupervised deep learning that can be useful in rethinking reinforcement learning. One thing I really like to think about is credit assignment; in other words, how do different machine learning algorithms figure out what the hidden units are supposed to do, what the intermediate computations or the intermediate actions should be? That's what credit assignment is about, and backprop is the best recipe we currently have for doing it: it tells the parameters of some intermediate computation how to change so that a cost much later (a hundred steps later, if it's a recurrent net) is reduced. So we could probably take some inspiration from backprop and how it's used to improve
reinforcement learning. One such cue is this: when we do supervised backprop, we don't predict the expected loss and then try to minimize it, where the expectation would be over the different realizations of the correct class. That's not what we do, but it is what people do in RL: they learn a critic, or a Q function, which is the expected value of the future reward, or of the future loss; in our case that might be the minus log-probability of the correct answer given the input. They then backprop through this, or use it to estimate the gradient on the actions. Instead, when we do supervised learning, we do credit assignment using the particular observation of the correct class that actually happened for this X: we have X, we have Y, and we use that Y to figure out how to change our prediction or action. So it looks like this is something that should be done for RL too, and in fact we have a paper on something like this for sequence prediction; it's the kind of work that sits at the intersection of structured outputs, reinforcement learning and supervised learning. So I think there's a lot of potential benefit in changing the frame of thinking that people in RL have had for many decades. What I mean is that people in RL have not been looking at the world with the same eyes as people doing neural nets: they've been thinking in terms of discrete states that could be enumerated, and proving theorems about algorithms that depend, essentially, on collecting enough data to fill in all the possible configurations of the state and their corresponding effects on the reward. When you start thinking in terms of neural nets and deep learning, the way you approach problems is very, very different. OK, let me continue
about unsupervised learning and why it is so important. If you look at the kinds of mistakes that our current machine learning algorithms make, you find that our neural nets are cheating: they're using the wrong cues to try to produce the answers, and sometimes it works, sometimes it doesn't. So how can we make our models smarter, make fewer mistakes? The only solution is to make sure those models really understand how the world works, at least at the level of humans, in order to get human-level accuracy, human-level performance. It may not be necessary for a particular problem you're trying to solve; maybe we can get away with doing speech recognition without really understanding the meaning of the words, and probably that's going to be OK. But for other tasks, especially those involving language, I think having models that actually understand how the world ticks is going to be very, very important.
So how could we have machines that understand how the world works? One of the ideas I've been talking a lot about in the last decade is that of disentangling factors of variation. This is related to a very old idea in pattern recognition and computer vision called invariance. The idea of invariance was that we would like to compute (initially design, and now learn) features, say of the image, that are invariant to the things we don't care about. Maybe we want to do object recognition, and we don't care about position or orientation, so we would like features that are translation invariant, rotation invariant, scaling invariant, whatever. That's what invariance is about. But when you're in the business of doing unsupervised learning, of trying to figure out how the world works, it's not good enough to extract invariant features. What we actually want to do is extract all of the factors that explain the data. If we're doing speech recognition, we want not only to extract the phonemes; we also want to figure out what kind of voice it is, maybe who it is, what the recording conditions are, what kind of microphone it is, whether it's in a car or outside. All that information, which you normally try to get rid of, you actually want to learn about, so that you'll be able to generalize even to new tasks; maybe the next day I'm not going to ask you to recognize phonemes but to recognize who's speaking. More generally, if we're able to disentangle the factors that explain how the data varies, everything becomes easy, especially if those factors can be generated in an independent way so as to generate the data. For example, we can learn to answer a question that only depends on one or two factors, and basically eliminate all the other ones, because we've separated them. So a lot of things become much easier. That's one notion: we can disentangle factors.
There's another notion, which is the notion of multiple levels of abstraction, and this is of course at the heart of what we're trying to do with deep learning. The idea is that we can have representations of the world, representations of the data, as descriptions that involve factors or features, and we can do that at multiple levels, some levels being more abstract than others. If I'm looking at a document, there's the level of the pixels, the level of the strokes, the level of the characters, the level of the words, and maybe the level of the meaning of individual words; and we actually have systems that will recognize all of these levels from a scanned document. When we go higher up, we're not sure what the right levels are, but clearly there must be representations of the meaning not just of single words but of sequences of words, of the whole paragraph, of what the story is. Why is it important to represent things in that way? Because higher levels of abstraction are representations from which it is much easier to do things, to answer questions. The more semantic levels basically mean that we can very easily act on the information when it's represented that way. If you think about the level of words, it's much easier to check whether a particular word is in the document if I have the words extracted than if I have to do it from the pixels. And if I have to answer a complicated question about, say, the intention of the person, working at the level of words is not high enough; it's not abstract enough. I need to work at a more abstract level, at which maybe the same notion could be represented with many different choices of words, where many different sentences could express the same meaning, and I want to be able to
capture that meaning. So the last slide I have is something I've been working on in the last couple of years, which is connected to unsupervised learning but more generally to the relationship between how we can build intelligent machines and the intelligence of humans or animals. As you may know, this was one of the key motivations for doing neural nets in the first place. The intuition is this: we are hoping that there are a few simple key principles that explain what allows us to be intelligent, and that if we can discover these principles, we can also build machines that are intelligent. That's why neural nets were inspired by things we know from the brain in the first place. We don't know whether this is true, but if it is, it's great, and it would make it much easier to understand how brains work as well as to build AI. So I've been trying to bridge this gap, because right now our best neural nets are very, very different from what's going on in brains, as far as we can tell by talking to neuroscientists. In particular backprop, although it's kicking ass from a machine learning point of view, is not at all obviously something that could be implemented in brains. So I've been trying to explore that, and also trying to see how we could generalize those credit assignment principles in order to also do unsupervised learning.
We've made a little bit of progress. A couple of years ago I came up with an idea called target prop, which is a way of generalizing backprop to propagating targets for each layer; of course, this idea has a long history. More recently we've been looking at ways to implement gradient estimation in deep recurrent networks that perform some computation, and it turns out you end up with parameter updates corresponding to gradient descent on the prediction error that look like something neuroscientists have been observing and don't completely understand, called STDP, spike-timing-dependent plasticity. I don't really have time to go into this, but I think this whole area of reconnecting neuroscience with machine learning and neural nets is something that has been kind of forgotten by the machine learning community, because we're all so busy building self-driving cars. Over the long term, though, I think it's a very exciting prospect. Thank you very much.
yes questions yeah
to begin with great talk my question is
regarding you know the lack of interlab
between the results in the study of
complex networks like when they study
the brain networks right there lot of
publications which that talk about the
emergence of hubs and especially a lot
of publications on the degree
distribution of the inter neuron Network
right but then when you look at the
degree distribution of the so-called
neurons in deep Nets you don't get to
see the emergence of 