Good morning everyone. I'm going to talk about some of the core methods in deep reinforcement learning. The aim of this talk is as follows: first I'll give a brief introduction to what deep RL is and whether it might make sense to apply it to your problem. Then I'll cover some of the core techniques: on the one hand we have policy gradient methods, and on the other hand we have methods that learn a Q-function, including Q-learning and SARSA. At the end I'll talk a little about the pros and cons of these different methods.

First, what is reinforcement learning? It's a branch of machine learning concerned with taking sequences of actions. It's often described in terms of an agent interacting with a previously unknown environment, trying to maximize some kind of cumulative reward: a reward function that we've defined, accumulated over time. Pretty much any task where you have some goal you want to achieve can be stated in these terms, so this is an extremely general formulation. What's deep reinforcement learning? It's pretty simple: it's just reinforcement learning where you use neural networks as function approximators. The interesting thing about reinforcement learning, in contrast to supervised learning, is that it's actually not obvious what you should use your neural network to approximate, and different kinds of algorithms approximate different things. One choice is to use the neural network to approximate your policy, which is how the agent chooses its actions. Another choice is to approximate the value functions, which measure how good or bad different states or actions are. Lastly, you can try to learn a model of the system, a dynamics model, which makes predictions about next states and rewards.

I'll now give a few examples of different places where you might apply reinforcement learning, and what the observations and actions would be. One example is robotics. Here you could imagine a robot where the observations are the camera images and the joint angles, and the actions are the joint torques you're applying. The reward depends on what you want the robot to do; it's something we, as the algorithm designers, get to define. The reward could be to stay balanced, to navigate to some target location, or something more abstract like "serve and protect humans."

Reinforcement learning has also been used in a lot of more practical applications, or at least applications that have been practical in the past; I think robotics will be very practical in the future. One such area is inventory management, which is one example of using reinforcement learning for a decision-making problem. You have to decide how much of every item to stock up on: your observations are your current inventory levels, your actions are how much of each item you're going to purchase, and your reward is your profit. People in operations research, the subfield that studies this kind of problem, have looked at it a lot.

There are also a lot of machine learning problems where people have started to apply reinforcement learning techniques. One example is attention. The idea in attention is that you don't want to look at the whole input at once; you want to focus on just part of it. For example, with a large image you might want to crop out part of it and do detection on just that part. Here your observation is your current image window, your action is where to look, that is, where to crop the image, and your reward is whether you make a classification error or not. So the actions have to choose the right area of the image to look at, so that you make the correct classification.

Reinforcement learning has also been used in structured prediction problems, which in the past weren't considered reinforcement learning problems, but it turns out that to solve them properly you actually do face a reinforcement learning problem. Take machine translation: you get a sentence in the source language and you have to emit a sentence in the target language. Your observations are the source sentence, you emit one word at a time in the target language, and you have some reward function that looks at the whole output sentence and tells you how good your translation was. Since this score is non-differentiable, you can't just differentiate through the whole thing and do gradient descent; it turns out you can use a policy gradient method to optimize your translation system, and people have started to do that.

Okay, those are just a few examples, not exhaustive at all. Next I want to say a little about how reinforcement learning fits into the picture of all the other types of machine learning problems. Previous courses in this series have talked about supervised learning and unsupervised learning, so how does reinforcement learning relate to them, and how is it different? Let's first compare it to supervised learning. In supervised learning, the environment samples an input-output pair from some distribution rho. The agent makes a prediction y-hat using its function f, and it receives some loss, which tells it whether it made the right prediction or the wrong prediction.
The interpretation is that the environment asks the agent a question and then tells it the right answer.

Contextual bandits make this problem a little harder by giving the learning agent a bit less information. Now the environment samples an input, but notice there's no correct output associated with it. The agent takes an action, and it receives some cost c sampled from a probability distribution, and the agent doesn't know what this distribution is; that's what makes the problem hard. So the environment asks the agent a question, the agent answers, and the environment gives it a noisy score on the answer. This setting has a lot of applications: personalized recommendations is a big one, along with advertising. You have a customer and you know what they liked in the past, so you have to predict what they're going to like in the future and show them appropriate ads or links: which ad or book you want to advertise to them, which video you want to show them, and so on. The big difference between this and the supervised learning setting is that you don't have analytic access to the loss function you're trying to optimize, so in particular you can't differentiate through it. We don't know the process that generates c, so we can't compute the gradient of the loss function and use it to tune the agent's parameters. That makes the problem harder, or at least it means you have to use a different kind of algorithm.

Lastly, reinforcement learning is almost the same as the contextual bandit setting, except now the environment is stateful. Instead of sampling the input from scratch from the same distribution at every time step, the state evolves over time: there's a transition probability distribution, called P here, where the state x_t is conditioned on the previous state and the previous action. That makes the problem quite a bit harder, for a number of reasons. For one thing, the inputs you get depend on the actions you take, which makes it harder to develop a stable, reliable algorithm: as the agent starts to learn, it gets different inputs, and that can lead to all sorts of out-of-control behavior. It also means you have delayed effects: since the system is stateful, you might need to take a lot of actions to get into the right state, so you can't just act greedily at every time step; you have to think ahead, effectively.

To summarize, there are two differences. First, you don't have full analytic access to the function you're trying to optimize; you have to query it through interaction. Second, you're interacting with a stateful world, which means the input you get depends on your previous actions. If you take just the first of those differences between supervised learning and reinforcement learning, you get the contextual bandit setting, so that's sort of halfway in between.

Okay, I realize this audience probably has people with different interests: some of you are doing research and want to know the latest in the research world, and some of you want to apply these machine learning techniques to practical applications. This next part is for the latter group. If you have some problem where you think reinforcement learning might be relevant and you're wondering whether you should apply it, I should first say that the answer might be no: it might be overkill.
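Before going further, here's a minimal runnable sketch of the contextual bandit loop described above: the environment samples a context, the agent picks an action, and the agent only ever sees a noisy sampled cost, never the underlying cost distribution. The epsilon-greedy strategy, the cost table, and all the numbers are my own illustrative choices, not anything from the talk.

```python
import random

def run_bandit(n_steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a made-up 2-context, 3-action bandit.

    The true mean costs below are invented for illustration; the agent
    never sees them directly, it only observes noisy sampled costs.
    """
    rng = random.Random(seed)
    true_cost = {0: [0.9, 0.2, 0.5],   # context 0: action 1 is cheapest
                 1: [0.3, 0.8, 0.1]}   # context 1: action 2 is cheapest
    # Running estimates of E[cost | context, action] and visit counts.
    est = {x: [0.0, 0.0, 0.0] for x in true_cost}
    cnt = {x: [0, 0, 0] for x in true_cost}
    for _ in range(n_steps):
        x = rng.randrange(2)                       # environment samples a context
        if rng.random() < epsilon:                 # explore
            a = rng.randrange(3)
        else:                                      # exploit current estimates
            a = min(range(3), key=lambda i: est[x][i])
        c = true_cost[x][a] + rng.gauss(0.0, 0.1)  # noisy score; distribution unknown to agent
        cnt[x][a] += 1
        est[x][a] += (c - est[x][a]) / cnt[x][a]   # incremental mean update
    return est

est = run_bandit()
```

Note that the agent tunes itself purely from queried cost samples; at no point does it differentiate through the cost-generating process, because it can't.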
This is especially true of deep reinforcement learning: it's a set of fairly new techniques that won't work well out of the box. These techniques aren't that well established yet and they have a lot of knobs to tune, so it might be worth investigating some other methods first.

First, if your problem has a small number of parameters you're trying to optimize over, and you have a simulator you can run lots of experiments on, then derivative-free optimization methods are likely to be better than reinforcement learning, or at least easier to get working. You give these methods a black-box function: you put in a parameter vector and get out a noisy estimate of the score, and the algorithm optimizes over the parameters going into that black box. There's a variety of derivative-free optimization methods; they're easier to understand than reinforcement learning, and they do pretty much work out of the box.

Second, a lot of problems can be seen as contextual bandit problems, where the statefulness of the world isn't that relevant. For example, people treat advertising as a contextual bandit problem most of the time: you decide which ad to present to the user, and they either click on it or they don't. Really the user is somewhat stateful, because if you show them a terrible ad they might go and download an ad blocker, so your actions do have repercussions; but often you can approximate it as a contextual bandit problem where there is no state. There's a better theoretical understanding of contextual bandit problems, and there are methods with some guarantees, so if your problem is a contextual bandit problem, you might want to use those kinds of algorithms instead.

Lastly, the operations research field has been using these methods on real problems for a while. Their methods are pretty much the basic algorithms, policy iteration and value iteration, but with well-developed ways of doing feature engineering for these problems that end up working pretty decently. Those techniques are also worth considering before trying to throw a big neural network at your problem.

Okay, now that I've talked about why not to use deep reinforcement learning, or what it's not good for, let me mention some recent success stories in deep reinforcement learning: achievements that probably wouldn't have been possible with these other techniques. A few years ago there was a pretty influential result by Mnih et al. from DeepMind, where they used a deep Q-learning algorithm to play Atari games using the screen images as input. That's hard because you're trying to do different things in all these games, and some of them are fairly complicated, so it's pretty remarkable that the same simple algorithm can solve them all. Since then, people have also solved this domain using policy gradients and another algorithm called DAgger. Another groundbreaking result, also by DeepMind, was beating a champion-level player at Go, using a combination of supervised learning from expert games, plus policy gradients to fine-tune the supervised policy, plus Monte Carlo tree search, plus value functions to make the search work better; so, a combination of techniques. In robotics, some of my colleagues at Berkeley had some very nice results learning manipulation tasks in real time, using an algorithm called guided policy search on the PR2 robot. Some of my colleagues and I have been working on robotic locomotion using policy gradient methods. People have worked on locomotion for a while and achieved pretty good results using highly engineered, domain-specific methods, but previously there hadn't been much success solving it with general methods. And last, there have been some recent results playing 3D games using policy gradients; in fact, there was a contest I heard about a couple of days ago with the new ViZDoom task, which is pretty nice, so you might want to check out ViZDoom.

Okay, that's it for the high-level overview part of this talk; now I'm going to get into the actual formalism and the technical details. The basic object in the field of reinforcement learning is the Markov decision process (MDP). An MDP is defined by the following components: a state space, all the different states of the system; an action space, all the actions the agent can take; and a probability distribution P(r, s' | s, a), which determines the probability of the next state s' and reward r given the current state s and action a. It's a conditional probability distribution; sometimes people split out a separate reward function, but that's basically an equivalent formulation. Sometimes there are extra objects defined as well: we'll consider an initial state distribution, so the world starts out in a state sampled from that distribution. The typical optimization problem you want to solve, given this MDP, is to maximize expected cumulative reward, though there are various ways of defining that more precisely, which I'll go into later.
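To make the MDP components concrete, here's a tiny episodic MDP sketched in code. The chain environment, its rewards, and the two policies are all invented purely for illustration.

```python
import random

def rollout(policy, rng, max_steps=100):
    """Sample one episode from a toy episodic MDP.

    States 0..3 on a chain; the episode terminates when the agent reaches
    state 3. Action 1 moves right, action 0 stays put; every step costs
    reward -1, so shorter episodes collect more total reward.
    """
    s = 0                      # initial state distribution: always start at 0
    rewards = []
    for _ in range(max_steps):
        a = policy(s)
        # transition distribution P(s', r | s, a): deterministic here
        s_next = min(s + 1, 3) if a == 1 else s
        rewards.append(-1.0)
        s = s_next
        if s == 3:             # terminal state ends the episode
            break
    return sum(rewards)

rng = random.Random(0)
always_right = lambda s: 1                       # a deterministic policy pi(s)
lazy = lambda s: 1 if rng.random() < 0.5 else 0  # a stochastic policy pi(a|s)
ret_right = rollout(always_right, rng)
ret_lazy = rollout(lazy, rng)
```

The deterministic policy finishes in three steps, so it achieves the best possible return here; the coin-flip policy wastes steps and does no better.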
Okay, so there are various different settings of reinforcement learning, each defining a slightly different optimization problem. The one we'll be most concerned with is called the episodic setting. Here the agent's experience is split up into a series of episodes of finite length. In each episode, we first sample the initial state of the world from some probability distribution mu, and then the agent keeps acting until the world ends up in some terminal state.

To give some examples of what terminal states might be like and how an episodic reinforcement learning problem might look: one case is when termination is good and you want to terminate the episode as fast as possible. If we imagine setting up a task with some kind of taxi robot that should get to the destination as fast as possible, then the episode would be one trip, and the robot is trying to terminate the episode as quickly as it can. Another example is a waiter robot with a fixed-length shift, which has to do as well as possible during that shift; there the episode has a fixed length, and the waiter has to maximize, say, tips or customer happiness. And you could imagine another kind of task where termination is bad and you want the episode to last as long as possible; you can view life as an example of that, but also a walking robot that you want to walk as far as possible before it falls over. In this setting it's pretty easy to define the goal: we just want to maximize the expectation of the total reward per episode.

The last object we're going to introduce here is the policy, which is just the function the agent uses to choose its actions; it's denoted by pi. We have deterministic policies, where the action is a deterministic function of the state, written a = pi(s).
We also have stochastic policies, where the policy is a conditional probability distribution, a ~ pi(a | s).

Now let's make the setting of the episodic MDP a bit more precise. First we sample the initial state s_0 from the distribution mu. Then we sample the first action a_0 from the policy, then we sample the next state and reward from the transition probability distribution, and so on, until we reach a terminal state s_T. The quantity we care about is the sum of all these rewards, r_0 + r_1 + ... + r_{T-1}, and we want to maximize its expectation: eta(pi) is defined as the expected total reward of the policy pi. Here's a picture that illustrates exactly the same thing; you can look at it as a graphical model.

Lastly, in the policy gradient section in particular, we're going to be interested in parameterized policies. Here we have a parameter vector theta which specifies exactly what the policy is. For example, the family of policies could be a neural network: you fix a certain architecture, and theta specifies all the weights of that network. This could be a deterministic policy, of course, or a stochastic policy. If you're wondering what a policy would concretely look like, that is, how you use a neural network to represent your policy: you do exactly the same thing you would do if this were a classification or regression problem. The state is your input and the action is your output. If you have a discrete action space, you use a network that outputs a vector of probabilities over the different actions; this is exactly like a classifier. If you have a continuous action space, you have the network output the mean and the diagonal of a covariance matrix of a Gaussian distribution; this is just like doing regression. So you can use the same kinds of architectures you'd use in supervised learning.

Okay, that's it for the formalism of MDPs. Now I'm going to go into policy gradient methods, which are one broad and general class of reinforcement learning methods, and quite effective. To give a brief overview: here capital R means the sum of rewards of the whole episode. Our optimization problem is to maximize the expectation of the total reward given our parameterized policy pi_theta. The intuition for how the algorithm is going to work is that we collect a bunch of trajectories, meaning we run a bunch of episodes using our policy, and then we make the good trajectories more probable. In some of the trajectories the agent was lucky and they were really good; in some the agent was unlucky and they were bad. The ones that were good, meaning there was high reward, are the ones where the agent probably took good actions, so we want to increase the probability of the actions from those trajectories. The most basic version of policy gradient methods just tries to make the good trajectories more probable, without trying to figure out which actions were good and which were bad. Slightly more elaborate methods try to figure out which actions were good and which were bad, and then make the good actions more probable. And lastly there's another class of methods that actually tries to push the actions towards better actions: they differentiate the loss function with respect to the actions and push the actions towards better ones. We're mostly going to talk about the first two here.
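To make the parameterized-policy idea concrete, here's a minimal sketch of the two output types just described: a softmax over logits for a discrete action space, and a Gaussian parameterized by a mean and a log standard deviation for a continuous one. The linear "networks" stand in for real neural networks, and all the sizes and weights are arbitrary.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def discrete_policy(theta, state):
    """Categorical policy for a discrete action space: a linear layer
    (standing in for a neural net) followed by a softmax, exactly like
    the output layer of a classifier. theta is a weight matrix with one
    row of weights per action."""
    logits = [sum(w * s for w, s in zip(row, state)) for row in theta]
    return softmax(logits)

def gaussian_policy(theta_mean, theta_logstd, state, rng):
    """Policy for a continuous action space: the network outputs the mean
    of a Gaussian, with a log-std parameter for the (here 1-dimensional)
    diagonal covariance; like doing regression, then sampling."""
    mean = sum(w * s for w, s in zip(theta_mean, state))
    std = math.exp(theta_logstd)
    return rng.gauss(mean, std)

state = [0.5, -1.0, 2.0]
probs = discrete_policy([[0.1, 0.2, 0.3], [-0.1, 0.0, 0.1]], state)
action = gaussian_policy([1.0, 0.0, 0.5], -1.0, state, random.Random(0))
```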
Oh, there's a question. Yeah, good question: we are maximizing over the policy, we're trying to find the best policy, but here the policy is assumed to be parameterized. There's some parameter vector theta that specifies the policy, and we just want to maximize with respect to theta. Any other questions? Okay.

So there's a very fundamental concept called the score function gradient estimator, which underlies policy gradient methods. To introduce it, we're not going to talk about policies or RL at all; we'll just assume we have some expectation E_x[f(x)], where x is sampled from some parameterized probability distribution p(x | theta), and we want to compute the gradient of this expectation with respect to theta. There's a general formula for this, and the way you derive it is to write the expectation as an integral, move some things around, swap the integral with the derivative, and turn it back into an expectation. What you get at the end is the bottom line: the gradient is the expectation of the function value times grad log probability, E_x[f(x) grad_theta log p(x | theta)]. This is an unbiased estimator of the gradient, meaning that if we get enough samples, it'll converge to the right thing. The way you compute this estimator, meaning the way you get a noisy estimate of the gradient of the expectation, is you just get one sample x from the distribution and multiply f(x) times grad_theta log p(x | theta). The only requirements for using this estimator are that we can analytically compute the probability density and differentiate it with respect to theta. There's another way of deriving it using importance sampling: you write down the importance sampling estimator for the expectation, swap the derivative with the expectation, and you get the same thing.

Okay, so now let me give a little intuition about this estimator. f(x) measures how good the sample x is. g-hat here is our gradient estimator, meaning what we get if we take one sample x_i and compute the expression above; this is our estimate of the gradient. If we move in the direction g-hat, that pushes up the log probability of our sample x_i in proportion to how good it is. If we got a really good function value, we try to push up its log probability a lot, and if it was a bad function value, we don't push it up very much. So it's pretty simple intuition. The really nice thing is that this is valid even if f(x) is discontinuous; or if f(x) is unknown, meaning you don't get to differentiate it, you just get to see function values; or if the sample space is a discrete set, so x doesn't even have to be continuous. This is quite remarkable: you don't need to know exactly what the function is that you're optimizing, you just have to be able to query its values. It means this is a way of differentiating through a system that has non-differentiable pieces. For example, in robotic locomotion, one issue is that you have contacts between the robot's feet and the ground: you make and break contact, and that causes a discontinuous change in the dynamics, which makes it really hard to use smooth optimization techniques to come up with the right behavior. When you use this kind of gradient estimator along with policy gradients, which I'm going to talk about very soon, you can actually optimize such a system even though it has non-differentiable pieces in it.
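Here's a small numerical sketch of the score function estimator on a case where the true gradient is known in closed form, so we can check it. The choice of f(x) = x^2 with x drawn from a unit-variance Gaussian centered at theta is an arbitrary test case of my own.

```python
import random

def score_function_grad(theta, n_samples=100_000, seed=0):
    """Estimate d/dtheta E_{x ~ N(theta, 1)}[f(x)] with the score
    function estimator: average f(x) * grad_theta log p(x | theta).
    For a unit-variance Gaussian, grad_theta log p(x | theta) = x - theta.
    Note we only ever query values of f, never its derivative."""
    rng = random.Random(seed)
    f = lambda x: x * x
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)
        total += f(x) * (x - theta)
    return total / n_samples

theta = 1.5
g_hat = score_function_grad(theta)
# Analytic check: E[x^2] = theta^2 + 1 for x ~ N(theta, 1),
# so the true gradient is 2 * theta.
g_true = 2 * theta
```

With enough samples the noisy estimate lands close to the analytic gradient, which is exactly the unbiasedness claim from the derivation above.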
Okay, here's another little picture of what's going on. We have our function f(x), whose expectation we're trying to maximize, and we have our probability density p(x). We sample a bunch of values from the probability density; those are the blue dots on the x-axis. Then we look at the function values, and we try to push the probability distribution so that the probability goes up at these samples in proportion to the function value. On the right side of the curve, that means we're pushing the probability value up really hard, and on the left side we're pushing it up softly. So what's going to happen is that the probability density will slide to the right, if you can imagine a sort of physical analogy there.

Okay, so that's the score function gradient estimator. This is a general technique that can be used in various machine learning problems; now we're going to apply it to the reinforcement learning setting, and we're going to take our random variable x to be a whole trajectory: state, action, reward, state, action, reward, and so on until the end of the episode. To get the gradient of the expected total reward, all we've got to do is compute the grad log probability of the trajectory times the total reward. Now, the probability of the trajectory sounds like a really unfriendly quantity, because there's a long, complicated process with lots of time steps that generates this trajectory, but we can write out what this probability density is: it's just a product of probabilities. We have mu(s_0), the initial state distribution, and then at every time step we sample the action according to pi and sample the next state and reward according to the dynamics model. Taking the log turns that product into a sum, and here's the cool part: everything that doesn't contain theta drops out. The thing is, there are parts of this probability density P(tau | theta) that we don't have access to: since this is reinforcement learning, we don't assume we know the dynamics model of the system, we just find out about it by sampling, by doing episodes. But since the product turns into a sum, all the pieces we don't know, the log P terms for the dynamics and the log mu term, just drop out, so it doesn't matter. What we get in the end is a sum of log probabilities of actions: grad log pi of action given state. So our formula for the gradient of the expectation is just the expectation over trajectories of the total reward of the trajectory times the gradient of the sum of all the action log probs. The interpretation is that we take our good trajectories and try to increase their probability in proportion to how good they are. You can think of this as being similar to supervised learning, where we treat the trajectories with high rewards as positive examples in our supervised learning problem: we use them to train the policy, basically treating their actions as positive examples.

Okay, now we can improve this formula a little bit. What I just gave is the most basic unbiased estimator of the policy gradient: if we take the expression inside the expectation on the right-hand side and take one sample of it, it has the right mean, so if we get enough samples we'll recover the policy gradient. But we can also write down other formulas that have the same mean but lower variance.
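The trajectory-level formula can be sketched concretely for a tabular softmax policy, where grad log pi(a | s) has the familiar closed form "one-hot of the taken action minus the probability vector". The two-state setup and all numbers here are made up for illustration.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def policy_grad_estimate(theta, trajectory, total_reward):
    """Single-sample policy gradient estimate:
    g_hat = total_reward * sum_t grad_theta log pi(a_t | s_t).
    theta[s] holds the logits for state s; for a softmax policy,
    grad log pi(a | s) = one_hot(a) - probs(s)."""
    grad = {s: [0.0] * len(theta[s]) for s in theta}
    for s, a in trajectory:
        probs = softmax(theta[s])
        for i in range(len(probs)):
            grad[s][i] += total_reward * ((1.0 if i == a else 0.0) - probs[i])
    return grad

theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}   # two states, two actions, uniform policy
traj = [(0, 1), (1, 0)]                  # (state, action) pairs from one episode
g = policy_grad_estimate(theta, traj, total_reward=2.0)
# A small step along g should raise the probability of the actions taken
# in this (high-reward) trajectory, which is the whole point of the formula.
lr = 0.1
new_theta = {s: [t + lr * gv for t, gv in zip(theta[s], g[s])] for s in theta}
```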
So we can come up with better estimators for the policy gradient, and that's actually quite important, because the one from the previous slide is really bad when you have a large number of time steps, meaning it has really high variance. The first thing we can do is use the temporal structure of the problem. By the way, deriving this next bunch of formulas just takes a bunch of really straightforward manipulation where you move around expectations; I'm not going to go through all the math, I'll just say what the formulas are. We can repeat the same argument from the previous slide to derive the gradient estimator for a single reward term: we end up with that reward term times the gradient of the sum of log probs up to that time. Summing over reward terms, we get a new formula where we're no longer multiplying the grad log prob of the whole trajectory by the sum of all the rewards. Look at that bottom formula: now we have a sum over time of the grad log probability of the action at that time, times the sum of future rewards. In the formula from the previous slide we would have had all the rewards in that sum, but now we just have the future rewards, and that makes sense, because an action can't affect the probability of previous rewards; to figure out whether an action is good, we should only be looking at the future rewards. So this is a slightly better formula than the one on the previous slide: it has exactly the same mean, but the expression inside the expectation has lower variance.

We can further reduce the variance by introducing a baseline. We can take any function b that takes in a state and outputs a real number, and subtract it from our sum of future rewards, and this doesn't affect the mean of the estimator at all: we haven't changed the expectation by introducing this baseline, so for any choice of b this gives us an unbiased estimator. By the way, if you're not that familiar with the terminology of estimators, what I'm saying is: we have an expectation on the right-hand side of that formula, and the quantity inside the expectation is what's called the estimator; if we get a bunch of samples, we can estimate the thing on the left-hand side, which is what we care about. So when I say it's an unbiased estimator, that just means the equation is correct: the right-hand side equals the left-hand side. This works for any choice of baseline, and a near-optimal choice is the expected return, the expected sum of future rewards. The interpretation is that if we took an action, we only want to increase its probability if it was a good action. How do we tell whether it was a good action? The sum of rewards after that action should have been better than expected. b(s) is the expected sum of rewards, and we just take the difference between the measured quantity and the expected one.

Okay, that was a pretty key thing for variance reduction, and I'm going to introduce one last variance reduction technique. Actually, all three of these are really important: basically nothing is going to work, except for maybe really small-scale problems, unless you do these things. The last variance reduction technique is to use discounts. The discount factor ignores delayed effects between actions and rewards. What we do here looks kind of like a hack, but there's an explanation for it: instead of taking the sum of future rewards, we take a discounted sum of rewards, adding an exponential decay factor gamma.
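The discounted return-to-go at each step, and the baseline-subtracted advantage built from it, can both be computed with a single backward pass. This small sketch uses made-up rewards and a constant baseline purely for illustration; in practice the baseline would be a fitted function of the state.

```python
def returns_to_go(rewards, gamma):
    """Discounted sum of future rewards from each time step:
    G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    computed in one backward pass over the episode."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def advantages(rewards, baselines, gamma):
    """Advantage estimate per step: discounted return-to-go minus a
    state-dependent baseline b(s_t), i.e. 'how much better than
    expected was what happened after this action'."""
    return [g - b for g, b in zip(returns_to_go(rewards, gamma), baselines)]

rewards = [1.0, 0.0, 2.0]
G = returns_to_go(rewards, gamma=0.5)
A = advantages(rewards, baselines=[1.0, 1.0, 1.0], gamma=0.5)
```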
grad-log-probability by some future reward, we weight it by a quantity that decays with time. People typically use gamma = 0.99 or gamma = 0.95. With gamma = 0.99, after 100 time steps you're reducing the reward's weight by a factor of 1/e, so you're exponentially decaying the effect of the future rewards. The intuition is that an action shouldn't affect rewards really far in the future: the assumption is that the system doesn't have really long-term memory — the effects aren't that far delayed — so you can just ignore the interaction between an action and a reward way in the future. Now, instead of taking the baseline to be the expected sum of future rewards, we use the expected discounted sum, so we're measuring whether the action was better than expected according to the discounted return. There's also a more general class of formulas that look like the one I just wrote. The one at the top of the slide is pretty good — almost as good as anything you're going to do, to within a small constant factor — but the more general class looks like grad-log-probability times some quantity A-hat, which we call the advantage estimate. It has a more precise definition — how much better was this action than the average action taken by the policy — but informally it just means: how much better was the action than expected. This formula makes a lot of sense, because we want to increase the probability of the good actions and decrease the probability of the bad ones, so we should increase each action's probability in proportion to its goodness.
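As a minimal sketch of the quantities just described — the discounted return-to-go and the baseline-subtracted advantage estimate — here is one way to compute them for a single episode (function names are my own, not from the talk):

```python
import numpy as np

def discounted_returns_to_go(rewards, gamma=0.99):
    """G_t = sum_{t' >= t} gamma^(t' - t) * r_{t'}, computed by a backward pass."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantage_estimates(rewards, baseline_values, gamma=0.99):
    """Advantage estimate: discounted return-to-go minus the baseline b(s_t)."""
    return discounted_returns_to_go(rewards, gamma) - np.asarray(baseline_values)
```

The backward pass makes this O(T) instead of the O(T^2) you'd get by summing the future rewards separately at every time step.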
To summarize: I've given you a gradient estimator, an expression you can compute that gives a noisy estimate of the policy gradient. How do you actually turn this into an algorithm? Here's what the algorithm looks like; it's pretty much what you'd expect. You initialize your policy parameters and your baseline function. At each iteration, you execute the current policy to collect a bunch of whole episodes, meaning whole trajectories. For each time step in each trajectory, you compute the return — the sum of discounted rewards following that time step — and the advantage estimate, which is the discounted return minus the baseline. Then you refit the baseline by trying to make the baseline function equal the returns, and you update the policy using a policy gradient estimator. So you're just doing SGD on the policy while updating the baseline as you go along. That's the vanilla policy gradient algorithm. It has been used to obtain some pretty good results, so it's not that bad an algorithm, but there are several directions in which it can be improved. One issue you run into is step sizes. In supervised learning, step sizes aren't that big a deal: maybe you take too big a step, but that's okay, you'll fix it on the next update, and your current classifier doesn't affect what inputs you're getting. Even if your network just thrashes around for a while because you're taking steps that are too big, that's not going to cause lasting problems; you can just anneal the step size.
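The vanilla policy gradient loop described above can be sketched end-to-end on a toy one-step problem (a two-armed bandit of my own invention; the softmax policy, batch size, and learning rate are all illustrative assumptions, not anything from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy problem: a one-step "MDP" (a bandit) with two actions whose mean
# rewards are 0.0 and 1.0; the policy is a softmax over two logits
true_means = np.array([0.0, 1.0])
theta = np.zeros(2)          # policy parameters (logits)
baseline = 0.0               # scalar baseline, refit as a running mean of returns

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for iteration in range(200):
    probs = softmax(theta)
    grad = np.zeros(2)
    returns = []
    for _ in range(32):                      # collect a small batch of episodes
        a = rng.choice(2, p=probs)
        r = true_means[a] + rng.normal(0.0, 0.1)
        returns.append(r)
        glogp = -probs                       # gradient of log softmax prob...
        glogp[a] += 1.0                      # ...is (indicator - probs)
        grad += glogp * (r - baseline)       # advantage = return - baseline
    theta += 0.1 * grad / 32                 # SGD step on the policy
    baseline = 0.9 * baseline + 0.1 * np.mean(returns)  # refit the baseline
```

After training, `softmax(theta)` should put most of its probability on the better action; the baseline keeps the per-sample gradient terms centered, which is exactly the variance reduction discussed above.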
You can start off with a large step size and anneal it down to zero, and that works pretty well. In reinforcement learning, though, if you take too big a step you might wreck your policy. Even if you don't actually change the network that much — you don't lose all your nice features — you might change its behavior a little too much, and now it does something totally different and visits a totally different part of state space. Since in reinforcement learning the system is stateful and your state distribution depends on your policy, this is a genuinely different problem: after that bad step, the next batch of data you get was collected by the bad policy, and now you may never recover, because you've forgotten everything. One way to fix this is to try to stop the policy from taking too big a step. You can look at the KL divergence between the old policy and the new policy — before and after the update — and make sure the distributions aren't too different, so you're not taking too big a step; it's kind of an obvious thing to do. My colleagues and I developed an algorithm called trust region policy optimization, which looks at the action distributions and makes sure the KL divergence isn't too large. This is closely related to earlier natural policy gradient methods, which do something similar but usually without a hard constraint on the KL divergence. Another type of extension of policy gradient methods is to use value functions for more variance reduction: instead of just using them as a baseline, you can use them more aggressively and introduce some bias. I won't go into the details in this talk.
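The KL-divergence check at the heart of the trust-region idea can be sketched like this for discrete action distributions (the `max_kl` threshold is an illustrative assumption; TRPO itself solves a constrained optimization rather than a simple accept/reject test):

```python
import numpy as np

def categorical_kl(p_old, p_new, eps=1e-8):
    """KL(pi_old || pi_new) for categorical action distributions,
    averaged over a batch of states (rows)."""
    p_old = np.asarray(p_old)
    p_new = np.asarray(p_new)
    kl = np.sum(p_old * (np.log(p_old + eps) - np.log(p_new + eps)), axis=-1)
    return kl.mean()

def step_ok(p_old, p_new, max_kl=0.01):
    """Trust-region style check: is the candidate update's action
    distribution close enough to the old policy's?"""
    return categorical_kl(p_old, p_new) <= max_kl
```

In practice you would evaluate the two policies' action distributions on the same batch of visited states and shrink or reject any update that exceeds the threshold.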
These are sometimes called actor-critic methods. There's also another approach, which I briefly touched on in an earlier slide, where instead of just trying to increase the probability of the good actions, you differentiate your loss with respect to the actions. This is like the reparameterization trick used for density modeling in unsupervised learning: you're trying to push the actions towards better actions rather than just increasing the probability of good ones. With both of these extensions you're potentially decreasing your variance a lot, but at the cost of increasing bias, and that makes the algorithms a little harder to understand and to get working: with high variance, if you crank up the amount of data you can always drive the variance down as much as you want, but with bias, no matter how much data you get, the bias remains — and if your gradient is pointing in the wrong direction, you're not going to learn anything. Okay, that's it for the policy gradient section of this talk. I want to show a quick video of some work my colleagues and I did on learning locomotion controllers with policy gradient methods, which I found pretty exciting when I saw it, so hopefully you'll find it interesting. What we've got here is a simulated humanoid robot in a realistic physics simulator called MuJoCo. It has a neural network policy which takes in the joint angles of the robot and a little other kinematic information — joint velocities and the positions of the different links of the robot. So the input is pretty much the raw state of the robot — no clever feature
engineering there. The output is the joint torques, which are set 100 times a second, so we're just mapping from joint angles to joint torques. We define a reward function for moving forward as fast as possible, and the episode ends when the head goes below a certain height, meaning the robot fell over. That's basically the setup; there was a little bit of tweaking of the reward function, but nothing too extensive. You can see that first it just falls forward a lot of times, then slowly it starts to develop a half-decent-looking walk, and eventually it gets it down pretty well. At the very end it could just keep running indefinitely — I think it was actually stable in a strong sense, meaning I could leave it for 15 minutes and it wouldn't fall over; it would just keep going. Here's another robot model that we created without too much thought — we just decided to put a bunch of legs on this thing, so we don't even know how it's supposed to walk — and we gave it to the same algorithm, and it figured out some kind of crazy way to walk. That's the nice thing about reinforcement learning: you don't even need to know what you want it to do. The physics are a little unrealistic here, but in this one we used a model similar to the first demo and just gave it a reward for having its head as high as possible, and it figures out how to get up off the ground. Oh, let's see — I have low battery; does anyone have a charger? Oh, thanks a lot, you're a lifesaver. Okay, any questions about policy gradients before I move on to the next part? Oh yes — the question was whether the system is time-invariant. Yes, it's assumed to be stationary —
right, and also that it doesn't change from one episode to the next. Of course, in some real-world problems that might not be the case; I think a non-stationary environment is also an interesting problem setting. The next question was: to learn a good baseline, do you need to know the dynamics of the system? No — you can learn it by regression: you estimate the empirical returns and then fit a function to them. The next question: the discount factor should cause the policy to disregard any effects delayed by more than about 100 time steps, so how does this robot still learn to stand up, even though that might take more than 100 time steps? You're right, and in fact I'd say these methods aren't guaranteed to work well if you have more than about 100 time steps. Sometimes — often — they work anyway, but there's no guarantee. I think something pretty fundamental is missing in how we deal with really long time scales. People have recently been thinking about hierarchical reinforcement learning, where you have different levels of description of the system — one level with a small time step, and then successively larger time steps — which allows you to plan over much longer horizons. That's currently an active area of research, but I'd say none of these methods are guaranteed to do anything reasonable if you have more than 1/(1 − gamma) time steps between action and reward. Oh yes — in this kind of task, if you introduced terrain, could it cope? I think if you didn't train it to deal with terrain, then it might fail —
it probably would fail. Actually, I'm not sure it would fail, because the funny thing is that these policies are really robust: you train them with a stochastic policy, so there's a lot of noise being generated by the policy itself. In practice the policy is able to deal with the huge noise it introduces, and as a result I've found that if you change the dynamics parameters a little, it can usually still work. But there's no guarantee it will do anything if you give it something you didn't train it for. I think you probably could do the same kind of training with terrain; I didn't have any terrain, so I didn't try it, but that would be nice to try. Okay, I'm going to move on to the next part of the talk — feel free to find me afterwards if you have more questions. Now I'm going to talk about a different type of reinforcement learning algorithm. The previous methods are distinguished by the fact that they explicitly represent a policy — the function that chooses your actions — and try to optimize it with respect to the parameters of the policy. The nice thing about the policy gradient methods we just talked about is that you're optimizing the thing you care about, and you're optimizing it with gradient descent, which makes it kind of easy to understand what's going on: if you're getting a proper gradient estimate and you take small enough steps, then you should be improving. Of course you could still get stuck in a bad local minimum, but at least it's a local minimum, and you can use our understanding of optimization to figure out what's going on. This next class of methods is a little different, because they're not optimizing the policy directly; they're learning something else, called a Q function, which measures
how good state-action pairs are — I'll say that more formally later, but it's just measuring how good the actions are. These methods are able to exactly and efficiently solve MDPs in the setting where you have a finite number of states and actions; they're the preferred methods for exactly solving MDPs in that setting. You can also apply them with continuous states and actions, using expressive function approximators like neural networks, but then it's a little harder to understand what's going on in these methods — when they're going to work and when they're not. Let me define the relevant quantities. The Q function is defined as the expected discounted sum of rewards when we condition on the first state and the first action — s0 = s and a0 = a — while acting under the policy pi. By convention I'm starting at time step zero; I could also have said that we're taking r_t + r_{t+1} + r_{t+2} and so on, but since we're assuming the system is stationary, it's exactly the same, so for ease of notation I'll always use times 0, 1, 2, 3, and so on. The Q function is just telling you how good a state-action pair is under your current policy. The state value function, usually called V, conditions only on the state: it tells you how good that state is, the expected return from that state. Lastly, the advantage function is the difference between the Q function and the state value function: how much better is that action than what the policy would have done. We won't talk much about advantage functions in this section, but this is exactly what the advantage estimator we briefly mentioned in the previous section corresponds to.
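In symbols, the three quantities just defined are (a reconstruction in my own notation, assuming discount factor gamma):

```latex
Q^{\pi}(s, a) = \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r_t \,\Big|\, s_0 = s,\; a_0 = a;\; \pi\Big]
\qquad \text{(state-action value)}

V^{\pi}(s) = \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r_t \,\Big|\, s_0 = s;\; \pi\Big]
\qquad \text{(state value)}

A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
\qquad \text{(advantage)}
```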
Here we're going to consider methods that explicitly store and update the Q function instead of the policy, and that update it using what are called Bellman equations. A Bellman equation, in general, is a consistency equation that should be satisfied by a value function. Here I'm writing down the Bellman equation for Q^pi, and what it says is that the expected sum of rewards equals the first reward plus the expected sum of rewards after the first time step. It's saying something pretty simple: r0 is the first reward, and V^pi(s1) adds up all the rewards after time step zero. In the second equation we write out this relationship purely in terms of the Q function, so we have a consistency equation that the Q function should satisfy. We can slightly generalize this to use k time steps instead of just one: we expand the expected sum of rewards to write out k rewards explicitly, and then cap it off with the value function at the very end, which accounts for all the rewards after that. Okay, so here's the Bellman equation from the previous slide; now I'm going to introduce a very important concept called a Bellman backup. We have an equation that the value function Q^pi should satisfy, but let's assume we don't know Q^pi, and say we have some other Q function. We define a Bellman backup operator that operates on an arbitrary Q function: it maps the Q function to a new Q function, and it's defined by taking the right-hand side of the Bellman equation and plugging in our Q function Q instead of Q^pi. Q^pi is a fixed point of this operator, meaning that if we apply the backup operator to it, we get the same thing
back. And, very nicely, if we keep applying this backup operator repeatedly to any arbitrary initial Q function, the sequence converges to Q^pi, which is the fixed point of the operator. So that gives you an iterative algorithm to estimate Q^pi: take any initial Q function and repeatedly apply the backup operator. Now there's another kind of Q function to introduce, called Q*. The previous Q function, Q^pi, tells you the values under some particular fixed policy pi, so it only makes sense with regard to that policy; Q* instead involves the optimal policy. Q* is simply defined as the Q function of the optimal policy pi*, and it also happens to be the pointwise maximum, over all policies, of the Q function at each state-action pair. The optimal policy is deterministic, and it satisfies the equation that it takes the argmax of the optimal Q function. Recall that the Q function tells you your expected return if you take the given action, so obviously the optimal policy should take the action with the best expected return — that's why this last equation is evident. Now that we know this property of the optimal policy, we can rewrite the Bellman equation. The first equation is just the Bellman equation from the previous slides for a given policy pi; now we take the expectation over actions and replace it by what the optimal policy does, which is to take the argmax of the optimal Q function. (There's a typo on my slide: it should say Q* on the right-hand side.) So now we have a Bellman equation that the optimal
policy should satisfy. Now we can do the same thing with the backup operator: we take that Bellman equation and plug an arbitrary Q function into the right-hand side instead of the optimal Q function Q*. Q* is a fixed point of this Bellman operator — that's just a restatement of the Bellman equation — and again, if we apply the operator repeatedly to an arbitrary initial Q function, it converges to Q*, the optimal Q function. In both cases the Banach fixed-point theorem can be used to prove convergence. Based on these ideas, there are two classic algorithms for exactly solving MDPs, sometimes called dynamic programming algorithms because they're quite related to the dynamic programming algorithms used to solve search problems. The first is value iteration: you initialize your Q function arbitrarily and repeatedly do Bellman backups until it converges. The second is policy iteration: you initialize your policy arbitrarily; at each step you first compute — either exactly or approximately — the Q function of that policy, and then you update your policy to be the greedy policy for the Q function you just computed, meaning the new policy takes the argmax of the Q function: the action that's best according to it. I didn't say anything about how you compute Q^pi. One way is to compute it exactly: it happens that the Bellman equation for Q^pi is a linear system of equations, so often you can just solve it. More commonly, if you have a large-scale problem you might not be able to solve that system, so what people often do is a finite number of Bellman backups, which doesn't exactly converge to Q^pi but gives you something approximately equal to it. So those are algorithms you can implement if you have full access to the MDP — if you know the whole table of probabilities.
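A tabular value iteration sketch, for the setting with full access to the MDP (the array layout is my own assumption):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=500):
    """Tabular value iteration: repeated Bellman optimality backups.

    P[s, a, s'] : transition probabilities, R[s, a] : expected reward.
    Returns an approximation of Q*, the fixed point of the backup operator.
    """
    num_states, num_actions, _ = P.shape
    Q = np.zeros((num_states, num_actions))   # arbitrary initialization
    for _ in range(iters):
        V = Q.max(axis=1)                     # V(s') = max_a' Q(s', a')
        Q = R + gamma * (P @ V)               # backup: r + gamma * E[V(s')]
    return Q
```

The greedy policy is then just `Q[s].argmax()` at each state; policy iteration differs in that it evaluates the current policy's Q function (a linear system, as noted above) before taking the greedy step.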
But in reinforcement learning, the usual assumption is that you don't know any of these probability distributions: you don't know the reward function, you don't know the transition probabilities, so all these things have to be estimated from data — or rather, you're only able to access the system through interaction. It turns out that these algorithms can also be implemented when you only access the system through interaction, which I think is kind of remarkable. Here's how it works. Recall our backup formulas for Q^pi and Q*. In both cases we have a certain quantity inside an expectation, and in both cases we can compute an unbiased estimator of that quantity using a single sample: if we sample some data from the system using any policy, we can get an unbiased estimate of the quantity inside the expectation on the right-hand side. So we can do an approximate, noisy version of the Bellman backup, and even with this noise it can be proven that if you choose your step sizes with the right schedule, you're still going to converge to Q^pi or Q*, depending on which algorithm you're implementing. That's pretty much the fundamental idea, and from it you can come up with algorithms that apply in the reinforcement learning setting, where you're just accessing the system through sampling, and you can also start introducing function approximation. So far I haven't said anything about what the Q function is; I've just told you it's a function of state and action.
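The sampled, noisy version of the Bellman optimality backup is the classic tabular Q-learning update; a minimal sketch (the learning rate `alpha` plays the role of the step size on the schedule mentioned above):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One sampled, noisy Bellman optimality backup (tabular Q-learning).

    Instead of the full expectation over next states, we use the single
    observed transition (s, a, r, s') and move Q(s, a) a small step
    toward the backed-up value.
    """
    target = r + gamma * Q[s_next].max()      # single-sample backup value
    Q[s, a] += alpha * (target - Q[s, a])     # step toward the target
    return Q
```

The SARSA variant differs only in the target: it uses `Q[s_next, a_next]` for the action the policy actually took, which makes it a sampled version of the B^pi backup instead.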
Now we can start having neural network Q functions, for example. We parameterize the Q function with a neural network — call it Q_theta. It no longer makes sense to do the Bellman backup exactly, because we can't just set the output values of the neural network; the best we can do is encourage the network to have certain output values. So instead we set up a least squares problem: we write down a quadratic objective saying that the Q function should be approximately equal to the backed-up values, and we minimize it with SGD. One version of this algorithm, introduced about ten years ago, is called neural fitted Q iteration, and it works exactly the way you'd expect: you sample trajectories using your current policy — which might be determined by the Q function, or could be any policy, as it turns out — and then you solve the least squares problem, minimizing the quadratic error based on the Bellman backup for Q*. One thing I haven't mentioned so far is what you actually use as your policy. If you have a Q function, you can turn it into a policy by taking the action with the highest Q value — that's what you typically do. The Q function measures the goodness of all your actions, so you can easily turn it into a policy by taking your best action, or by taking actions whose log probability is proportional to their goodness, or something like that: typically the probability is the exponential of the Q value over some temperature parameter, which is called Boltzmann exploration, whereas if you just take the argmax, that's called the greedy policy.
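The two ways of turning a Q function into a policy that were just mentioned can be sketched as follows (a minimal version; the temperature value is an illustrative assumption):

```python
import numpy as np

def greedy_action(q_values):
    """Greedy policy: take the action with the highest Q value."""
    return int(np.argmax(q_values))

def boltzmann_action(q_values, temperature=1.0, rng=None):
    """Boltzmann exploration: P(a) proportional to exp(Q(s, a) / temperature)."""
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(q_values) / temperature
    z = z - z.max()                            # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))
```

A high temperature makes the action choice nearly uniform (lots of exploration); as the temperature goes to zero, Boltzmann exploration approaches the greedy policy.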
It turns out that with these Q-learning algorithms you don't have to execute the greedy policy for learning to work; you actually have some freedom in which policy you execute. That's one very nice property of these algorithms: you can use an exploration technique, where your policy is actively trying to reach new states or do something new, and still converge towards Q* or Q^pi, as the case may be. So neural fitted Q iteration is a basic way of doing this. A more recent algorithm that has gotten a lot of attention is the one from Mnih et al. at DeepMind, which is basically an online version of this algorithm with a couple of useful tweaks. When you look at the two tricks, they make a lot of sense if you just think about what value iteration is doing. The first is the replay pool: a rolling history of your past data, which is the data you use to fit your Q function; that makes sure you have a representative sample of data to fit the Q function to. The second idea is the target network: instead of using your current Q function and doing Bellman backups on that, you keep a lagged version of your Q function — the target network, a copy of your Q function from some earlier time — and use that in the backups. If you think about value iteration, you have your old Q function and you're trying to make the new one equal to the backed-up version of the old one, so using a target network is just the natural thing to do if you're trying to implement value iteration in an online way. Many extensions have been proposed since then;
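The two DQN tricks just described — the replay pool and the target network — reduce to very little code in sketch form (this is my own minimal rendering, not the DeepMind implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Rolling history of past transitions; sampling uniformly from it gives
    a roughly representative batch of data to fit the Q function to."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions fall off the end

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def sync_target(q_params, target_params):
    """Target network: a lagged copy of the Q function used on the right-hand
    side of the backup, hard-copied every so often (parameters as flat lists)."""
    target_params[:] = q_params
```

Between syncs, the least squares targets `r + gamma * max_a' Q_target(s', a')` are computed against the frozen copy, which is what makes the online procedure look like one step of value iteration.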
I've got a bunch of citations at the bottom of the slide. This DQN algorithm uses the backup B, the backup for Q*. Remember that I also introduced the other backup B^pi, the backup for Q^pi. There's another, very classic algorithm called SARSA, which is essentially an online way of doing the B^pi backup — sort of an online version of policy iteration. It has actually been found to work as well as, or better than, using the B backup in some settings — not all settings, so I think the jury is still out on exactly how these things compare. But I think it's worth considering both policy iteration and value iteration, and all the different online versions of these algorithms, and taking them seriously, because it's not clear right now how they all compare to each other in the function approximation setting. Okay, that's the overview of all the technical parts; now I just have a couple of conclusion slides. Let me summarize the current state of affairs. I introduced two kinds of algorithms: policy gradient algorithms, which explicitly represent a policy and optimize it, and Q function learning algorithms, which explicitly represent a Q function — the goodness of different actions — and use it to implicitly represent a policy. With policy gradient methods there have been successes with several different variants. There are the vanilla policy gradient methods; there's a recent paper on the A3C method, an asynchronous implementation, which gets very good results; and there are the natural policy gradient methods and trust region methods. The video I showed you was obtained using trust region policy optimization, which is one of the methods in this second category.
I think these trust region methods and natural policy gradient methods are more sample efficient than the vanilla methods, because you end up doing more than one gradient update with each little bit of data you collect: with vanilla policy gradient you compute one gradient estimate and then throw the data away, whereas with natural policy gradient you solve a little optimization problem with it, so you get more juice out of it. That's what we have in the policy gradient world. In the Q function world we have the DQN algorithm and some of its relatives, which are descendants of value iteration — you're approximating the Bellman backup of value iteration — and then SARSA, which is related to policy iteration. These are dealing with different Bellman equations, so it's kind of interesting that both kinds of methods work and, as it turns out, have fairly similar behaviors. Here's how I would compare them — this is anecdotal evidence, but I think it's the consensus right now. The Q function methods are more sample efficient when they work, but they don't work as generally as policy gradient methods, and it's a little harder to figure out what's going on when they don't work. That kind of makes sense, because with the policy gradient methods you're optimizing exactly the thing you care about with gradient descent, whereas with Q function methods you're doing something indirect: you're trying to learn a Q function and then hoping it gives you a good policy. I'd also point out that there are some confounds, so it's hard to draw a good conclusion at this point, because people use different time horizons in the policy gradient methods
versus the Q function methods: they do one-step lookaheads with the Q functions and multi-step lookaheads with the policy gradients. So it's not clear whether the differences come from using different time horizons or from differences in how the algorithms work — doing regression for a Q function versus learning a policy with policy gradients. To summarize, here are some of our core model-free reinforcement learning algorithms. (Oops — I'm missing a word in the first column, which should say something like reliability and robustness; that just means: is it going to work on new problems without parameter tuning, or is it going to mysteriously either work or not work?) This is my slightly sloppy summary of all these different algorithms, and I'd say there's still some room for improvement. There might be improvements to the basic methods, because the Q function methods have some nice properties that we don't have in the policy gradient methods — you can easily explore with a different policy than the one you're learning the Q function for, which is really important and which you can't do so easily with policy gradient methods — whereas the policy gradient methods seem like you can just apply them and they're more likely to work, and it's well understood what's going on. I don't know if it's possible to get the best of both worlds, but that's the hope. And that's it for my talk — thank you. Any questions? Oh yes: in model-based reinforcement learning, which lines of research do I find most interesting? I think the work from my colleagues on guided policy search is very nice — I'd say that's a kind of model-based reinforcement learning. I also like some methods that are using the model for faster
learning, like for variance reduction; there's a paper called Stochastic Value Gradients that I like a lot. I think it's a pretty wide-open area, so I don't think there have been a lot of really compelling results where you're able to learn extremely fast, with much better sample efficiency, using a model. It seems like that should be possible, but I don't think it's been demonstrated yet, so maybe in the next couple of years we'll see that happen. Hello — hi, thanks for the talk. I have a question: is it true or not true that most of these problems require some kind of simulated world to run the experiments, the episodes, in? Oh yeah — are you asking whether this works in the real world, is that the question? I would say it does work if you have a lot of patience and you're willing to execute for a while. The locomotion results I showed add up to about two weeks of real time, so it's actually not that bad, especially when you consider that babies and toddlers take a while to learn how to walk properly, even though evolution already puts in a lot of built-in information. So I'd say it can be run in the real world; some of my colleagues at Berkeley are doing experiments where they're running regular reinforcement learning algorithms in the real world — very brave — and I hope to see some nice results from that soon. Thank you. Hi, thanks for your talk, here on the other side. I was wondering what your intuition is on the loss surface of these deep reinforcement learning optimization problems, and maybe especially how it evolves as the policy learns; I should specify, in the policy gradient case. So I think the situation is a little bit different in reinforcement learning from supervised learning. In reinforcement learning, you have one kind of local minimum
in policy space. For example — I'm going to keep going back to the locomotion example because I spent a lot of time on it — let's say you want your robot to walk. There's one local minimum where it just stands and doesn't bother to walk, because there's too much penalty for falling over, and there's another local minimum where it just dives forward, because it gets a little bit of reward for that before it falls to its doom. So I think the hard part of the optimization problem actually comes from the space of behaviors and has nothing to do with the neural network. I've also found that it matters surprisingly little what kind of neural network architecture you use, because I think most of the hardness and the weirdness of the problem comes from what the behavior space looks like rather than what the actual numerical optimization landscape looks like. Cool, thank you. So there are many problems where the reward is only observed at the end of the task — in the terminal state of each episode — and you don't see rewards in intermediate states. How much harder do these problems become for deep reinforcement learning, in your experience? Thanks. Yeah, if you don't get the reward until the end, then it's probably going to be harder to learn. I don't have anything precise to say about that; I think it's going to be harder if your rewards are further away. So for example, in your video, for the last example of getting up and getting the head above a certain height — that could be one where you only get a plus one if you're above and you don't get anything below. Are you doing something where, if you get your head higher, then you still get
something partial? Yeah, so I think we came up with a reward like distance-from-height squared, which made the problem easier. The problem would have been a lot harder if you got zero reward until you got your head above the height; that would be a problem of exploration, where you have to explore all the different states to figure out where you're going to get good reward. Thanks. Okay, one last question. Okay — so I have a question about how you choose to discretize your space and time, because in your locomotion example you clearly have a continuous system, right? Yeah, so it's actually really important how you discretize time — what time step you use — because the algorithm does care about what the time step is: you have discount factors, and you're also sampling a different action at every time step. So if you choose too small a time step, then the rewards will be delayed by more time steps, which makes the credit assignment harder, and also your exploration will be more like a random walk, because you're changing your mind really frequently. So yeah, the time step is pretty important, and I'd say that's a flaw in current methods. Okay, thank you. [Applause] Thank you. So let's take a short break; we reconvene in 15 minutes.
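The off-policy property mentioned in the summary — exploring with one policy while learning the Q function of the greedy policy — can be sketched with tabular Q-learning on a toy chain environment. Everything here (the 5-state chain, the constants) is made up for illustration and is not from the talk:

```python
import numpy as np

# Toy 5-state chain: action 1 moves right, action 0 moves left;
# reaching the last state gives reward 1 and ends the episode.
n_states, n_actions = 5, 2

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = (s2 == n_states - 1)
    return s2, (1.0 if done else 0.0), done

rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))
gamma, alpha, eps = 0.9, 0.5, 0.3

for episode in range(200):
    s, done = 0, False
    while not done:
        # Behavior policy: epsilon-greedy, so it keeps exploring.
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # Off-policy target: max over next actions (greedy policy),
        # regardless of what the behavior policy will actually do next.
        target = r + (0.0 if done else gamma * np.max(Q[s2]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2

print(np.argmax(Q[:-1], axis=1))  # greedy policy: move right in every non-terminal state
```

The `max` in the target is what makes the update off-policy: the learned Q function is for the greedy policy even though the data is collected by the epsilon-greedy one.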
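Two points from the last questions can be made concrete in a few lines: a shaped reward like the distance-from-height-squared mentioned above versus a sparse indicator reward, and how the per-step discount has to depend on the time step if you want a fixed horizon in real time. The numbers here (target height 1.2, a 0.9 per-second discount) are invented for the example:

```python
# 1) Reward shaping: the sparse reward is zero until the goal is reached,
#    while the shaped reward gives a useful learning signal everywhere.
def sparse_reward(head_height, target=1.2):
    return 1.0 if head_height >= target else 0.0

def shaped_reward(head_height, target=1.2):
    return -(head_height - target) ** 2

# 2) Time-step choice interacts with discounting: to keep the same
#    horizon measured in seconds, the per-step discount must change
#    with the control time step dt.
gamma_per_second = 0.9
for dt in (0.1, 0.02, 0.005):                # seconds per control step
    gamma_per_step = gamma_per_second ** dt  # so gamma_per_step ** (1/dt) == gamma_per_second
    print(f"dt={dt}: gamma per step = {gamma_per_step:.5f}")
```

Note how the per-step discount approaches 1 as dt shrinks: with a very small time step, a reward one second away is spread over many more steps, which is the credit-assignment difficulty described in the answer.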