Today we will talk about deep reinforcement learning. The question we'd like to explore is to what degree we can teach systems to perceive and to act in this world from data.

So let's take a step back and think about the full range of tasks that an artificial intelligence system needs to accomplish. Here's the stack, from top to bottom, from input to output. At the top is the environment, the world that the agent is operating in. It is sensed by sensors, which take in the world outside and convert it into raw data interpretable by machines: sensor data. From that raw sensor data you extract features, you extract structure, such that you can make sense of the data, discriminate, separate, understand it. And, as we discussed, you form higher and higher order representations, a hierarchy of representations, on top of which machine learning techniques can be applied. Once the machine learning techniques convert the data into features, into higher order representations, and into simple, actionable, useful information, we aggregate that information into knowledge: we take the pieces of knowledge extracted from the data through machine learning and build a taxonomy, a library of knowledge. With that knowledge, an agent has to reason — to aggregate, to connect pieces of data it has seen in the recent past or the distant past, to make sense of the world it's operating in — and finally to make a plan for how to act in that world based on its objectives, based on what it wants to accomplish. As I mentioned, a simple but commonly accepted definition of intelligence is the ability to accomplish complex goals. So a system operating in this world must have a goal, an objective function, a reward function, and based on that it forms a plan and takes action. And because in many cases it operates in the physical world, it must have tools, effectors, with which it applies those actions to change something about the world. That's the full stack of an artificial intelligence system that acts in the world.

The question is: what kinds of tasks can such a system take on? What kinds of tasks can an artificial intelligence system learn, as we understand AI today? We will talk about the advancement of deep reinforcement learning approaches and some of the fascinating ways they're able to take much of this stack and treat it as an end-to-end learning problem. But we look at games, we look at simple, formalized worlds. While these are still impressive, beautiful, unprecedented accomplishments, they are nevertheless formal tasks. Can we then move beyond games, into expert tasks of medical diagnosis and design, into natural language, and finally into the human-level tasks of emotion, imagination, consciousness?

Let's once again review the stack in practical terms, in the tools we have. The input, for robots operating in the world — from cars to humanoid robots to drones — is lidar, camera, radar, GPS, stereo cameras, audio from microphones, networking for communication, and the various ways to measure kinematics with an IMU. The raw sensory data is then processed: features are formed, then representations, and multiple higher and higher order representations. That's what deep learning gets us. Before the recent successes of neural networks that go deeper, and are therefore able to form higher order representations of the data, that work was done by human experts. Today, networks are able to do that.
That's the representation piece. On top of the representation piece, the final layers of these networks are able to accomplish the supervised learning tasks, the generative tasks, and the unsupervised clustering tasks — that's what we talked about a little in lecture one, and we'll continue tomorrow and Wednesday. You can think about the output of those networks as simple, clean, useful, valuable information: that's the knowledge. That knowledge can be in the form of single numbers — regression over continuous variables — or a sequence of numbers; it can be images, audio, sentences, text, speech.

Once that knowledge is extracted and aggregated, how do we connect it in a multi-resolution way, form hierarchies of ideas, connect ideas? The trivial, silly example is connecting images, activity recognition, and audio: if it looks like a duck, quacks like a duck, and swims like a duck — we do not currently have approaches that effectively integrate this information to produce a higher-confidence estimate that it is in fact a duck. And the planning piece — the task of taking the sensory information, fusing it, and making action, control, and longer-term plans based on that information — is, as we'll discuss today, more and more amenable to the deep learning approach. But to date the most successful approaches have been non-learning, optimization-based ones, as with several of the guest speakers we have, including the creators of this robot, Atlas, at Boston Dynamics.

So the question is: how much of the stack can be learned, end to end, from the input to the output? We know we can learn the mapping from raw data to representation, and from representation to knowledge — even with kernel methods like SVMs, and certainly with neural networks. Mapping from raw sensory data to knowledge is where the primary success of machine learning over the past three decades has been; the automated representation learning of deep learning goes straight from raw data to knowledge. The open question, for us today and beyond, is whether we can expand the red box there — what can be learned end to end — from sensory data all the way to reasoning: aggregating, forming higher representations of the extracted knowledge, forming plans, and acting in this world from the raw sensory data. We will show the incredible fact that we're able to learn exactly what's shown here, end to end, with deep reinforcement learning, on trivial tasks, in a generalizable way. The question is whether that can then move on to the real-world tasks of autonomous vehicles, of humanoid robotics, and so on. That's the open question.

So today, let's talk about reinforcement learning. There are three types of machine learning. Supervised and unsupervised are the categories at the extremes, relative to the amount of human input that's required: for supervised learning, every piece of data that's used for teaching these systems is first labeled by human beings; for unsupervised learning, on the right, no data is labeled by human beings. In between is some sparse input from humans: semi-supervised learning is when only part of the ground truth is provided by humans, and the rest must be inferred, generalized, by the system. That's where reinforcement learning falls. Reinforcement learning is shown there with the cats — as I said, every successful presentation must include cats. These are supposed to be Pavlov's cats:
every time they ring a bell, they're given food, and they learn this process. The goal of reinforcement learning is to learn from sparse reward data, from sparse supervision, and to take advantage of the fact that, in simulation or in the real world, there is a temporal consistency to the world: there is a temporal dynamics that carries from state to state through time. So you can propagate information even when the supervision, the ground truth you receive, is sparse; you can follow that information back through time to infer something about what happened earlier, even if your reward signals were weak. It uses the fact that the physical world evolves through time in a somewhat predictable way to take sparse information and generalize it over the entirety of the experience being learned.

We apply this to two problems today. First, DeepTraffic, whose methodology is deep reinforcement learning. DeepTraffic is a competition that we ran last year and expanded significantly this year. I'll talk about some of the details, and how the folks in this room — on your smartphone today, or if you have a laptop — can be training an agent while I'm talking: training a neural network in the browser. Some of the things we've added: we've now turned it into a multi-agent deep reinforcement learning problem, where you can control up to ten cars with your network. Perhaps less significant, but pretty cool, is the ability to customize the way the agent looks: you can upload different images instead of the car — and people have already begun doing so, to an absurd degree; shown here is a SpaceX rocket — as long as the image maintains the dimensions. The competition is hosted at selfdrivingcars.mit.edu/deeptraffic — we'll return to this later. The code is on GitHub, with some more information and starter code, and a paper describing some of the fundamental insights that will help you win this competition is on arXiv.

So, from supervised learning in lecture one to today: supervised learning we can think of as memorization of ground truth data, in order to form representations that generalize from that ground truth. Reinforcement learning we can think of as a way to brute-force propagate that sparse information through time — to assign quality, reward, to states that do not directly have a reward — to make sense of this world when the rewards are sparse but connected through time. You can think of that as reasoning.

This through-time process is modeled in most reinforcement learning approaches very simply: there's an agent taking an action in a state and receiving a reward. The agent, operating in an environment, executes an action, observes a new state, and receives a reward, and this process continues over and over. As examples, we can think of any of the video games, some of which we'll talk about today. In Atari Breakout, the game is the environment and the agent is the paddle; each action the agent takes has an influence on the evolution of the environment, and success is measured by some reward mechanism — in this case, points given by the game. Every game has a different point scheme that must be converted, normalized, into a form the system can interpret, and the goal is to maximize those points, maximize the reward.
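That loop is simple to write down. Here's a minimal sketch in JavaScript — the environment and its reset/step interface are hypothetical stand-ins, not any particular library's API — just to make the state-action-reward cycle concrete:

```javascript
// A toy environment: the agent walks along a line; reaching x = +5 gives reward 1,
// falling to x = -5 ends the episode with reward 0. Interface is illustrative only.
function makeEnv() {
  let x = 0;
  return {
    reset() { x = 0; return x; },
    step(action) {                          // action is -1 (left) or +1 (right)
      x += action;
      const done = (x === 5 || x === -5);
      const reward = (x === 5) ? 1 : 0;     // sparse reward: only at the goal
      return { state: x, reward, done };
    }
  };
}

// The agent-environment loop described above: observe state, act, receive reward.
const env = makeEnv();
let state = env.reset();
let done = false;
let ret = 0;
while (!done) {
  const action = Math.random() < 0.5 ? -1 : 1;   // a random policy, for now
  const result = env.step(action);
  state = result.state;
  ret += result.reward;
  done = result.done;
}
console.log('episode return:', ret);
```

The reward here is sparse — it arrives only at the goal — which is exactly the setting reinforcement learning has to cope with.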
Another example is the continuous problem of cart-pole balancing. The goal is to balance a pole on top of a moving cart. The state is the angle of the pole, its angular speed, the cart's position, and its horizontal velocity; the actions are the horizontal forces applied to the cart; and the reward is 1 at each time step that the pole is still upright.

Then there are the first-person shooters and, now, StarCraft, the strategy games. In the case of a first-person shooter like Doom, the environment is the game and the goal is to eliminate all opponents; the state is the raw game pixels coming in; the actions are moving up, down, left, right, and so on; and the reward is positive when eliminating an opponent and negative when the agent is eliminated. In industrial robotics — bin packing with a robotic arm — the goal is to pick up a device from a box and put it into a container. The state is the raw pixels of the real world that the robot observes; the actions are the possible movements of the robot, moving the different actuators through their degrees of freedom to realize a position of the arm; and the reward is positive when placing a device successfully and negative otherwise.

Everything here can be modeled as a Markov decision process: there's a state s0, an action a0, a reward is received, a new state s1 is reached — and again action, reward, state, action, reward, state, until a terminal state is reached. The major components of reinforcement learning are a policy — a plan of what action to perform in every single state; a value function — some sense of how good a state is to be in, or how good an action is to take in a state; and sometimes a model, with which the agent represents the environment — some sense of the dynamics of the environment it's operating in, which is useful for making decisions about actions.

Let's take a trivial example: a grid world of three by four, twelve squares. We start at the bottom left and are tasked with walking about this world to maximize reward. The reward at the top right is +1, the square below that is -1, and every step you take is a punishment, a negative reward, of -0.04. So what is the optimal policy in this world? When everything is deterministic, perhaps this is the policy when you start at the bottom left: because every step hurts, every step has a negative reward, you want to take the shortest path to the square with the maximum reward. When the state space is non-deterministic — as presented before, with probability 0.8 when you choose to go up you go up, but with probability 0.1 you go left and 0.1 you go right; unfair, again, much like life — that would be the optimal policy. What is the key observation here? That every single state in the space must have a plan, because given the non-deterministic aspect of the control you can't control where you're going to end up, so you must have a plan for every place. That's the policy: an optimal action to take in every single state.

Now suppose we change the reward structure, and every step we take has a reward of -2, so it really hurts — there's a high punishment for every single step. Then no matter what, we always take the shortest path to the only spot on the board that doesn't result in punishment. If we decrease the punishment of each step to -0.1, the policy changes: some extra degree of wandering is encouraged. As we lower the punishment further, back to -0.04, more and more wandering is allowed. And when we finally make the step reward positive, so that every step increases the reward, there's a significant incentive to stay on the board without ever reaching the destination — kind of like college, for a lot of people.
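To make this concrete, here's a standard value-iteration sketch in JavaScript for the grid world as described — +1 and -1 terminals on the right, a step reward of -0.04, and the 0.8/0.1/0.1 noisy control. It's a textbook computation under those assumptions (no blocked cells, a discount of 1, a fixed iteration count), not code from the lecture:

```javascript
// Value iteration on the 3x4 grid world described above.
const ROWS = 3, COLS = 4;
const STEP_REWARD = -0.04;
const GAMMA = 1.0;                           // no discounting in this small episodic world
const terminals = { '0,3': +1, '1,3': -1 };  // row 0 is the top row

const actions = {
  up:    { dr: -1, dc: 0, perp: ['left', 'right'] },
  down:  { dr: +1, dc: 0, perp: ['left', 'right'] },
  left:  { dr: 0, dc: -1, perp: ['up', 'down'] },
  right: { dr: 0, dc: +1, perp: ['up', 'down'] },
};

// Moving off the board leaves you where you are.
function move(r, c, a) {
  const nr = r + actions[a].dr, nc = c + actions[a].dc;
  return (nr < 0 || nr >= ROWS || nc < 0 || nc >= COLS) ? [r, c] : [nr, nc];
}

// Expected value of taking action a in (r, c): 0.8 intended, 0.1 each perpendicular.
function qValue(V, r, c, a) {
  const outcomes = [[a, 0.8], [actions[a].perp[0], 0.1], [actions[a].perp[1], 0.1]];
  let q = 0;
  for (const [dir, p] of outcomes) {
    const [nr, nc] = move(r, c, dir);
    q += p * (STEP_REWARD + GAMMA * V[nr][nc]);
  }
  return q;
}

let V = Array.from({ length: ROWS }, () => Array(COLS).fill(0));
for (let iter = 0; iter < 100; iter++) {
  const next = V.map(row => row.slice());
  for (let r = 0; r < ROWS; r++)
    for (let c = 0; c < COLS; c++) {
      const t = terminals[`${r},${c}`];
      if (t !== undefined) { next[r][c] = t; continue; }   // terminals keep their reward
      next[r][c] = Math.max(...Object.keys(actions).map(a => qValue(V, r, c, a)));
    }
  V = next;
}

for (let r = 0; r < ROWS; r++)
  console.log(V[r].map(v => v.toFixed(2)).join('  '));
```

Changing STEP_REWARD to -2, -0.1, or a positive value and re-running is exactly the experiment described above: the argmax action in each cell shifts from shortest-path urgency to wandering to never leaving.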
So, the value function. The way we think about the value of a state, or of anything in the environment, is the reward we're likely to receive in the future — where we discount future rewards, because we can't always count on them. The discount factor gamma, applied further and further out into the future, decreases the importance of the rewards received there. The strategy is to take the sum of these discounted rewards and maximize it. That's what reinforcement learning hopes to achieve.

With Q-learning, we use any policy to estimate the value of taking an action in a state — it's off-policy. We move about the world and use the Bellman equation, here on the bottom, to continuously update our estimate of how good a certain action is in a certain state. This allows us to operate in a much larger state space and a much larger action space: we move about this world, through simulation or in the real world, taking actions and updating our estimates of how good certain actions are over time. On the left is the updated value; the old estimate is the starting value for the equation, and we update that old estimate with the reward received by taking action a in state s, plus the maximum reward that's possible in the following state, discounted. That update is scaled by a learning rate: the higher the learning rate, the faster we learn — the more value we assign to new information. That's it — that's Q-learning. This simple update rule allows us to explore the world and, as we explore, get more and more information about what's good to do in this world.

And there's always a balance, in the various problem spaces we'll discuss, between exploration and exploitation. As you form a better and better estimate of the Q-function — of which actions are good to take — you start to get a sense of the best action to take. But it's not a perfect sense; it's still an approximation, so there's value in exploration. The better your estimate becomes, though, the less benefit exploration has. So usually we want to explore a lot in the beginning and less and less towards the end, and when we finally release the system out into the world to operate at its best, we have it operate as a greedy system, always taking the optimal action according to the Q-value function. Everything I'm talking about now is parametrized, and these are parameters that are very important for winning the DeepTraffic competition, which uses this very algorithm with a neural network at its core.

For a simple table representation of a Q-function — where the rows are states s1 through s4 and the columns are actions a1 through a4 — we can think of this table as randomly initialized, or initialized in any way that's not representative of actual reality. As we move about this world and take actions, we update this table with the Bellman equation shown up top. The slides, which are now online, show a simple pseudocode algorithm for how to run this Bellman equation, and over time the approximation becomes the optimal Q-table.
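As a concrete version of that pseudocode: the update rule is Q(s,a) ← Q(s,a) + α·(r + γ·max_a′ Q(s′,a′) − Q(s,a)), and here it is in JavaScript on a hypothetical toy chain world (walk right to reach a reward), with epsilon-greedy exploration that decays over episodes — a sketch, not the slide's exact code:

```javascript
// Toy chain world: states 0..5, start at 0; action 0 moves left, action 1 moves
// right; reaching state 5 gives reward 1 and ends the episode.
const N_STATES = 6, N_ACTIONS = 2;
const ALPHA = 0.1, GAMMA = 0.9;
let epsilon = 1.0;                                   // exploration rate, decayed below

function step(s, a) {
  const s2 = Math.max(0, Math.min(N_STATES - 1, s + (a === 1 ? 1 : -1)));
  const done = (s2 === N_STATES - 1);
  return { s2, r: done ? 1 : 0, done };
}

// Q-table, initialized to zeros (any non-informative initialization works).
const Q = Array.from({ length: N_STATES }, () => Array(N_ACTIONS).fill(0));
const argmax = row => (row[1] > row[0] ? 1 : 0);

for (let episode = 0; episode < 500; episode++) {
  let s = 0, done = false;
  while (!done) {
    // Epsilon-greedy: explore with probability epsilon, otherwise exploit.
    const a = Math.random() < epsilon
      ? Math.floor(Math.random() * N_ACTIONS)
      : argmax(Q[s]);
    const { s2, r, done: d } = step(s, a);
    // Bellman update: nudge Q(s,a) toward r + gamma * max_a' Q(s', a').
    const target = r + (d ? 0 : GAMMA * Math.max(...Q[s2]));
    Q[s][a] += ALPHA * (target - Q[s][a]);
    s = s2; done = d;
  }
  epsilon = Math.max(0.05, epsilon * 0.99);          // explore a lot early, less later
}
console.log(Q);   // Q[s][1] (go right) should dominate in every state
```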
The problem is that the Q-table becomes exponential in size when we take in raw sensory information, as we do with cameras in DeepCrash, or in DeepTraffic, which takes in the raw cells of the full grid space. When you take the arcade games, they're taking in the raw pixels of the game, and when you take the game of Go, it's taking in the raw state of the board as the input. The potential state space — the number of possible states — is extremely large: larger than we can hold in memory, and larger than we could ever accurately approximate through the simple Bellman update over time, through simulation.

This is where deep reinforcement learning comes in. Neural networks are really good approximators; they're really good at exactly this task of learning this kind of Q-function. So, as we started with supervised learning, where neural networks helped us memorize patterns using supervised ground truth data, we now move to reinforcement learning, which hopes to propagate outcomes into knowledge. Deep learning allows us to do so on much larger state spaces and much larger action spaces, which means it's generalizable: it's much more capable of dealing with the raw stuff of sensory data, and therefore with the broad variation of real-world applications. It does so because it's able to learn representations, as we discussed on Monday: the understanding comes from converting the raw sensory information into simple, useful information, based on which the action in a particular state can be taken.

So instead of the Q-table, instead of this Q-function, we plug in a neural network, where the input is the state, no matter how complex, and the output is a value for each of the actions you could take. Input is the state; output is the value of each action. It's simple. This is the Deep Q-Network, DQN, at the core of the success of DeepMind — a lot of the cool stuff you see about AI playing video games is DQN, or variants of DQN, at play. This was first shown in a Nature paper from DeepMind, where the success came from playing different games, including the Atari games.

How are these things trained? Very similarly to supervised learning. The Bellman equation, up top, takes the reward and the discounted expected reward from future states. The loss function for the neural network — and a neural network learns with a loss function — takes the reward received at the current state, does a forward pass through the network to estimate the value of the best action in the future state, and subtracts from that the forward pass through the network for the current state and action. So you take the difference between what your Q estimator, the neural network, believes the value of the current state is, and what it more likely is to be, based on the value of the future states reachable through the actions you can take. Here's the algorithm: the input is the state and the output is the Q-value for each action — or, in this diagram, the input is the state and an action and the output is that Q-value; they're very similar architectures. Given a transition (s, a, r, s′) — being in state s, taking action a, receiving reward r, and reaching state s′ — the update is: do a feed-forward pass through the network for the current state, then do a feed-forward pass for each of the possible actions in the next state; that's how we compute the two parts of the loss function; then update the weights using backpropagation.
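Here's the shape of that loss in code — a sketch that swaps the deep network for a linear function approximator, plainly so the gradient step fits in a few lines; the squared TD error it minimizes, (r + γ·max_a′ Q(s′,a′) − Q(s,a))², is the one on the slide:

```javascript
// Q(s, a) = w[a] . s  — one weight vector per action, standing in for a deep net.
const N_FEATURES = 4, N_ACTIONS = 3;
const GAMMA = 0.99, LEARNING_RATE = 0.01;
const w = Array.from({ length: N_ACTIONS },
                     () => Array.from({ length: N_FEATURES }, () => Math.random() * 0.01));

const dot = (u, v) => u.reduce((acc, x, i) => acc + x * v[i], 0);
const qValues = s => w.map(wa => dot(wa, s));        // forward pass: one Q per action

// One gradient step on a single transition (s, a, r, s2, done).
function dqnUpdate(s, a, r, s2, done) {
  // Target: reward plus discounted best Q in the next state (the Bellman part).
  const target = r + (done ? 0 : GAMMA * Math.max(...qValues(s2)));
  const tdError = target - qValues(s)[a];            // what we believe vs. what it should be
  // Squared-error gradient for a linear model: move w[a] along tdError * s.
  for (let i = 0; i < N_FEATURES; i++) w[a][i] += LEARNING_RATE * tdError * s[i];
  return tdError * tdError;                          // the loss, if you want to track it
}

// Example transition (all values illustrative):
console.log(dqnUpdate([0.1, 0.5, -0.2, 1.0], 2, 1.0, [0.0, 0.4, -0.1, 0.9], false));
```

With a deep network, only the forward pass and the gradient computation change; the target and the TD error are exactly the same.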
Again: loss function and backpropagation — that's how the network is trained. This has actually been around for much longer than DeepMind; a few tricks made it really work. Experience replay is the biggest one. As the games are played through simulation — or, if it's a physical system, as it acts in the world — the system collects its observations into a library of experiences, and training is performed by randomly sampling batches from that library of past experiences. So you're not always training on the natural, continuous evolution of the system; you're training on randomly picked batches of those experiences. It seems like a subtle trick, but it's a really important one: it keeps the system from overfitting to a particular evolution of the game, of the simulation.

Another important and again subtle trick — in a lot of deep learning approaches, the subtle tricks make all the difference — is fixing the target network used in the loss function. If you notice, you have to use the same neural network, the DQN, to estimate the value of both the current state-action pair and the next one; you're using it multiple times. And as you perform that operation you're updating the network, which means the target inside the loss function is always changing: by its very nature, your loss function is changing all the time as you're learning, and that's a big problem for stability — it can create big problems for the learning process. The trick is to fix the network used to compute the target and only update it every, say, thousand steps. As you train the network, the copy that computes the target inside the loss function stays fixed, which produces a more stable loss computation: the ground doesn't shift under you as you're trying to find a minimum, and the loss function doesn't change in unpredictable, difficult-to-understand ways.

Then there's reward clipping, which comes up whenever a system seeks to operate in a generalized way. Across these various games the points are different — some are low, some are high, some go positive and negative — so they're all normalized, such that positive points become +1 and negative points become -1. That's reward clipping: it simplifies the reward structure. And because a lot of the games run at 30 or 60 FPS, and it's not valuable to take actions at such a high rate — particularly in these Atari games — you only take an action every four frames, while still taking in all the frames as part of the temporal window used to make decisions.

These are tricks, but hopefully this gives you a sense of the kind of things necessary both for seminal papers like this one and for the more important accomplishment of winning DeepTraffic: the tricks make all the difference. In the table on the bottom, a circle means the technique is used and an X means it's not, for the fixed target network and experience replay, across Breakout, River Raid, Seaquest, and Space Invaders — the higher the number, the more points achieved. It gives you a sense that replay and the fixed target together yield significant improvements in the performance of the system: an order of magnitude — two orders of magnitude for Breakout.
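In code, those tricks might look like this — a sketch, not the paper's implementation: a fixed-capacity replay memory sampled uniformly at random, a target-weight copy refreshed only every so many steps, and reward clipping:

```javascript
// Experience replay: store transitions, sample random minibatches for training.
class ReplayMemory {
  constructor(capacity) { this.capacity = capacity; this.data = []; this.next = 0; }
  push(transition) {                     // transition: { s, a, r, s2, done }
    this.data[this.next] = transition;   // overwrite the oldest once full (ring buffer)
    this.next = (this.next + 1) % this.capacity;
  }
  sample(batchSize) {                    // assumes the memory is non-empty
    const batch = [];
    for (let i = 0; i < batchSize; i++)
      batch.push(this.data[Math.floor(Math.random() * this.data.length)]);
    return batch;
  }
}

// Fixed target network: the weights used for the Bellman target are a frozen
// copy, refreshed only every `syncEvery` training steps so the target stays stable.
// Weights here are per-action rows of numbers, as in the linear sketch above.
function maybeSyncTarget(onlineWeights, targetWeights, step, syncEvery) {
  if (step % syncEvery === 0)
    for (let a = 0; a < onlineWeights.length; a++)
      targetWeights[a] = onlineWeights[a].slice();
  return targetWeights;
}

// Reward clipping, as described: all positive rewards become +1, negative -1.
const clipReward = r => Math.sign(r);
```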
Here is pseudocode for implementing DQN learning. The key thing to notice — and you can look at the slides — is that the while loop of playing through the games and selecting actions is not, itself, the training: it saves the observations — state, action, reward, next state — into replay memory, into that library, and then you sample randomly from that replay memory to train the network on the loss function. And up top, with probability epsilon, you select a random action. That epsilon is the probability of exploration, and it decreases over time through the training process — that's something you'll see in DeepTraffic as well: you want to explore a lot at first, and less and less over time.
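Here's that loop as a structural sketch in JavaScript — the network, trainer, and environment are stubs standing in for the pieces sketched earlier; the point is where acting, storing, sampling, and the epsilon decay sit relative to each other:

```javascript
// Structural sketch of the DQN loop: act epsilon-greedily, store the transition,
// then train on a random minibatch from replay memory. Stubs stand in for the
// real network and environment.
const qForward = s => [0, 0, 0, 0, 0];            // stub: one Q-value per action
const trainOnBatch = batch => {};                 // stub: one gradient step on the TD loss
const env = { reset: () => 0, step: a => ({ s2: 0, r: 0, done: Math.random() < 0.01 }) };
const memory = [];                                // stands in for ReplayMemory above

let epsilon = 1.0;
const EPSILON_MIN = 0.05, EPSILON_DECAY = 0.999, BATCH = 32, N_ACTIONS = 5;

for (let episode = 0; episode < 100; episode++) {
  let s = env.reset(), done = false;
  while (!done) {
    // With probability epsilon, a random action; otherwise the greedy one.
    const q = qForward(s);
    const a = Math.random() < epsilon
      ? Math.floor(Math.random() * N_ACTIONS)
      : q.indexOf(Math.max(...q));
    const { s2, r, done: d } = env.step(a);
    memory.push({ s, a, r, s2, done: d });        // playing only *stores* experience...
    if (memory.length >= BATCH) {                 // ...training samples it at random
      const batch = Array.from({ length: BATCH },
        () => memory[Math.floor(Math.random() * memory.length)]);
      trainOnBatch(batch);
    }
    s = s2; done = d;
    epsilon = Math.max(EPSILON_MIN, epsilon * EPSILON_DECAY);  // explore less over time
  }
}
```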
This algorithm has been able to accomplish, in 2015 and since, a lot of incredible things — things that made the AI world think we were onto something, that general AI was within reach: for the first time, raw sensor information was used to create a system that acts in the world and makes enough sense of the physics of that world to succeed in it, from very little information. But these games are trivial, even though there are a lot of them. This DQN approach has been able to outperform human-level performance on a lot of the Atari games — that's what's been reported on — but again, these games are trivial.

What I think — and perhaps I'm biased — is one of the greatest accomplishments of artificial intelligence in the last decade, at least from the philosophical or research perspective, is AlphaGo Zero: first AlphaGo, and then AlphaGo Zero, DeepMind's systems that beat the best in the world at the game of Go. So what's the game of Go? I won't get into the rules, but basically it's played on a 19-by-19 board, shown in the bottom row of the table on the slide. For a board of 19 by 19, the number of legal game positions is about 2 times 10 to the power of 170. That's a very large number of possible positions to consider at any one time, and especially as the game evolves, the number of possible moves is huge — much larger than in chess. That's why the AI community thought this game was not solvable — until 2016, when AlphaGo used human expert positions to seed, in a supervised way, a reinforcement learning approach — I'll describe it in a little bit of detail in a couple of slides — and beat the best in the world. And then AlphaGo Zero — to me, the accomplishment of the decade in AI — was able to play with no training data on human expert games and beat the best in the world at an extremely complex game. This is not Atari: it's a much higher order of difficulty, and the quality of players it's competing against is much higher. It was able, extremely quickly, to achieve a rating better than AlphaGo, better than the different variants of AlphaGo, and certainly better than the best of the human players, in 21 days of self-play.

So how does it work? All of these approaches — much like the previous, traditional ones not based on deep learning — use Monte Carlo tree search, MCTS. When you have such a large state space, you start at a board position and you play, choosing moves with some exploitation-exploration balancing — choosing to explore totally new positions, or to go deep into the positions you know are good — until the bottom of the game is reached, the final state, and then you backpropagate the quality of the choices that led to that position. In that way you learn the value of board positions and of play. That's been used by the most successful Go-playing engines before AlphaGo and since. But you might be able to guess the difference between AlphaGo and the previous approaches: it uses a neural network as the "intuition", quote-unquote, for which next board positions are the good ones to explore.

And the key things — again, the tricks make all the difference — that made AlphaGo Zero work, and work much better than AlphaGo, are these. First, because there was no expert play, instead of human games AlphaGo Zero used that very same Monte Carlo tree search to do an intelligent look-ahead, based on the neural network's prediction of which states are good, and checked how good those states indeed are. It's a simple look-ahead that provides the ground truth, the target correction, that produces the loss function. The second part is what's now called multi-task learning: the network is, quote-unquote, two-headed, in the sense that it outputs both the probability of which move to take — the obvious thing — and a probability of winning. There are a few ways to combine that information and continuously train both parts of the network depending on the choice taken: you want to take the best choice in the short term while reaching the positions with the highest likelihood of winning for the player whose turn it is. And another big step is that they updated from the 2015 architecture to the state of the art — the architecture that won ImageNet: residual networks, ResNet. That's it — and those little changes made all the difference.
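Structurally, that two-headed network can be sketched in a few lines: a shared trunk feeding a policy head (a distribution over moves) and a value head (a single score in [-1, 1] standing for the chance of winning). The sizes and the plain-JavaScript forward pass below are purely illustrative — the real thing is a deep residual network:

```javascript
// A tiny "two-headed" network: shared trunk, then a policy head and a value head.
const N_INPUT = 8, N_HIDDEN = 16, N_MOVES = 4;   // toy sizes, not Go-sized
const rand = () => Math.random() * 0.1 - 0.05;
const mat = (rows, cols) => Array.from({ length: rows },
                                       () => Array.from({ length: cols }, rand));
const Wtrunk = mat(N_HIDDEN, N_INPUT);
const Wpolicy = mat(N_MOVES, N_HIDDEN);
const Wvalue = mat(1, N_HIDDEN);

const dot = (w, x) => w.reduce((acc, wi, i) => acc + wi * x[i], 0);
const relu = v => v.map(x => Math.max(0, x));
const softmax = v => {
  const m = Math.max(...v);
  const e = v.map(x => Math.exp(x - m));
  const z = e.reduce((a, b) => a + b, 0);
  return e.map(x => x / z);
};

function forward(boardFeatures) {
  const h = relu(Wtrunk.map(row => dot(row, boardFeatures)));  // shared representation
  return {
    policy: softmax(Wpolicy.map(row => dot(row, h))),  // head 1: probability of each move
    value: Math.tanh(dot(Wvalue[0], h)),               // head 2: estimated chance of winning
  };
}

console.log(forward(Array(N_INPUT).fill(0.5)));
```

During self-play, the policy head steers which branches MCTS explores, and the value head scores positions without playing them out to the end — the two pieces of "intuition" described above.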
So that takes us to DeepTraffic, and the eight billion hours spent stuck in traffic — America's pastime. We try to simulate the behavior layer of driving — not the immediate control, not the motion planning, but what sits on top of those control decisions: the human-interpretable decisions of changing lanes, of speeding up and slowing down — modeling that in a micro-traffic simulation framework of the kind that's popular in traffic engineering, shown here, and we apply deep reinforcement learning to it. I'll call it DeepTraffic. The goal is to achieve the highest average speed over a long period of time, weaving in and out of traffic. For students here, the requirement is to follow the tutorial and achieve a speed of 65 miles per hour — and, if you really want to win, a speed over 70 miles per hour, and perhaps upload your own image to make sure you look good doing it.

What you should do — clear instructions to compete: read the tutorial. You can change parameters in the code box on the website, selfdrivingcars.mit.edu/deeptraffic. Click the white button that says "apply code", which applies the code you write: these are the parameters you specify for the neural network; it applies those parameters, creates the architecture you specify, and now you have a network, written in JavaScript, living in the browser, ready to be trained. Then you click the blue button that says "run training", and that trains the network much faster than the one being visualized in the browser — a thousand times faster — by evolving the game, making decisions, taking in the grid space, as I'll talk about in a second. The speed limit is 80 miles an hour; based on the various adjustments we've made to the game, reaching an average of 80 miles an hour is certainly impossible, and reaching some of the speeds that were achieved last year is much, much more difficult. Finally, when you're happy and the training is done, submit the model to the competition — for the super eager, dedicated students, you can do so every five minutes — and to visualize your submission you can click "request visualization", specifying a custom image and color.

Okay, so here's the simulation. Speed limit: 80 miles an hour. Twenty cars on the screen; one of them is red — that one is controlled by a neural network. It's allowed the actions of speeding up, slowing down, changing lanes left or right, or staying exactly the same. The other cars are pretty dumb: they speed up, slow down, turn left and right, but they don't have a purpose in their existence — they do so randomly, or at least a purpose has not been discovered.

The road is a grid space, an occupancy grid. When a cell is empty, it's set to 80, meaning the grid value is whatever speed would be achievable if you were inside that cell; when there's a slower car in a cell, the value of that cell is the speed of that car. That's the state space, the state representation, and you choose what slice of that state space you take in — that's the input to the neural network. For visualization purposes, you can choose normal or fast speed for watching the network operate, and there are display options to help you build intuition about what the network takes in and what space the car is operating in. The default adds no extra information; then there's the learning input, which visualizes exactly which part of the road serves as the input to the network; then there's the safety system, which I'll describe in a little bit, showing all the parts of the road the car is not allowed to go into because it would result in a collision — and collisions, in JavaScript, would be very difficult to animate; and the full map.

Here's the safety system. You can think of it as ACC, basic radar, ultrasonic sensors helping you avoid the obvious collisions with obviously detectable objects around you. The task for this red car, for this neural network, is to move about this space under the constraints of the safety system — the red shows all the parts of the grid it's not able to move into. So the goal for the car is to not get stuck in traffic: to make big, sweeping motions to avoid crowds of cars. Like DQN, the input is the state space and the output is the value of the different actions, and based on the epsilon parameter, through the training process and the inference/evaluation process, you choose how much exploration you want to do — these are all parameters. The learning is done in the browser, on your own computer, utilizing only the CPU. The action space is of size five — I'm giving you some of the variables here, and you can go back to the slides to look at them. The "brain", quote-unquote, is the thing that takes in the state and the reward, does a forward pass, and produces the next action; the brain is where the neural network is contained, for both training and evaluation. The learning input can be controlled in width and in forward and backward length: lanesSide is the number of lanes to the side that you see, patchesAhead is the patches ahead that you see, and patchesBehind is the patches behind that you see. And new this year, you can control the number of agents controlled by the neural network, anywhere from one to ten.
The evaluation is performed exactly the same way: you have to achieve the highest average speed over the agents. The critical thing here is that the agents are not aware of each other, so they're not jointly planning: the network is trained under the joint objective of achieving the best average speed for all of them, but the actions are taken in a greedy way by each. It's very interesting what can be learned in this way, and these kinds of approaches are scalable to an arbitrary number of cars — you could imagine us plopping down the best cars from this class together, the best neural networks, and having them compete — because, given their greedy operation, the number of networks that can concurrently operate is fully scalable.

There are a lot of parameters: the temporal window; the layers — the many layer types that can be added; here's a fully connected layer with ten neurons; the activation functions — all of these can be customized, as specified in the tutorial. The final layer is a fully connected layer whose output is a five-way regression, giving the value of each of the five actions. And there are a lot of more specific parameters, some of which we've just discussed: from gamma, to epsilon, to the experience replay size, to the temporal window, to the optimizer — the learning rate, momentum, batch size, and the l2 and l1 decay for regularization — and so on.
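Putting those together, an entry in the code box might look like the following. It follows the ConvNetJS-style deep Q-learning setup the site builds on, but treat the exact field names and values as illustrative, not as the official starter code — the tutorial is the authority:

```javascript
// Illustrative DeepTraffic code-box contents (ConvNetJS-style deepqlearn setup;
// field names and values are an approximation, not the official starter code).
lanesSide = 2;              // lanes visible to each side
patchesAhead = 30;          // grid patches ahead of the car
patchesBehind = 10;         // grid patches behind
trainIterations = 100000;   // how long to train in the background

var num_inputs = (lanesSide * 2 + 1) * (patchesAhead + patchesBehind);
var num_actions = 5;        // no action / accelerate / decelerate / go left / go right
var temporal_window = 3;    // how many past frames the network also sees

var layer_defs = [];
layer_defs.push({ type: 'input', out_sx: 1, out_sy: 1,
                  out_depth: num_inputs * temporal_window
                           + num_actions * temporal_window + num_inputs });
layer_defs.push({ type: 'fc', num_neurons: 32, activation: 'relu' });
layer_defs.push({ type: 'fc', num_neurons: 32, activation: 'relu' });
layer_defs.push({ type: 'regression', num_neurons: num_actions });  // one Q-value per action

var opt = {
  temporal_window: temporal_window,
  experience_size: 10000,        // replay memory size
  gamma: 0.9,                    // discount factor
  learning_steps_burnin: 1000,   // random actions before learning kicks in
  epsilon_min: 0.05,             // floor on exploration during training
  epsilon_test_time: 0.0,        // fully greedy at evaluation time
  layer_defs: layer_defs,
  tdtrainer_options: { learning_rate: 0.001, momentum: 0.0,
                       batch_size: 64, l2_decay: 0.01 },
};

brain = new deepqlearn.Brain(num_inputs, num_actions, opt);
```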
There's a big white button that says "apply code". Pressing it kills all the work you've done up to that point, so be careful: you should press it only at the very beginning — especially if you happen to leave your computer running in training for several days, as folks have done. The blue training button you press, and it trains based on the parameters you specify, and the network state gets shipped to the main simulation from time to time. So the thing you see in the browser when you open up the website is running the same network that's being trained, and it regularly updates that network, so it's getting better and better: even if the training takes weeks, it's constantly updating the network you see on the left. If the car for the network you're training is just standing in place and not moving, it's probably time to restart and change the parameters — maybe add a few layers to your network. The number of training iterations is certainly an important parameter to control.

The evaluation is something we've done a lot of work on since last year, to reduce the degree of randomness — to remove the incentive to submit the same code over and over again in the hope of producing a higher evaluation score. The method for evaluation: we collect the average speed over runs of about 45 seconds of game each — not minutes, 45 simulated seconds — there are 500 of those, and we take the median speed of the 500 runs. It's done server-side, so it's extremely difficult to cheat — I urge you to try. You can also try it locally: there's a "start evaluation run", but that one doesn't count; that's just for you to feel better about your network and to build your own intuition — it should produce a result very similar to the one produced on the server. And, as I said, we've significantly reduced the influence of randomness, so the score, the speed you get for the network you design, should be very similar with every evaluation.

Loading and saving: if the network is huge and you want to switch computers, you can save the network — it saves both the architecture of the network and the weights — and you can load it back in. Obviously, it doesn't save any of the training experience you've already accumulated; you can't do transfer learning with JavaScript in the browser, yet. Submitting your network: "submit model to competition" — and make sure you run training first, otherwise it'll be the randomly initialized network and will not do so well. You can resubmit as often as you like, and the highest score is what counts. The coolest part is that you can load your custom image, specify colors, and request the visualization — we have not yet shown the visualization, but I promise you it's going to be awesome. Again: read the tutorial, change the parameters in the code box, click "apply code", run training. Everybody in this room, on the way home, on the train — hopefully not in your car — should be able to do this in the browser, and then you can request a visualization; because it's an expensive process that we have to run server-side, you have to actually want it. The competition link is there, the GitHub starter code is there, and the details, for those who truly want to win, are in the arXiv paper.

So, a question that will come up throughout is whether these reinforcement learning approaches — or rather, whether action, planning, and control — are amenable to learning at all. Certainly in the case of driving, we can't do what AlphaGo Zero did — learn from scratch, from self-play — because that would result in millions of crashes on the way to learning to avoid crashes, unless we're working, like we are with DeepCrash, on an RC car, or working in simulation. So we can look at expert data, at driver data, which we have a lot of, and learn from it. It's an open question whether this is applicable. To date — and I'll bring up two companies because they're both guest speakers — deep RL is not involved in the most successful robots operating in the real world. In the case of Boston Dynamics, most of the perception, control, and planning, as in this robot, does not involve learning approaches, except for minimal additions on the perception side, to the best of our knowledge. And certainly the same is true with Waymo, as the speaker on Friday will discuss: deep learning is used a little bit in perception, on top, but most of the work is done from the sensors with optimization-based, model-based approaches — trajectory generation, and optimizing which trajectory is best to avoid collisions. Deep RL is not involved.

And, coming back again and again: the unexpected local pockets of high reward that arise in all of these situations when applied in the real world. For the cat video — it's pretty short — where the cats are ringing the bell and learning that ringing the bell maps to food, I urge you to think about how that behavior can evolve over time in unexpected ways that may not have a desirable effect, where the final reward is in the form of food and the intended effect is to ring the bell. That's where AI safety comes in, in the Artificial General Intelligence course in two weeks — something we'll explore extensively: how these reinforcement learning and planning algorithms evolve in ways that are not expected, how we can constrain them, and how we can design reward functions that result in safe operation. So I encourage you to come to the talk on Friday at 1:00 p.m. — as a reminder, 1:00 p.m., not 7:00 p.m. —
in Stata 32-123, and to the awesome talks in two weeks, from Boston Dynamics to Ray Kurzweil and so on, for AGI. Now, tomorrow we'll talk about computer vision and SegFuse. Thank you, everybody. [Applause]