MIT 6.S094: Deep Reinforcement Learning for Motion Planning
QDzM8r3WgBw • 2017-01-22
Transcript
All right, hello everybody, welcome back. Glad you came back. Today we will unveil the first tutorial, the first project, code-named DeepTraffic, where your task is to solve the traffic problem using deep reinforcement learning. I'll talk about what's involved in designing a network there, how you submit your own network, and how you participate in the competition. As I said, the winner gets a very special prize, to be announced later.

What is machine learning? There are several types. There's supervised learning, as I mentioned yesterday; that's what's usually meant when you talk about machine learning and its successes. Supervised learning requires a data set where you know the ground truth: you know the inputs and the outputs, and you provide that to a machine learning algorithm in order to learn the mapping between the inputs and the outputs, in such a way that it generalizes to further examples in the future. Unsupervised learning is the other side, when you know absolutely nothing about the outputs, about the truth of the data you're working with. All you get is data, and you have to find underlying structure, an underlying representation of the data that's meaningful for accomplishing a certain task, whatever that is. There's semi-supervised learning, where only a part, usually a very small amount, is labeled; ground truth is available for just a small fraction of it. If you think of the images out there on the internet, and then you think about ImageNet, a data set where every image is labeled, the size of that data set is a tiny subset of all the images available online. But that's the task we're dealing with, as human beings and as people interested in doing machine learning: how to expand the part of our data that we know something confidently about. And reinforcement learning sits somewhere in between. It's semi-supervised learning where there's an agent
that has to exist in the world, and that agent knows the inputs that the world provides but knows very little about that world except through occasional, time-delayed rewards. This is what it's like to be human; this is what life is about. You don't know what's good and bad. You kind of have to just live it, and every once in a while you find out that all that stuff you did last week was a pretty bad idea. That's reinforcement learning. It's semi-supervised in the sense that only a small subset of the data comes with some ground truth, some certainty that you have to then extract knowledge from.

First, at the core of anything that currently works in a practical sense, there has to be some ground truth, some truth we can hold on to as we try to generalize, and that's supervised learning. Even in reinforcement learning, the only thing we can count on is the truth that comes in the form of a reward. So the standard supervised learning pipeline is: you have some raw data, the inputs; you have ground truth, the labels, the outputs matched to the inputs. Then you run some algorithm, whether that's a neural network or another pre-processing algorithm, that extracts features from that data set. Think of a picture of a face: that algorithm could extract the nose, the eyes, the corners of the eyes, the pupil, or even lower-level features in the image. After that, we insert those features into a model, a machine learning model, and we train that model. As we pass it through that training process, we evaluate: after we've seen this one particular example, how much better are we at the task? As we repeat this loop, the model learns to generalize better and better from the raw data to the labels we have. Finally, you get to release that model into the wild to actually do prediction on data it has never seen before, data you don't know about,
and the task there is to predict the labels. Okay, so neural networks are what this class is about. They're one of the machine learning algorithms that has proven to be very successful, and the computational building block of a neural network is a neuron. A perceptron is a type of neuron; it's the original, old-school neuron where the output is binary, a zero or one, not real-valued. The process a perceptron goes through: it has multiple inputs and a single output. Each of the inputs has a weight on it, shown here on the left as 7.6 and 1.4. Those weights are applied to the inputs. In a perceptron the inputs are ones or zeros, binary; the weights are applied and summed together, a bias on the neuron is added on top, and then there's a threshold test: whether that summed value plus the bias is below or above a threshold. If it's above the threshold, the perceptron produces a one; if it's below, it produces a zero. Simple. It's one of the only things we understand about neural networks confidently; we can prove a lot of things about this neuron. For example, what we know is that a neuron can approximate a NAND gate. A NAND gate is a logical operation, a logical function that takes two inputs, A and B, here on the diagram on the left, and the table shows what that function is: when the inputs are zero-zero, or a zero and a one in either order, the output is a one; otherwise it's a zero. The cool thing about a NAND gate is that it's a universal gate: any computer you have, the phone in your pocket today, can be built out of just NAND gates. It's functionally complete; you can build any logical function out of them if you stack them together in arbitrary ways. The problem with NAND gates and computers is that they're built from the bottom up; you have to design these circuits of NAND gates. So the cool thing here is that with a perceptron we can learn this magical NAND gate, we can learn this function. So let's go through
how we can do that, how a perceptron can perform the NAND operation. Here are the four examples. If we put weights of -2 on each of the inputs and a bias of 3 on the neuron, then we perform that same operation of summing the weights times the inputs, plus the bias. In the top left, when the inputs are zeros, the sum with the bias is 3; that's a positive number, which means the output of the perceptron will be a one. In the top right, when the inputs are a zero and a one, the sum is still a positive number, again producing a one, and so on. When the inputs are both ones, the result is -1, less than zero, so the output is a zero. While this is simple, it's really important to think about: it's the one basic computational truth you can hold on to as we talk about some of the magical things neural networks can do. Because if you compare a circuit of NAND gates and a circuit of neurons: while a circuit of neurons, which is what we think of as a neural network, can perform the same computation as the circuit of NAND gates, what it can also do is learn. It can learn the arbitrary logical functions that an arbitrary circuit of NAND gates can represent, but it doesn't require the human designer; it can evolve, if you will. Now, one of the key drawbacks of the perceptron is that it's not very smooth in its output. As we change the weights on the inputs and change the bias, tweaking it a little bit, it's very easy to make the neuron output a zero instead of a one, or a one instead of a zero. So when we start stacking many of these together, it's hard to control the output of the thing as a whole. The essential step that makes a neural network work, that a circuit of perceptrons doesn't, is that the output is made smooth, made continuous, with an activation function, instead of the step function a perceptron uses, shown there on the left.
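The NAND perceptron just described can be checked in a few lines. This is only an illustrative sketch; the function name is mine, not the lecture's.

```python
# A minimal sketch of the perceptron NAND example from the lecture:
# weights of -2 on each input and a bias of 3.
def perceptron_nand(a, b):
    """Perceptron with step activation: output 1 if the weighted
    sum plus the bias is positive, else 0."""
    weights = (-2, -2)
    bias = 3
    s = weights[0] * a + weights[1] * b + bias
    return 1 if s > 0 else 0

# Walks through the four cases from the slide; only (1, 1) yields 0.
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, perceptron_nand(a, b))
```

Running it reproduces the NAND truth table from the slide.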
A sigmoid, for example, is such a smooth function, where the output changes gradually as you change the weights and the bias; this is a basic but critical step. Learning is generally the process of adjusting those weights gradually and seeing what effect that has on the rest of the network: you just keep tweaking weights here and there and seeing how much closer you get to the ground truth, and if you get farther away, you adjust the weights in the opposite direction. That's neural networks in a nutshell. What we will mostly talk about today is feed-forward neural networks, on the left, going from inputs to outputs with no loops. There are also these amazing things called recurrent neural networks. They're amazing because they have memory, a memory of state; they remember the temporal dynamics of the data that went through them. The painful thing is that they're really hard to train. Today we'll talk about feed-forward neural networks. So let's look at an example of stacking a few of these neurons together. Think of the now-famous basic task of classifying handwritten digits: you have an image of a handwritten number, and your task, given that image, is to say what number is in it. Now, what is an image? An image is a collection of pixels, in this case 28 by 28 pixels, a total of 784 numbers, each from 0 to 255. On the left of the network, the size of that input, despite the diagram, is 784 neurons. That's the input. Then comes the hidden layer. It's called the hidden layer because it has no direct interaction with the input or the output; it's simply a block at the core of the computational power of neural networks, tasked with forming a representation of the data in such a way that it maps from the inputs to the outputs. In this case there are 15 neurons in the hidden layer, and there are 10 values on the output, corresponding to each of the digits, and there are several ways to build this kind of network.
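To make the structure concrete, here is a sketch of a forward pass through a 784-15-10 network of sigmoid neurons like the one on the slide. The weights are random, so the prediction is meaningless; all names are illustrative, not the course's code.

```python
import math
import random

# An illustrative forward pass through a 784-15-10 sigmoid network.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    # Each neuron: weighted sum of inputs, plus bias, through the sigmoid.
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

random.seed(0)
n_in, n_hidden, n_out = 784, 15, 10
w1 = [[random.gauss(0, 0.01) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [[random.gauss(0, 0.01) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [0.0] * n_out

pixels = [0.0] * n_in                 # a blank 28x28 image, flattened
hidden = layer(pixels, w1, b1)        # 15 hidden activations
output = layer(hidden, w2, b2)        # 10 output confidences, one per digit
prediction = max(range(10), key=lambda i: output[i])  # most confident digit
```

Training, described below and tomorrow, is then the process of nudging `w1`, `b1`, `w2`, `b2` so that the correct output neuron moves toward one and the others toward zero.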
You can build it in a lot of ways; that's the magic of neural networks. You only really need four outputs to represent the values 0 through 9, but in practice it seems that having 10 outputs works better. And how do these work? Whenever the input is a five, the output neuron in charge of the five gets really excited and outputs a value close to one (the outputs range from 0 to 1), and the other ones hopefully output values close to zero. When they don't, we adjust the weights in such a way that each output gets closer to zero or closer to one, depending on whether it's the correct neuron associated with the picture. We'll talk about the details of this training process more tomorrow, when it's more relevant, but what we've discussed just now is the forward pass through the network: the pass where you take the inputs, apply the weights, sum them together, add the bias, produce the output, and check which of the outputs gives the highest confidence for the number. Then, once those probabilities for each of the numbers are provided, we determine the gradient that's used to punish or reward the weights that resulted in either the correct or the incorrect decision, and that's called backpropagation: we step backwards through the network applying those punishments or rewards. Because of the smoothness of the activation functions, that is a mathematically efficient operation; that's where the GPUs step in. So for our example of numbers, the ground truth for the digit six looks like the following in the slides: y(x) equals a 10-dimensional vector where only one of the entries, the sixth value, is a one; the rest are zero. That's the ground truth that comes with the image. The loss function here, the basic loss function, is the squared error: y(x) is the ground truth and a is the output of the neural network resulting from the forward pass. So when you input that image of a six, it outputs whatever it
outputs, a 10-dimensional vector, and the difference is summed over the outputs to produce the squared error. That's our loss function, the objective function; it's what's used to determine how much to reward or punish the weights as errors are backpropagated through the network. The basic operation of optimizing that loss function, of minimizing that loss function, is done with various variants of gradient descent. It's hopefully a somewhat smooth function, but it's a highly nonlinear one. This is why we can't prove much about neural networks: it's a high-dimensional, highly nonlinear function that's hopefully smooth enough that gradient descent can find its way to at least a good solution. And there has to be some stochastic element that jumps around to ensure it doesn't get stuck in a local minimum of this very complex function. Okay, that's supervised learning. There are inputs, there are outputs, there's ground truth. That's our comfort zone, because we're pretty confident we know what's going on. All you have to do is take this data set, train a network on it, and evaluate it; you can write a paper and try to beat a previous paper. It's great. The problem is when you then use that neural network to create an intelligent system that you put out there in the world, and now that system is no longer working with your data set. It has to exist in a world that may be very different from the ground truth. So the takeaway from supervised learning is that neural networks are great at memorization, but, in a sort of philosophical way, they might not be great at generalizing, at reasoning beyond the specific flavor of data set they were trained on. The hope for reinforcement learning is that we can extend the knowledge we gain in a supervised way to the huge world outside, where we don't have the ground truth of how to act, of how good or bad a certain state is.
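The squared-error loss just described, for the one-hot ground truth of a six, can be sketched as follows. The 1/2 factor is a common convention assumed here, and the names are illustrative.

```python
# A sketch of the squared-error loss between the one-hot ground truth y
# and the network output a, summed over the 10 output neurons.
def squared_error(y, a):
    return 0.5 * sum((yi - ai) ** 2 for yi, ai in zip(y, a))

y = [0.0] * 10
y[6] = 1.0                    # one-hot ground truth for the digit six
a = [0.1] * 10
a[6] = 0.8                    # a hypothetical network output
loss = squared_error(y, a)    # nonzero: the network is not perfect yet
```

Gradient descent then nudges the weights in whatever direction shrinks this number.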
Extending supervised knowledge this way is a kind of brute-force reasoning, and I'll talk about what I mean there, but it feels closer to reasoning, as opposed to memorization. That's a good way to think of supervised learning: memorization. You're just studying for an exam, and as many of you know, that doesn't mean you're going to be successful in life just because you get an A. So a reinforcement learning agent, or any agent really, a human being or any machine existing in this world, can operate in the following way. From the perspective of the agent, it can execute an action, it can receive an observation resulting from that action in the form of a new state, and it can receive a reward or a punishment. You could break down our entire existence in this way; it's a simplistic view, but a convenient one on the computational side. From the environment's side, the environment receives the action and emits the observation: your action changes the world, so the world has to change, tell you about it, and give you a reward or a punishment for it. So let's look at one of the most fascinating things, and I'll try to convey why it's fascinating a little later on: the work of DeepMind on Atari. This is Atari Breakout, a game where a paddle has to move around. That's the world the agent exists in: the agent is a paddle, there's a bouncing ball, and your actions are move right and move left; you're trying to move in such a way that the ball doesn't get past you. Here is human-level performance by that agent. So what does this paddle have to do? It has to operate in this environment. It has to act: move left, move right. Each action changes the state of the world. This may seem obvious, but moving right changes, visually, the state of the world; in fact, what we're watching now on the slides is the world changing before your eyes for this little guy. And it gets rewards or punishments:
rewards it gets in the form of points, racking up in the top left of the video, and when the ball gets past the paddle it gets punished by "dying," quote unquote; that's the number of lives it has left, going from five to four to three, down to zero. So the goal is to select, at any one moment, the action that maximizes future reward, without any knowledge of what a reward is in the greater sense of the word. All you have is an instantaneous reward or punishment, the instantaneous response of the world to your actions. This can be modeled as a Markov decision process. A Markov decision process is a mathematically convenient construct; it has no memory. You have a state that you're currently in, you perform an action, you get a reward, and you find yourself in a new state, and that repeats over and over: you start from state zero, you go to state one, you once again perform an action, get a reward, and go to the next state. That's the formulation we're operating in. When you're in a certain state, you have no memory of what happened two states ago; everything operates instantaneously. So what are the major components of a reinforcement learning agent? There's a policy: the function, broadly defined, of the agent's behavior, which includes the knowledge, for any given state, of which action I will take with what probability. There's a value function: how good each state, and each action in a particular state, is. And there's a model. Now, this is a subtle thing, and it's actually the biggest problem with everything you'll see today: the model is how we represent the environment. What you'll see today are amazing things that neural networks can achieve on a relatively simplistic model of the world, and the question is whether that can extend to the real world, where human lives are at stake in the case of driving. So let's look at a simplistic world: a robot in a
room. You start at the bottom left, your goal is to get to the top right, and your possible actions are going up, down, left, and right. Now, this world could be deterministic, meaning when you choose to go up, you actually go up, or it could be non-deterministic, as human life is: when you go up, sometimes you go right. In this case, if you choose to go up, you move up 80% of the time, you move left 10% of the time, and you move right 10% of the time. When you get to the top right, you get a reward of +1; when you get to the block just below it, you get -1, you get punished; and every time you take a step, you get a slight punishment of -0.04. So the question is: if you start at the bottom left, is this a good solution, a good policy by which to exist in this world? It is if the world is deterministic, if whenever you choose to go up you go up, and when you choose to go right you go right. But if the actions are stochastic, as I described previously, with probability 0.8 of going up and 0.1 each of going left and right, then this other one is the optimal policy. Now, if we punish every single step with -2 instead of -0.04, so that every step hurts, you're going to try to reach a positive block as quickly as possible, and that's what this policy says: I'll walk through a -1 if I have to, as long as I stop getting the -2. If the punishment for each step is milder, say -0.1, you might choose to go around that -1 block, a slight detour to avoid the pain, and you might take an even longer detour as the per-step punishment shrinks further. And if there's an actual positive reward for every step you take, then you'll avoid the finish line entirely; you'll just wander the world. We saw that with the Coast Runners example yesterday, the boat that chose not to finish the race because it was having too much fun collecting points in the middle. So let's look at the world this agent is operating in.
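The stochastic action model of that grid world (80% intended move, 10% slip to each perpendicular side) can be sketched like this; the coordinate convention and all names are mine, invented for illustration.

```python
import random

# A minimal sketch of the noisy action model described above.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def noisy_step(pos, action):
    """Apply the chosen action with probability 0.8, otherwise slip to
    one of the two perpendicular directions with probability 0.1 each."""
    r = random.random()
    if r < 0.8:
        move = action
    elif r < 0.9:
        move = PERPENDICULAR[action][0]
    else:
        move = PERPENDICULAR[action][1]
    dx, dy = MOVES[move]
    return (pos[0] + dx, pos[1] + dy)
```

This is exactly why a policy that looks optimal in the deterministic world can be a bad one here: "go up" only means "probably go up."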
There's a value function, and that value function depends on reward, reward that comes in the future, and that reward is discounted. Because the world is stochastic, we can't expect the reward to come to us exactly as we hope, based on the policy, based on the way we choose to act. So there's a gamma there: as the reward moves farther and farther into the future, the discount diminishes the impact of that future reward on your evaluation of the current state. Your goal is to develop a strategy that maximizes the discounted future reward, this discounted sum. In reinforcement learning there are a lot of approaches for coming up with a good policy, a near-optimal or optimal policy, and there's a lot of fun math there. You can try to construct a model that optimizes some estimate of this world; you can, in a Monte Carlo way, simply simulate the world and, as it unrolls, try to compute the optimal policy. What we'll talk about today is Q-learning. It's an off-policy approach where the policy is estimated as we go along, represented as a Q function. The Q function, shown there on the left (I apologize for the equations; I lied, there will be some equations), takes as input a state at time t, s_t, and an action you choose to take in that state, a_t, and your goal, in that state, is to choose the action which maximizes the reward going forward. What Q-learning does, and I'll describe the process, is approximate through experience the optimal Q function, the function that tells you how to act in any state of the world. You just have to live it: you have to simulate this world, move about it, explore, in order to see every possible state, try every different action, get rewarded, get punished, and figure out what the optimal thing to do is. That's done using the Bellman equation on the left.
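The discounted future reward mentioned above is the sum r_0 + gamma*r_1 + gamma^2*r_2 + ...; a small sketch, with an assumed gamma of 0.9 and an illustrative function name:

```python
# Discounted return: rewards further in the future count for less.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# The same +1 reward, pushed further into the future, is worth less now:
near = discounted_return([0, 0, 1])       # reward arrives after 2 steps
far = discounted_return([0, 0, 0, 0, 1])  # reward arrives after 4 steps
```

Here `near` is 0.9^2 = 0.81 while `far` is only 0.9^4 = 0.6561, which is exactly the gamma effect described in the lecture.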
The output of that update is the new Q-function estimate of the state-action pair, and this is the update rule at the core of Q-learning: you take the old estimate and, weighted by the learning rate alpha, from 0 to 1, update your evaluation of that state based on the new reward you received. So you've arrived in a certain state s_t, you try an action, you get a certain reward, and you update your estimate of that state-action pair based on this rule. When the learning rate alpha is zero, you don't learn; you never change your world view based on the new incoming evidence. When alpha is one, you replace your evaluation, your world evaluation, with the new evidence every time. And here's the key ingredient of reinforcement learning: first you explore, then you exploit. First you explore in a non-greedy way, and then you get greedy: you figure out what's good for you and you keep doing it. So if you want to learn an Atari game, first you try every single action in every state, you screw up, get punished, get rewarded, and eventually you figure out what's actually the right thing to do, and you just keep doing it. That's how you win against the greatest human players in the world in a game of Go, for example, as we'll talk about. The way you do that is with an epsilon-greedy policy: with probability 1 - epsilon you perform the optimal, greedy action, and with probability epsilon you perform a random action, a random action being exploration. As epsilon goes down from one to zero over time, you explore less and less. So the algorithm here is really simple; on the bottom of the slide is the pseudocode version of the Bellman equation update. You initialize your estimate of every state-action pair arbitrarily, to a random number. Now, this is an important point, whether you're playing a game, or living, or driving.
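The loop just described (initialize Q arbitrarily, act epsilon-greedily, update with the Bellman rule) can be sketched on a toy problem. The five-state corridor environment and all names here are invented for illustration, not taken from the lecture.

```python
import random

# Tabular Q-learning on an invented 5-state corridor: start in state 0,
# reward +1 for reaching state 4, a small -0.04 cost for every step.
N_STATES, ACTIONS = 5, ("left", "right")

def step(s, a):
    s2 = max(0, s - 1) if a == "left" else min(N_STATES - 1, s + 1)
    r = 1.0 if s2 == N_STATES - 1 else -0.04
    return s2, r

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.3
random.seed(1)

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy: explore with probability epsilon, else exploit.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r = step(s, a)
        # Bellman update: move the old estimate toward the new evidence.
        target = r + gamma * max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

# After training, the greedy policy heads right, toward the goal.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
```

A fixed epsilon is used here for brevity; as the lecture notes, in practice you would decay epsilon (and alpha) over time.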
With reinforcement learning, or driving, you have no preconceived notion of what's good and bad; it's random, or however you choose to initialize it, and the fact that it learns anything is amazing. I want you to remember that: it's one of the amazing things about Q-learning at all, and then about the deep neural network version of Q-learning. The algorithm repeats the following steps. You observe an initial state. You select an action a: if you're exploring, that will be a random action; if you're being greedy, it will be the action that maximizes the Q function. You observe the reward after you take the action, and the new state you find yourself in, and then you update your estimate of the previous state, given that you've taken that action, using the Bellman equation update, and you repeat this over and over. There on the bottom of the slide is a summary of life. (Question from the audience.) Yes, the question was: is the Q function a single value? Yes, it's just a single continuous value. (Question.) The question was: how do you model the world? Let's start with this very simplistic world of the Atari paddle. You could model it as a paddle that can move left and right, with some blocks, and model the physics of the ball; that requires a lot of expert knowledge of that particular game. You sit there handcrafting the model, and that's hard to do even for a simplistic game. The other approach is to look at this world the way humans do, visually: take the world in as a set of pixels. The model is just all the pixels of the world; you know nothing about paddles or balls or physics or colors or points, they're just pixels coming in. That seems like a ridiculous model of the world, but it seems to work for Atari, and it seems to work for human beings. When you're born, you see there's light coming into your eyes, and, as far
as we know, you don't come with an instruction manual when you're born. You don't know there are people in the world, that there are good guys and bad guys, or how to walk. No, all you get is light, sound, and the other senses, and you get to learn about everything; every single thing you think of as the way you model the world is a learned representation. We'll talk about how a neural network does that, how it learns to represent the world. But if we have to hand-model the world, it's an impossible task; if we have to hand-model the world, then that world had better be a simplistic one. (Question.) Yeah, that's a great question. The question was: what is the robustness of this model, if the way you represent the world is at all, even slightly, different from the way you thought that world is? That's not that well studied, as far as I'm aware. I mean, it's already amazing that, given a certain input, given a certain model of the world, you can learn anything at all. The related question, and it's an important one, and we'll talk a little bit about it, is not about the world model but about the reward function: if the real reward function, of life, or of driving, or of Coast Runners, is slightly different from what you expected it to be, what's the negative there? It could be huge. (Question.) The question was: do you change the alpha value over time? Yes, you can, and you certainly should change the alpha value over time. (Question.) The question was: what is the complex interplay of the epsilon schedule with the Q-learning update? That's 100% fine-tuned, hand-tuned, to the particular learning problem. The larger the number of states in the world and the larger the number of actions, the longer you have to wait before you
decrease the epsilon to zero, but you have to play with it. It's one of the parameters you have to play with, unfortunately, and there are quite a few of them, which is why you can't just drop a reinforcement learning agent into the world. (Question.) Oh, the effect in that sense? No, it's just a coin flip: if epsilon is 0.5, half the time you take a random action. There's nothing more specific; it's not like you take the best action and then with some probability the second best, and so on. You can certainly do that, but in the simple formulation that works, you just take a random action, because you don't want a preconceived notion of what's a good action to try; when you're exploring, the whole point is that you try crazy stuff, if it's a simulation. Okay, so, good question: representation matters. This is the question of how we represent the world. We can think of this world of Breakout, for example, of this Atari game, as a paddle that moves left and right plus the exact positions of the different things it can hit, and construct a complex, expert-driven model that has to be fine-tuned to this particular problem. But in practice, the more complex this model gets, the worse that Bellman equation update fares: constructing a Q function for every single combination of state and action becomes too difficult, because that function is too sparse and huge. Or you can look at this world in a general way, the way human beings do, visually: as a collection of pixels. This game is a collection of 84 by 84 pixels, an RGB image, and you look at not just the current image but the temporal trajectory of those images; if there's a ball moving, you want to know about that movement, so you look at four images, the current image and three images back. Say they're grayscale with 256 gray levels; the size of the Q table that the Q value function has to learn is
whatever that number is, but it's certainly larger than the number of atoms in the universe. That's a large number, and you would have to run the simulation long enough to touch, at least a few times, most of the states in that Q table. As Elon Musk says, we may live in a simulation; you might have to run another universe just to compute the Q function in this case. So that's where deep learning steps in: instead of modeling the world as a Q table, you try to learn that function. The takeaway from supervised learning, if you remember, is that we're good at memorizing data. The hope for reinforcement learning with Q-learning is that we can extend the occasional rewards we get to generalize over the actions we take in the world leading up to those rewards. And the hope for deep learning is that we can move this reinforcement learning system into a world that can be defined arbitrarily: it can include all the pixels of an Atari game, or all the pixels sensed by a drone or a robot or a car, but it still needs a formalized definition of that world, which is much easier to provide when you're able to take in sensors like an image. So, deep Q-learning, the deep version: instead of learning a Q table, we try to learn the Q function itself using machine learning, learning the parameters of this huge, complex function. The way we do that is with a neural network, the same kind I showed that learned to map from an image of a digit to a classification of that image; the same kind of network is used to take in a state and an action and produce a Q value. Now here's the amazing thing: just as the Q table was initialized randomly, the Q function through this deep network knows nothing in the beginning; all it knows is what it can observe in the simulated world.
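A back-of-the-envelope check of the Q-table claim above: four stacked 84-by-84 grayscale frames with 256 levels per pixel give 256^(84*84*4) possible states, a number with tens of thousands of decimal digits, dwarfing the commonly cited rough estimate of 10^80 atoms in the observable universe.

```python
import math

# Count the decimal digits of 256 ** (84 * 84 * 4) without ever
# materializing the (enormous) integer itself.
n_pixels = 84 * 84 * 4                               # 28,224 pixels
digits = math.floor(n_pixels * math.log10(256)) + 1  # digits of 256**n_pixels
# 10**80 atoms is an 81-digit number; the state count is vastly larger.
assert digits > 81
```

So a tabular Q function over raw pixels is hopeless, which is exactly the motivation for approximating Q with a network instead.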
world, the rewards you get for a particular game. You have to play time and time again and see the rewards you get for every single iteration of the game. In the beginning it knows nothing, and yet it's able to learn to play better than human beings. This is the DeepMind paper, "Playing Atari with Deep Reinforcement Learning," from 2013, one of the key things that got everybody excited about deep learning and artificial intelligence. Using a convolutional network, which I'll talk about tomorrow, but it's a vanilla network like any other I talked about earlier today, just a regular network that takes in the raw pixels and estimates the Q-function from those pixels, it's able to play many of those games better than a human being. And the loss function is the one I mentioned previously: a very vanilla loss function, a very simple objective function, probably the first one you'll implement (we have a tutorial in TensorFlow): squared error. We take the Bellman equation, where the Q-function estimate for a state and action is the immediate reward plus the maximum value you can get by taking any of the actions that lead to the future states. You take an action, observe the result of that action, and if the target is different from what the function has learned, the expected reward, you adjust the weights of the network. This is exactly the process by which the agent learns how to exist in this pixel world: you're mapping states and actions to a Q-value. The algorithm is as follows; this is how we train it. We're given a transition: s, the current state; a, the action taken in that state; r, the reward you get; and s′, the state you find yourself in afterward. We replace the basic update rule in the previous pseudocode by taking a forward pass through the network given that state s, and we look at what the predicted Q-value is
of that action. We then do another forward pass through the network for the next state and see what we actually get, and if we're totally off, we backpropagate, adjusting the weights in a way that next time will make less of that mistake. You repeat this process; it's a simulation, and you're learning against yourself. The same rule applies here: exploration versus exploitation. You start out with an epsilon of one, mostly exploring, and then you move towards an epsilon of zero. With Atari Breakout, this is the DeepMind paper result: training epochs on the x-axis, and on the y-axis the average action value and the average reward per episode. I'll show why it's kind of an amazing result, but it's messy, because there are a lot of tricks involved. It's not just putting in a bunch of pixels of a game and getting out an agent that knows how to win at that game; there's a lot of pre-processing and playing with the data required, which is unfortunate, because the truth is messier than the hope. One of the critical tricks needed is called experience replay. You're learning this big network that tries to build a model of what's good to do in the world and what's not, and you're learning as you go. With experience replay, you keep track of all the things you did, and every once in a while you look back into your memory, pull out some of those old experiences, the good old times, and train on those again, as opposed to letting the agent run itself into some local optimum where it tries to learn a very subtle aspect of the game that, in the global sense, doesn't actually get you much closer to winning; very much like life. So here's the deep Q-learning algorithm pseudocode. We initialize the replay memory; again, this is the little trick that's required, keeping track of stuff that's happened in the past. We initialize the action
value function Q with random weights and observe the initial state. Same as before: with probability epsilon, select a random action to explore; otherwise choose the best action based on the estimate provided by the neural network. Then carry out the action, observe the reward, and store that experience in the replay memory. Then sample random transitions from the replay memory, so with a certain probability you bring those old times back to get yourself out of the local minima, and train the Q-network using the difference between what you actually got and your estimate. You repeat this process over and over. Here's what you can do after ten minutes of training, on the left; that's very little training. What you get is a paddle that learns hardly anything, and it just keeps dying. If you look at it, it goes from five to four to three to two to one; those are the number of lives left. Then, after two hours of training on a single GPU, it learns to win, you know, not die, rack up points, and keep the ball from getting past the paddle, which is great. That's human-level performance, really better than some humans, but it still dies sometimes, so it's very human-level. Then after four hours it does something really amazing: it figures out how to win at the game in a very lazy way, which is to drill a hole through the blocks up to the top and get the ball stuck up there, where it does all the hard work for you. That minimizes the probability of the ball getting past your paddle, because it's just stuck in the blocks up top. That might be something you wouldn't even figure out to do yourself. And I need to pause here to clearly explain what's happening: the input to this algorithm is just the pixels of the game. It's the same thing that human beings take in through visual perception, and under this constrained definition of what is a reward and a punishment, it's
able to learn to get a high reward. That's general artificial intelligence, a very small example of it, but it's general-purpose: it knows nothing about games, nothing about paddles or physics; it's just taking in sensory input from the game. They did the same thing for a bunch of different games in Atari, and what's shown here in this plot is a bunch of different Atari games on the x-axis, and on the y-axis a percentile where 100% is about the best that human beings can do, meaning the score human beings would get. Everything to the left of that line far exceeds human-level performance, and everything below it is on par with or worse than human-level performance. So it can learn Boxing, pinball, all of these games, and it doesn't know anything about any of the individual games; it's just taking in pixels, just as if you put a human being behind any of these games and asked them to learn to beat the game. And there have been a lot of improvements on this algorithm recently. Yes, question? No, the question was: do they customize the model for a particular game? And no. You could, of course, but the point is that it doesn't need to be customized for the game. But the important thing is that it's still only on Atari games, right? So the question is whether this is transferable to driving; perhaps not. Right, you play one step of the game: you take an action in a state and then you observe the result. It's simulation; that's really one of the biggest problems here: you require the simulation in order to get the ground truth. Yes? So that's a great question, or a comment. The comment was that for a lot of these situations the reward function might not change at all depending on your actions; the rewards are really, most of the time, delayed 10, 20, 30 steps
down the line, which is why it's amazing that this works at all: it's learning locally, and through that process of simulation, hundreds of thousands of runs through the game, it's able to learn what to do now such that it gets a reward later. If you just pause and look at the math of it, it's very simple math; then look at the result, and it's incredible. So there are a lot of improvements. This one is called the General Reinforcement Learning Architecture, or Gorila. The cool thing about this, in the simulated world at least, is that you can run deep reinforcement learning in a distributed way: you can do the simulation in a distributed way, and you can do the learning in a distributed way. You can generate experiences, which is what this diagram shows, either from human beings or from simulation. For example, the way the DeepMind team beat the game of Go with AlphaGo is that the system learned from both expert games and from playing against itself. So you can distribute both the experience generation and the learning, which means you can scale, and in this particular case Gorila achieved a better result than the DQN network that's part of their Nature paper. Okay, so let me now get to driving for a second: where can reinforcement learning step in and help? This is back to the open question I asked yesterday: is driving closer to chess or to everyday conversation? Chess meaning it can be formalized in a simplistic way, and we can think about it as an obstacle-avoidance problem: once obstacle avoidance is solved, you just navigate that constrained space; you choose to move left, to move right in a lane, to speed up or slow down. Well, if it's a game like chess, which we'll assume for today as opposed to tomorrow, we're going to go with the one on the left and look at DeepTraffic. Here's this game, a simulation where the
goal is to achieve the highest average speed you can on this seven-lane highway full of cars. As a side note, for registered students there's a requirement: they have to follow the tutorial that I'll link at the end of this presentation, and they have to build a network that achieves a speed of 65 miles per hour or higher. There is a leaderboard, and you get to submit the model you come up with with a simple click of a button. All of this runs in the browser, which is another amazing thing, and then you immediately, or relatively soon, make your way up the leaderboard. So let's zoom in: what does this two-dimensional world of traffic look like to the intelligent system? We discretize that world into a grid, shown here on the left; that's the representation of the state. There are seven lanes, and every single lane is broken up into blocks spatially. The length of a car is about three of those grid blocks, and if there's a car in a block, that block is seen as occupied. The red car is you; that's the intelligent agent that's running. On the left is the current speed of the red car (it actually says MIT on top), and you also have a count of how many cars you've passed; if your network sucks, that number is going to be negative. You can also change the simulation speed with a dropdown, from normal on the left to fast on the right; fast speeds up the replay of the simulation, while normal, the one on the left, feels a little more like real driving. There's a dropdown for different display options; the default is none, in terms of what's shown on the road. Then there is the learning input: while the whole space is discretized, you can choose what your car sees, how far ahead and behind it sees, and how far to the left and right. By choosing to visualize the learning input
you get to see what you set that input to be. Then there is the safety system. This is a system that protects you from yourself. The way we've made this game, it operates on something similar to the adaptive cruise control you may have in your car: when it gets close to the car in front, it slows down for you, and it doesn't let you run the car to your left or your right off the road. It constrains the movement of your car in such a way that you don't hit anybody, because otherwise we would have to simulate collisions and it would just be a mess; so it protects you from that. You can choose to visualize that quote-unquote safety system with the visualization box. You can also choose to visualize the full map. This is the full occupancy map that you can provide as input to the network. The input for every single grid cell is a number, and it's not just a 0 or 1 for whether there's a car there. The speed limit is 80 miles per hour (don't get crazy): a block that's empty is set to that 80-mile-per-hour speed limit, and when it's occupied, it's set to the speed of the car occupying it. The blocks that you, the red car, are occupying are set to a very large number, much higher than the speed limit. The safety system, shown here in red, marks the parts of the grid that your car can't move into. Question? What's that? Yes, the question was about the third option I just mentioned, and it's you, the red car itself: the blocks underneath that car are set to a really high number. It's the way for the learning algorithm to know that these blocks are special. So the safety system shows red here if the car can't move into those blocks. When it lights up red, it means the car can't speed up because of the car in
front of it, and when the blocks to the left or to the right light up red, that means you can't change lanes to the left or right. On the right of the slide, you're free to go, free to do whatever you want; that's what all-yellow blocks indicate: the safety system says you're free to choose any of the five actions. The five actions are: move left, move right, stay in place, accelerate, or slow down. That action is what's produced by what's called here the brain. The brain takes in the current state as input along with the last reward, uses that reward to train the network through the backward function (that's backpropagation), and then you ask the brain, given the current state, for the next action via a forward pass, the forward function. You don't need to know the operation of these functions in particular; this is not something you need to worry about, but you can customize this learning step if you want. What I'm describing now is just a few lines of code right there in the browser that you can change, and with the press of a button it changes the simulation or the design of the network. You don't need any special hardware, you don't need to do anything special, and the tutorial cleanly outlines all of these steps. It's kind of amazing that you can design a deep neural network that's part of a reinforcement learning agent, a deep Q-learning agent, right there in the browser. You can choose the lanesSide variable, which controls how many lanes to the side you see: when that value is zero, you only look at your own lane; when it's one, you see one lane to the left and one to the right. It's really the radius of your perception system. patchesAhead is how far ahead you look; patchesBehind is how far behind you look. For example, here lanesSide equals two; that means it looks two to
the left and two to the right. Obviously, if two to the right is off-road, it provides a value of zero in those blocks. If we set patchesBehind to 10, it looks 10 patches back, where one patch back starts from the front of the car. The scoring for the evaluation, for the competition, is your average speed over a predefined period of time. The method we use to collect that speed is: we run the agent for 10 runs of about 30 simulated minutes of game each, and take the median speed of the 10 runs; that's the score. This is done server-side, and given that this code has recently gotten some publicity online, unfortunately this might be a dangerous thing to say, but no cheating is possible, because the scoring is done server-side. This is JavaScript running in the browser, so it's hopefully sandboxed in a way that you can't do anything tricky, but we dare you to try. You can try it locally to get an estimate: there's a button that says "Start Evaluation Run"; you press it, it shows a progress bar, and it gives you back the average speed of how well you're doing with the current network. There's a code box where you modify all the variables I mentioned, and the tutorial describes this in detail. Once you're ready, once you've modified a few things, you can press "Apply Code"; it restarts, killing all the training you've done up to this point, and starts training again, so save often; there's a save button. The training is done on a separate thread in Web Workers, which are exciting things that allow JavaScript to run, amazingly, on multiple CPU cores in a parallel way. So the training is done a lot faster than real time, a thousand frames a second, a thousand movement steps a second, all in JavaScript, and the trained network gets shipped to the
main simulation from time to time as the training goes on.
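The squared-error Bellman update walked through earlier can be sketched in code. This is not the lecture's TensorFlow tutorial code: it uses a simple linear Q-function approximator in place of a deep network, and all the names and constants (GAMMA, ALPHA, the feature vectors) are illustrative assumptions, not values from the paper.

```python
GAMMA = 0.9   # discount factor (illustrative choice)
ALPHA = 0.01  # learning rate (illustrative choice)

def q_value(weights, features):
    """Forward pass: Q(s, a) as a dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))

def dqn_update(weights, feats_sa, reward, next_feats_per_action, done):
    """One training step on transition (s, a, r, s'):
    target = r + gamma * max_a' Q(s', a'), then nudge the weights
    to shrink the squared error (target - Q(s, a))^2."""
    prediction = q_value(weights, feats_sa)
    if done:
        target = reward
    else:
        target = reward + GAMMA * max(
            q_value(weights, f) for f in next_feats_per_action)
    error = target - prediction
    # Gradient of 0.5 * error^2 w.r.t. each weight is -error * feature,
    # so we step each weight in the direction that reduces the error.
    return [w + ALPHA * error * f for w, f in zip(weights, feats_sa)]
```

In the real DQN the same target-and-adjust loop runs, but the forward pass goes through a convolutional network over raw pixels instead of a dot product.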
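The experience-replay trick described earlier, storing past (s, a, r, s') transitions and training on random minibatches of old experience instead of only the most recent step, can be sketched as follows; the capacity and batch size are illustrative defaults, not DeepMind's settings.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        # Bounded memory: the oldest experiences fall off the end.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between
        # consecutive steps and revisits the "good old times".
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```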
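The epsilon-greedy selection step in the pseudocode above, explore with probability epsilon, otherwise pick the action with the highest estimated Q-value, is a few lines; `q_estimates` here stands in for the network's forward-pass output, assumed to be a plain list of one value per action.

```python
import random

def select_action(q_estimates, epsilon):
    """With probability epsilon explore (random action index);
    otherwise exploit (index of the largest Q estimate)."""
    if random.random() < epsilon:
        return random.randrange(len(q_estimates))  # explore
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])  # exploit
```

Training typically starts with epsilon near one (mostly exploring) and anneals it toward zero, as described above.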
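The DeepTraffic occupancy-map encoding described above (empty cells carry the 80 mph speed limit, occupied cells the occupying car's speed, the red car's own cells a number much larger than the speed limit) can be sketched like this; DeepTraffic itself is JavaScript, and the exact SELF_MARKER value and function signature here are illustrative assumptions.

```python
SPEED_LIMIT = 80    # empty cells carry the speed limit, per the talk
SELF_MARKER = 1000  # stand-in for the "very large number" on the red car's cells

def encode_grid(lanes, patches, cars, self_cells):
    """cars: dict mapping (lane, patch) -> occupying car's speed.
    self_cells: set of (lane, patch) cells the agent's own car covers."""
    grid = [[SPEED_LIMIT] * patches for _ in range(lanes)]
    for (lane, patch), speed in cars.items():
        grid[lane][patch] = speed
    for lane, patch in self_cells:
        grid[lane][patch] = SELF_MARKER  # flags these cells as "special"
    return grid
```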
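The perception-window variables described above (lanesSide, patchesAhead, patchesBehind, which are JavaScript variables in the DeepTraffic code box) together determine how many grid cells the network sees. Assuming one input value per visible cell, the window size works out as:

```python
def input_window_size(lanes_side, patches_ahead, patches_behind):
    """Number of grid cells the agent perceives, assuming one value per cell."""
    lanes_seen = 2 * lanes_side + 1            # left radius + own lane + right radius
    patches_seen = patches_ahead + patches_behind
    return lanes_seen * patches_seen
```

So with lanesSide = 2, patchesAhead = 30, patchesBehind = 10, the network would take in 5 × 40 = 200 values.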
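The server-side scoring rule described above (run the agent 10 times for about 30 simulated minutes each, score the median of the per-run average speeds) reduces to a median; the speeds in the test are made-up numbers, not real leaderboard results.

```python
import statistics

def competition_score(run_average_speeds):
    """Median of the average speeds across evaluation runs."""
    return statistics.median(run_average_speeds)
```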