Good morning everyone. I'm going to talk about some of the core methods in deep reinforcement learning. The aim of this talk is as follows: first I'll give a brief introduction to what deep RL is and whether it might make sense to apply it to your problem. Then I'll cover some of the core techniques: on the one hand we have policy gradient methods, and on the other hand we have methods that learn a Q-function, including Q-learning and SARSA. At the end I'll talk a little about the pros and cons of these different methods.

First, what is reinforcement learning? It's a branch of machine learning concerned with taking sequences of actions. It's often described in terms of an agent interacting with a previously unknown environment, trying to maximize some kind of cumulative reward: a reward function that we've defined, accumulated over time. Pretty much any task where you have some goal you want to achieve can be stated in these terms, so this is an extremely general formulation. What's deep reinforcement learning? It's pretty simple: it's just reinforcement learning where you use neural networks as function approximators. The interesting thing about reinforcement learning, in contrast to supervised learning, is that it's actually not obvious what you should use your neural network to approximate, and different kinds of algorithms approximate different things. One choice is to use the neural network to approximate your policy, which is how the agent chooses its actions. Another choice is to approximate the value functions, which measure how good or bad different states or actions are. Lastly, you can try to learn a model of the system, a dynamics model, which makes predictions about next states and rewards.

I'll now give a few examples of different places where you might apply reinforcement learning, and what the observations and actions would be. One example is robotics. Here you could imagine a robot where the observations are the camera images and the joint angles, and the actions are the joint torques you're applying. The reward depends on what you want the robot to do; it's something we, as the algorithm designers, get to define. The reward could be to stay balanced, to navigate to some target location, or something more abstract like "serve and protect humans."

Reinforcement learning has also been used in a lot of more practical applications, or at least applications that have been practical in the past; I think robotics will be very practical in the future. One such area is inventory management, which is one example of using reinforcement learning for a decision-making problem. You have to decide how much of every item to stock up on: your observations are your current inventory levels, your actions are how much of each item you're going to purchase, and your reward is your profit. People in operations research, the subfield that studies this kind of problem, have looked at it a lot.

There are also a lot of machine learning problems where people have started to apply reinforcement learning techniques. One example is attention. The idea in attention is that you don't want to look at the whole input at once; you want to focus on just part of it. For example, with a large image you might want to crop out part of it and do detection on just that part. Here your observation is your current image window, your action is where to look, that is, where to crop the image, and your reward is whether you make a classification error or not. So the actions have to choose the right area of the image to look at, so that you make the correct classification.

Reinforcement learning has also been used in structured prediction problems, which in the past weren't considered reinforcement learning problems, but it turns out that to solve them properly you actually do face a reinforcement learning problem. Take machine translation: you get a sentence in the source language and you have to emit a sentence in the target language. Your observations are the source sentence, you emit one word at a time in the target language, and you have some reward function that looks at the whole output sentence and tells you how good your translation was. Since this score is non-differentiable, you can't just differentiate through the whole thing and do gradient descent; it turns out you can use a policy gradient method to optimize your translation system, and people have started to do that.

Okay, those are just a few examples, not exhaustive at all. Next I want to say a little about how reinforcement learning fits into the picture of all the other types of machine learning problems. Previous courses in this series have talked about supervised learning and unsupervised learning, so how does reinforcement learning relate to them, and how is it different? Let's first compare it to supervised learning. In supervised learning, the environment samples an input-output pair from some distribution rho. The agent makes a prediction y-hat using its function f, and it receives some loss, which tells it whether it made the right prediction or the wrong prediction.
The interpretation is that the environment asks the agent a question and then tells it the right answer.

Contextual bandits make this problem a little harder by giving the learning agent a bit less information. Now the environment samples an input, but notice there's no correct output associated with it. The agent takes an action, and it receives some cost c sampled from a probability distribution, and the agent doesn't know what this distribution is; that's what makes the problem hard. So the environment asks the agent a question, the agent answers, and the environment gives it a noisy score on the answer. This setting has a lot of applications: personalized recommendations is a big one, along with advertising. You have a customer and you know what they liked in the past, so you have to predict what they're going to like in the future and show them appropriate ads or links: which ad or book you want to advertise to them, which video you want to show them, and so on. The big difference between this and the supervised learning setting is that you don't have analytic access to the loss function you're trying to optimize, so in particular you can't differentiate through it. We don't know the process that generates c, so we can't compute the gradient of the loss function and use it to tune the agent's parameters. That makes the problem harder, or at least it means you have to use a different kind of algorithm.

Lastly, reinforcement learning is almost the same as the contextual bandit setting, except now the environment is stateful. Instead of sampling the input from scratch from the same distribution at every time step, the state evolves over time: there's a transition probability distribution, called P here, where the state x_t is conditioned on the previous state and the previous action. That makes the problem quite a bit harder, for a number of reasons. For one thing, the inputs you get depend on the actions you take, which makes it harder to develop a stable, reliable algorithm: as the agent starts to learn, it gets different inputs, and that can lead to all sorts of out-of-control behavior. It also means you have delayed effects: since the system is stateful, you might need to take a lot of actions to get into the right state, so you can't just act greedily at every time step; you have to think ahead, effectively.

To summarize, there are two differences. First, you don't have full analytic access to the function you're trying to optimize; you have to query it through interaction. Second, you're interacting with a stateful world, which means the input you get depends on your previous actions. If you take just the first of those differences between supervised learning and reinforcement learning, you get the contextual bandit setting, so that's sort of halfway in between.

Okay, I realize this audience probably has people with different interests: some of you are doing research and want to know the latest in the research world, and some of you want to apply these machine learning techniques to practical applications. This next part is for the latter group. If you have some problem where you think reinforcement learning might be relevant and you're wondering whether you should apply it, I should first say that the answer might be no: it might be overkill.
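Before going further, here's a minimal runnable sketch of the contextual bandit loop described above: the environment samples a context, the agent picks an action, and the agent only ever sees a noisy sampled cost, never the underlying cost distribution. The epsilon-greedy strategy, the cost table, and all the numbers are my own illustrative choices, not anything from the talk.

```python
import random

def run_bandit(n_steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a made-up 2-context, 3-action bandit.

    The true mean costs below are invented for illustration; the agent
    never sees them directly, it only observes noisy sampled costs.
    """
    rng = random.Random(seed)
    true_cost = {0: [0.9, 0.2, 0.5],   # context 0: action 1 is cheapest
                 1: [0.3, 0.8, 0.1]}   # context 1: action 2 is cheapest
    # Running estimates of E[cost | context, action] and visit counts.
    est = {x: [0.0, 0.0, 0.0] for x in true_cost}
    cnt = {x: [0, 0, 0] for x in true_cost}
    for _ in range(n_steps):
        x = rng.randrange(2)                       # environment samples a context
        if rng.random() < epsilon:                 # explore
            a = rng.randrange(3)
        else:                                      # exploit current estimates
            a = min(range(3), key=lambda i: est[x][i])
        c = true_cost[x][a] + rng.gauss(0.0, 0.1)  # noisy score; distribution unknown to agent
        cnt[x][a] += 1
        est[x][a] += (c - est[x][a]) / cnt[x][a]   # incremental mean update
    return est

est = run_bandit()
```

Note that the agent tunes itself purely from queried cost samples; at no point does it differentiate through the cost-generating process, because it can't.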
This is especially true of deep reinforcement learning: it's a set of fairly new techniques that won't work well out of the box. These techniques aren't that well established yet and they have a lot of knobs to tune, so it might be worth investigating some other methods first.

First, if your problem has a small number of parameters you're trying to optimize over, and you have a simulator you can run lots of experiments on, then derivative-free optimization methods are likely to be better than reinforcement learning, or at least easier to get working. You give these methods a black-box function: you put in a parameter vector and get out a noisy estimate of the score, and the algorithm optimizes over the parameters going into that black box. There's a variety of derivative-free optimization methods; they're easier to understand than reinforcement learning, and they do pretty much work out of the box.

Second, a lot of problems can be seen as contextual bandit problems, where the statefulness of the world isn't that relevant. For example, people treat advertising as a contextual bandit problem most of the time: you decide which ad to present to the user, and they either click on it or they don't. Really the user is somewhat stateful, because if you show them a terrible ad they might go and download an ad blocker, so your actions do have repercussions; but often you can approximate it as a contextual bandit problem where there is no state. There's a better theoretical understanding of contextual bandit problems, and there are methods with some guarantees, so if your problem is a contextual bandit problem, you might want to use those kinds of algorithms instead.

Lastly, the operations research field has been using these methods on real problems for a while. Their methods are pretty much the basic algorithms, policy iteration and value iteration, but with well-developed ways of doing feature engineering for these problems that end up working pretty decently. Those techniques are also worth considering before trying to throw a big neural network at your problem.

Okay, now that I've talked about why not to use deep reinforcement learning, or what it's not good for, let me mention some recent success stories in deep reinforcement learning: achievements that probably wouldn't have been possible with these other techniques. A few years ago there was a pretty influential result by Mnih et al. from DeepMind, where they used a deep Q-learning algorithm to play Atari games using the screen images as input. That's hard because you're trying to do different things in all these games, and some of them are fairly complicated, so it's pretty remarkable that the same simple algorithm can solve them all. Since then, people have also solved this domain using policy gradients and another algorithm called DAgger. Another groundbreaking result, also by DeepMind, was beating a champion-level player at Go, using a combination of supervised learning from expert games, plus policy gradients to fine-tune the supervised policy, plus Monte Carlo tree search, plus value functions to make the search work better; so, a combination of techniques. In robotics, some of my colleagues at Berkeley had some very nice results learning manipulation tasks in real time, using an algorithm called guided policy search on the PR2 robot. Some of my colleagues and I have been working on robotic locomotion using policy gradient methods. People have worked on locomotion for a while and achieved pretty good results using highly engineered, domain-specific methods, but previously there hadn't been much success solving it with general methods. And last, there have been some recent results playing 3D games using policy gradients; in fact, there was a contest I heard about a couple of days ago with the new ViZDoom task, which is pretty nice, so you might want to check out ViZDoom.

Okay, that's it for the high-level overview part of this talk; now I'm going to get into the actual formalism and the technical details. The basic object in the field of reinforcement learning is the Markov decision process (MDP). An MDP is defined by the following components: a state space, all the different states of the system; an action space, all the actions the agent can take; and a probability distribution P(r, s' | s, a), which determines the probability of the next state s' and reward r given the current state s and action a. It's a conditional probability distribution; sometimes people split out a separate reward function, but that's basically an equivalent formulation. Sometimes there are extra objects defined as well: we'll consider an initial state distribution, so the world starts out in a state sampled from that distribution. The typical optimization problem you want to solve, given this MDP, is to maximize expected cumulative reward, though there are various ways of defining that more precisely, which I'll go into later.
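To make the MDP components concrete, here's a tiny episodic MDP sketched in code. The chain environment, its rewards, and the two policies are all invented purely for illustration.

```python
import random

def rollout(policy, rng, max_steps=100):
    """Sample one episode from a toy episodic MDP.

    States 0..3 on a chain; the episode terminates when the agent reaches
    state 3. Action 1 moves right, action 0 stays put; every step costs
    reward -1, so shorter episodes collect more total reward.
    """
    s = 0                      # initial state distribution: always start at 0
    rewards = []
    for _ in range(max_steps):
        a = policy(s)
        # transition distribution P(s', r | s, a): deterministic here
        s_next = min(s + 1, 3) if a == 1 else s
        rewards.append(-1.0)
        s = s_next
        if s == 3:             # terminal state ends the episode
            break
    return sum(rewards)

rng = random.Random(0)
always_right = lambda s: 1                       # a deterministic policy pi(s)
lazy = lambda s: 1 if rng.random() < 0.5 else 0  # a stochastic policy pi(a|s)
ret_right = rollout(always_right, rng)
ret_lazy = rollout(lazy, rng)
```

The deterministic policy finishes in three steps, so it achieves the best possible return here; the coin-flip policy wastes steps and does no better.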
Okay, so there are various different settings of reinforcement learning, each defining a slightly different optimization problem. The one we'll be most concerned with is called the episodic setting. Here the agent's experience is split up into a series of episodes of finite length. In each episode, we first sample the initial state of the world from some probability distribution mu, and then the agent keeps acting until the world ends up in some terminal state.

To give some examples of what terminal states might be like and how an episodic reinforcement learning problem might look: one case is when termination is good and you want to terminate the episode as fast as possible. If we imagine setting up a task with some kind of taxi robot that should get to the destination as fast as possible, then the episode would be one trip, and the robot is trying to terminate the episode as quickly as it can. Another example is a waiter robot with a fixed-length shift, which has to do as well as possible during that shift; there the episode has a fixed length, and the waiter has to maximize, say, tips or customer happiness. And you could imagine another kind of task where termination is bad and you want the episode to last as long as possible; you can view life as an example of that, but also a walking robot that you want to walk as far as possible before it falls over. In this setting it's pretty easy to define the goal: we just want to maximize the expectation of the total reward per episode.

The last object we're going to introduce here is the policy, which is just the function the agent uses to choose its actions; it's denoted by pi. We have deterministic policies, where the action is a deterministic function of the state, written a = pi(s).
We also have stochastic policies, where the policy is a conditional probability distribution, a ~ pi(a | s).

Now let's make the setting of the episodic MDP a bit more precise. First we sample the initial state s_0 from the distribution mu. Then we sample the first action a_0 from the policy, then we sample the next state and reward from the transition probability distribution, and so on, until we reach a terminal state s_T. The quantity we care about is the sum of all these rewards, r_0 + r_1 + ... + r_{T-1}, and we want to maximize its expectation: eta(pi) is defined as the expected total reward of the policy pi. Here's a picture that illustrates exactly the same thing; you can look at it as a graphical model.

Lastly, in the policy gradient section in particular, we're going to be interested in parameterized policies. Here we have a parameter vector theta which specifies exactly what the policy is. For example, the family of policies could be a neural network: you fix a certain architecture, and theta specifies all the weights of that network. This could be a deterministic policy, of course, or a stochastic policy. If you're wondering what a policy would concretely look like, that is, how you use a neural network to represent your policy: you do exactly the same thing you would do if this were a classification or regression problem. The state is your input and the action is your output. If you have a discrete action space, you use a network that outputs a vector of probabilities over the different actions; this is exactly like a classifier. If you have a continuous action space, you have the network output the mean and the diagonal of a covariance matrix of a Gaussian distribution; this is just like doing regression. So you can use the same kinds of architectures you'd use in supervised learning.

Okay, that's it for the formalism of MDPs. Now I'm going to go into policy gradient methods, which are one broad and general class of reinforcement learning methods, and quite effective. To give a brief overview: here capital R means the sum of rewards of the whole episode. Our optimization problem is to maximize the expectation of the total reward given our parameterized policy pi_theta. The intuition for how the algorithm is going to work is that we collect a bunch of trajectories, meaning we run a bunch of episodes using our policy, and then we make the good trajectories more probable. In some of the trajectories the agent was lucky and they were really good; in some the agent was unlucky and they were bad. The ones that were good, meaning there was high reward, are the ones where the agent probably took good actions, so we want to increase the probability of the actions from those trajectories. The most basic version of policy gradient methods just tries to make the good trajectories more probable, without trying to figure out which actions were good and which were bad. Slightly more elaborate methods try to figure out which actions were good and which were bad, and then make the good actions more probable. And lastly there's another class of methods that actually tries to push the actions towards better actions: they differentiate the loss function with respect to the actions and push the actions towards better ones. We're mostly going to talk about the first two here.
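To make the parameterized-policy idea concrete, here's a minimal sketch of the two output types just described: a softmax over logits for a discrete action space, and a Gaussian parameterized by a mean and a log standard deviation for a continuous one. The linear "networks" stand in for real neural networks, and all the sizes and weights are arbitrary.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def discrete_policy(theta, state):
    """Categorical policy for a discrete action space: a linear layer
    (standing in for a neural net) followed by a softmax, exactly like
    the output layer of a classifier. theta is a weight matrix with one
    row of weights per action."""
    logits = [sum(w * s for w, s in zip(row, state)) for row in theta]
    return softmax(logits)

def gaussian_policy(theta_mean, theta_logstd, state, rng):
    """Policy for a continuous action space: the network outputs the mean
    of a Gaussian, with a log-std parameter for the (here 1-dimensional)
    diagonal covariance; like doing regression, then sampling."""
    mean = sum(w * s for w, s in zip(theta_mean, state))
    std = math.exp(theta_logstd)
    return rng.gauss(mean, std)

state = [0.5, -1.0, 2.0]
probs = discrete_policy([[0.1, 0.2, 0.3], [-0.1, 0.0, 0.1]], state)
action = gaussian_policy([1.0, 0.0, 0.5], -1.0, state, random.Random(0))
```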
Oh, there's a question. Yeah, good question: we are maximizing over the policy, we're trying to find the best policy, but here the policy is assumed to be parameterized. There's some parameter vector theta that specifies the policy, and we just want to maximize with respect to theta. Any other questions? Okay.

So there's a very fundamental concept called the score function gradient estimator, which underlies policy gradient methods. To introduce it, we're not going to talk about policies or RL at all; we'll just assume we have some expectation E_x[f(x)], where x is sampled from some parameterized probability distribution p(x | theta), and we want to compute the gradient of this expectation with respect to theta. There's a general formula for this, and the way you derive it is to write the expectation as an integral, move some things around, swap the integral with the derivative, and turn it back into an expectation. What you get at the end is the bottom line: the gradient is the expectation of the function value times grad log probability, E_x[f(x) grad_theta log p(x | theta)]. This is an unbiased estimator of the gradient, meaning that if we get enough samples, it'll converge to the right thing. The way you compute this estimator, meaning the way you get a noisy estimate of the gradient of the expectation, is you just get one sample x from the distribution and multiply f(x) times grad_theta log p(x | theta). The only requirements for using this estimator are that we can analytically compute the probability density and differentiate it with respect to theta. There's another way of deriving it using importance sampling: you write down the importance sampling estimator for the expectation, swap the derivative with the expectation, and you get the same thing.

Okay, so now let me give a little intuition about this estimator. f(x) measures how good the sample x is. g-hat here is our gradient estimator, meaning what we get if we take one sample x_i and compute the expression above; this is our estimate of the gradient. If we move in the direction g-hat, that pushes up the log probability of our sample x_i in proportion to how good it is. If we got a really good function value, we try to push up its log probability a lot, and if it was a bad function value, we don't push it up very much. So it's pretty simple intuition. The really nice thing is that this is valid even if f(x) is discontinuous; or if f(x) is unknown, meaning you don't get to differentiate it, you just get to see function values; or if the sample space is a discrete set, so x doesn't even have to be continuous. This is quite remarkable: you don't need to know exactly what the function is that you're optimizing, you just have to be able to query its values. It means this is a way of differentiating through a system that has non-differentiable pieces. For example, in robotic locomotion, one issue is that you have contacts between the robot's feet and the ground: you make and break contact, and that causes a discontinuous change in the dynamics, which makes it really hard to use smooth optimization techniques to come up with the right behavior. When you use this kind of gradient estimator along with policy gradients, which I'm going to talk about very soon, you can actually optimize such a system even though it has non-differentiable pieces in it.
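Here's a small numerical sketch of the score function estimator on a case where the true gradient is known in closed form, so we can check it. The choice of f(x) = x^2 with x drawn from a unit-variance Gaussian centered at theta is an arbitrary test case of my own.

```python
import random

def score_function_grad(theta, n_samples=100_000, seed=0):
    """Estimate d/dtheta E_{x ~ N(theta, 1)}[f(x)] with the score
    function estimator: average f(x) * grad_theta log p(x | theta).
    For a unit-variance Gaussian, grad_theta log p(x | theta) = x - theta.
    Note we only ever query values of f, never its derivative."""
    rng = random.Random(seed)
    f = lambda x: x * x
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)
        total += f(x) * (x - theta)
    return total / n_samples

theta = 1.5
g_hat = score_function_grad(theta)
# Analytic check: E[x^2] = theta^2 + 1 for x ~ N(theta, 1),
# so the true gradient is 2 * theta.
g_true = 2 * theta
```

With enough samples the noisy estimate lands close to the analytic gradient, which is exactly the unbiasedness claim from the derivation above.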
Okay, here's another little picture of what's going on. We have our function f(x), whose expectation we're trying to maximize, and we have our probability density p(x). We sample a bunch of values from the probability density; those are the blue dots on the x-axis. Then we look at the function values, and we try to push the probability distribution so that the probability goes up at these samples in proportion to the function value. On the right side of the curve, that means we're pushing the probability value up really hard, and on the left side we're pushing it up softly. So what's going to happen is that the probability density will slide to the right, if you can imagine a sort of physical analogy there.

Okay, so that's the score function gradient estimator. This is a general technique that can be used in various machine learning problems; now we're going to apply it to the reinforcement learning setting, and we're going to take our random variable x to be a whole trajectory: state, action, reward, state, action, reward, and so on until the end of the episode. To get the gradient of the expected total reward, all we've got to do is compute the grad log probability of the trajectory times the total reward. Now, the probability of the trajectory sounds like a really unfriendly quantity, because there's a long, complicated process with lots of time steps that generates this trajectory, but we can write out what this probability density is: it's just a product of probabilities. We have mu(s_0), the initial state distribution, and then at every time step we sample the action according to pi and sample the next state and reward according to the dynamics model. Taking the log turns that product into a sum, and here's the cool part: everything that doesn't contain theta drops out. The thing is, there are parts of this probability density P(tau | theta) that we don't have access to: since this is reinforcement learning, we don't assume we know the dynamics model of the system, we just find out about it by sampling, by doing episodes. But since the product turns into a sum, all the pieces we don't know, the log P terms for the dynamics and the log mu term, just drop out, so it doesn't matter. What we get in the end is a sum of log probabilities of actions: grad log pi of action given state. So our formula for the gradient of the expectation is just the expectation over trajectories of the total reward of the trajectory times the gradient of the sum of all the action log probs. The interpretation is that we take our good trajectories and try to increase their probability in proportion to how good they are. You can think of this as being similar to supervised learning, where we treat the trajectories with high rewards as positive examples in our supervised learning problem: we use them to train the policy, basically treating their actions as positive examples.

Okay, now we can improve this formula a little bit. What I just gave is the most basic unbiased estimator of the policy gradient: if we take the expression inside the expectation on the right-hand side and take one sample of it, it has the right mean, so if we get enough samples we'll recover the policy gradient. But we can also write down other formulas that have the same mean but lower variance.
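The trajectory-level formula can be sketched concretely for a tabular softmax policy, where grad log pi(a | s) has the familiar closed form "one-hot of the taken action minus the probability vector". The two-state setup and all numbers here are made up for illustration.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def policy_grad_estimate(theta, trajectory, total_reward):
    """Single-sample policy gradient estimate:
    g_hat = total_reward * sum_t grad_theta log pi(a_t | s_t).
    theta[s] holds the logits for state s; for a softmax policy,
    grad log pi(a | s) = one_hot(a) - probs(s)."""
    grad = {s: [0.0] * len(theta[s]) for s in theta}
    for s, a in trajectory:
        probs = softmax(theta[s])
        for i in range(len(probs)):
            grad[s][i] += total_reward * ((1.0 if i == a else 0.0) - probs[i])
    return grad

theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}   # two states, two actions, uniform policy
traj = [(0, 1), (1, 0)]                  # (state, action) pairs from one episode
g = policy_grad_estimate(theta, traj, total_reward=2.0)
# A small step along g should raise the probability of the actions taken
# in this (high-reward) trajectory, which is the whole point of the formula.
lr = 0.1
new_theta = {s: [t + lr * gv for t, gv in zip(theta[s], g[s])] for s in theta}
```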
So we can come up with better estimators for the policy gradient, and that's actually quite important, because the one from the previous slide is really bad when you have a large number of time steps, meaning it has really high variance. The first thing we can do is use the temporal structure of the problem. By the way, deriving this next bunch of formulas just takes a bunch of really straightforward manipulation where you move around expectations; I'm not going to go through all the math, I'll just say what the formulas are. We can repeat the same argument from the previous slide to derive the gradient estimator for a single reward term: we end up with that reward term times the gradient of the sum of log probs up to that time. Summing over reward terms, we get a new formula where we're no longer multiplying the grad log prob of the whole trajectory by the sum of all the rewards. Look at that bottom formula: now we have a sum over time of the grad log probability of the action at that time, times the sum of future rewards. In the formula from the previous slide we would have had all the rewards in that sum, but now we just have the future rewards, and that makes sense, because an action can't affect the probability of previous rewards; to figure out whether an action is good, we should only be looking at the future rewards. So this is a slightly better formula than the one on the previous slide: it has exactly the same mean, but the expression inside the expectation has lower variance.

We can further reduce the variance by introducing a baseline. We can take any function b that takes in a state and outputs a real number, and subtract it from our sum of future rewards, and this doesn't affect the mean of the estimator at all: we haven't changed the expectation by introducing this baseline, so for any choice of b this gives us an unbiased estimator. By the way, if you're not that familiar with the terminology of estimators, what I'm saying is: we have an expectation on the right-hand side of that formula, and the quantity inside the expectation is what's called the estimator; if we get a bunch of samples, we can estimate the thing on the left-hand side, which is what we care about. So when I say it's an unbiased estimator, that just means the equation is correct: the right-hand side equals the left-hand side. This works for any choice of baseline, and a near-optimal choice is the expected return, the expected sum of future rewards. The interpretation is that if we took an action, we only want to increase its probability if it was a good action. How do we tell whether it was a good action? The sum of rewards after that action should have been better than expected. b(s) is the expected sum of rewards, and we just take the difference between the measured quantity and the expected one.

Okay, that was a pretty key thing for variance reduction, and I'm going to introduce one last variance reduction technique. Actually, all three of these are really important: basically nothing is going to work, except for maybe really small-scale problems, unless you do these things. The last variance reduction technique is to use discounts. The discount factor ignores delayed effects between actions and rewards. What we do here looks kind of like a hack, but there's an explanation for it: instead of taking the sum of future rewards, we take a discounted sum of rewards, adding an exponential decay factor gamma.
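The discounted return-to-go at each step, and the baseline-subtracted advantage built from it, can both be computed with a single backward pass. This small sketch uses made-up rewards and a constant baseline purely for illustration; in practice the baseline would be a fitted function of the state.

```python
def returns_to_go(rewards, gamma):
    """Discounted sum of future rewards from each time step:
    G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    computed in one backward pass over the episode."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def advantages(rewards, baselines, gamma):
    """Advantage estimate per step: discounted return-to-go minus a
    state-dependent baseline b(s_t), i.e. 'how much better than
    expected was what happened after this action'."""
    return [g - b for g, b in zip(returns_to_go(rewards, gamma), baselines)]

rewards = [1.0, 0.0, 2.0]
G = returns_to_go(rewards, gamma=0.5)
A = advantages(rewards, baselines=[1.0, 1.0, 1.0], gamma=0.5)
```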
grad-log-probability by some future reward, we weight it by a quantity that decays with time. People typically use gamma = 0.99 or gamma = 0.95. With gamma = 0.99, after 100 time steps you're reducing the reward's weight by a factor of 1/e, so you're exponentially decaying the effect of the future rewards. The intuition is that an action shouldn't affect rewards really far in the future: the assumption is that the system doesn't have really long-term memory — the effects aren't that far delayed — so you can just ignore the interaction between an action and a reward way in the future. Now, instead of taking the baseline to be the expected sum of future rewards, we use the expected discounted sum, so we're measuring whether the action was better than expected according to the discounted return. There's also a more general class of formulas that look like the one I just wrote. The one at the top of the slide is pretty good — almost as good as anything you're going to do, to within a small constant factor — but the more general class looks like grad-log-probability times some quantity A-hat, which we call the advantage estimate. It has a more precise definition — how much better was this action than the average action taken by the policy — but informally it just means: how much better was the action than expected. This formula makes a lot of sense, because we want to increase the probability of the good actions and decrease the probability of the bad ones, so we should increase each action's probability in proportion to its goodness.
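As a minimal sketch of the quantities just described — the discounted return-to-go and the baseline-subtracted advantage estimate — here is one way to compute them for a single episode (function names are my own, not from the talk):

```python
import numpy as np

def discounted_returns_to_go(rewards, gamma=0.99):
    """G_t = sum_{t' >= t} gamma^(t' - t) * r_{t'}, computed by a backward pass."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantage_estimates(rewards, baseline_values, gamma=0.99):
    """Advantage estimate: discounted return-to-go minus the baseline b(s_t)."""
    return discounted_returns_to_go(rewards, gamma) - np.asarray(baseline_values)
```

The backward pass makes this O(T) instead of the O(T^2) you'd get by summing the future rewards separately at every time step.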
To summarize: I've given you a gradient estimator, an expression you can compute that gives a noisy estimate of the policy gradient. How do you actually turn this into an algorithm? Here's what the algorithm looks like; it's pretty much what you'd expect. You initialize your policy parameters and your baseline function. At each iteration, you execute the current policy to collect a bunch of whole episodes, meaning whole trajectories. For each time step in each trajectory, you compute the return — the sum of discounted rewards following that time step — and the advantage estimate, which is the discounted return minus the baseline. Then you refit the baseline by trying to make the baseline function equal the returns, and you update the policy using a policy gradient estimator. So you're just doing SGD on the policy while updating the baseline as you go along. That's the vanilla policy gradient algorithm. It has been used to obtain some pretty good results, so it's not that bad an algorithm, but there are several directions in which it can be improved. One issue you run into is step sizes. In supervised learning, step sizes aren't that big a deal: maybe you take too big a step, but that's okay, you'll fix it on the next update, and your current classifier doesn't affect what inputs you're getting. Even if your network just thrashes around for a while because you're taking steps that are too big, that's not going to cause lasting problems; you can just anneal the step size.
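The vanilla policy gradient loop described above can be sketched end-to-end on a toy one-step problem (a two-armed bandit of my own invention; the softmax policy, batch size, and learning rate are all illustrative assumptions, not anything from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy problem: a one-step "MDP" (a bandit) with two actions whose mean
# rewards are 0.0 and 1.0; the policy is a softmax over two logits
true_means = np.array([0.0, 1.0])
theta = np.zeros(2)          # policy parameters (logits)
baseline = 0.0               # scalar baseline, refit as a running mean of returns

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for iteration in range(200):
    probs = softmax(theta)
    grad = np.zeros(2)
    returns = []
    for _ in range(32):                      # collect a small batch of episodes
        a = rng.choice(2, p=probs)
        r = true_means[a] + rng.normal(0.0, 0.1)
        returns.append(r)
        glogp = -probs                       # gradient of log softmax prob...
        glogp[a] += 1.0                      # ...is (indicator - probs)
        grad += glogp * (r - baseline)       # advantage = return - baseline
    theta += 0.1 * grad / 32                 # SGD step on the policy
    baseline = 0.9 * baseline + 0.1 * np.mean(returns)  # refit the baseline
```

After training, `softmax(theta)` should put most of its probability on the better action; the baseline keeps the per-sample gradient terms centered, which is exactly the variance reduction discussed above.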
You can start off with a large step size and anneal it down to zero, and that works pretty well. In reinforcement learning, though, if you take too big a step you might wreck your policy. Even if you don't actually change the network that much — you don't lose all your nice features — you might change its behavior a little too much, and now it does something totally different and visits a totally different part of state space. Since in reinforcement learning the system is stateful and your state distribution depends on your policy, this is a genuinely different problem: after that bad step, the next batch of data you get was collected by the bad policy, and now you may never recover, because you've forgotten everything. One way to fix this is to try to stop the policy from taking too big a step. You can look at the KL divergence between the old policy and the new policy — before and after the update — and make sure the distributions aren't too different, so you're not taking too big a step; it's kind of an obvious thing to do. My colleagues and I developed an algorithm called trust region policy optimization, which looks at the action distributions and makes sure the KL divergence isn't too large. This is closely related to earlier natural policy gradient methods, which do something similar but usually without a hard constraint on the KL divergence. Another type of extension of policy gradient methods is to use value functions for more variance reduction: instead of just using them as a baseline, you can use them more aggressively and introduce some bias. I won't go into the details in this talk.
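The KL-divergence check at the heart of the trust-region idea can be sketched like this for discrete action distributions (the `max_kl` threshold is an illustrative assumption; TRPO itself solves a constrained optimization rather than a simple accept/reject test):

```python
import numpy as np

def categorical_kl(p_old, p_new, eps=1e-8):
    """KL(pi_old || pi_new) for categorical action distributions,
    averaged over a batch of states (rows)."""
    p_old = np.asarray(p_old)
    p_new = np.asarray(p_new)
    kl = np.sum(p_old * (np.log(p_old + eps) - np.log(p_new + eps)), axis=-1)
    return kl.mean()

def step_ok(p_old, p_new, max_kl=0.01):
    """Trust-region style check: is the candidate update's action
    distribution close enough to the old policy's?"""
    return categorical_kl(p_old, p_new) <= max_kl
```

In practice you would evaluate the two policies' action distributions on the same batch of visited states and shrink or reject any update that exceeds the threshold.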
These are sometimes called actor-critic methods. There's also another approach, which I briefly touched on in an earlier slide, where instead of just trying to increase the probability of the good actions, you differentiate your loss with respect to the actions. This is like the reparameterization trick used for density modeling in unsupervised learning: you're trying to push the actions towards better actions rather than just increasing the probability of good ones. With both of these extensions you're potentially decreasing your variance a lot, but at the cost of increasing bias, and that makes the algorithms a little harder to understand and to get working: with high variance, if you crank up the amount of data you can always drive the variance down as much as you want, but with bias, no matter how much data you get, the bias remains — and if your gradient is pointing in the wrong direction, you're not going to learn anything. Okay, that's it for the policy gradient section of this talk. I want to show a quick video of some work my colleagues and I did on learning locomotion controllers with policy gradient methods, which I found pretty exciting when I saw it, so hopefully you'll find it interesting. What we've got here is a simulated humanoid robot in a realistic physics simulator called MuJoCo. It has a neural network policy which takes in the joint angles of the robot and a little other kinematic information — joint velocities and the positions of the different links of the robot. So the input is pretty much the raw state of the robot — no clever feature
engineering there. The output is the joint torques, which are set 100 times a second, so we're just mapping from joint angles to joint torques. We define a reward function for moving forward as fast as possible, and the episode ends when the head goes below a certain height, meaning the robot fell over. That's basically the setup; there was a little bit of tweaking of the reward function, but nothing too extensive. You can see that first it just falls forward a lot of times, then slowly it starts to develop a half-decent-looking walk, and eventually it gets it down pretty well. At the very end it could just keep running indefinitely — I think it was actually stable in a strong sense, meaning I could leave it for 15 minutes and it wouldn't fall over; it would just keep going. Here's another robot model that we created without too much thought — we just decided to put a bunch of legs on this thing, so we don't even know how it's supposed to walk — and we gave it to the same algorithm, and it figured out some kind of crazy way to walk. That's the nice thing about reinforcement learning: you don't even need to know what you want it to do. The physics are a little unrealistic here, but in this one we used a model similar to the first demo and just gave it a reward for having its head as high as possible, and it figures out how to get up off the ground. Oh, let's see — I have low battery; does anyone have a charger? Oh, thanks a lot, you're a lifesaver. Okay, any questions about policy gradients before I move on to the next part? Oh yes — the question was whether the system is time-invariant. Yes, it's assumed to be stationary —
right, and also that it doesn't change from one episode to the next. Of course, in some real-world problems that might not be the case; I think a non-stationary environment is also an interesting problem setting. The next question was: to learn a good baseline, do you need to know the dynamics of the system? No — you can learn it by regression: you estimate the empirical returns and then fit a function to them. The next question: the discount factor should cause the policy to disregard any effects delayed by more than about 100 time steps, so how does this robot still learn to stand up, even though that might take more than 100 time steps? You're right, and in fact I'd say these methods aren't guaranteed to work well if you have more than about 100 time steps. Sometimes — often — they work anyway, but there's no guarantee. I think something pretty fundamental is missing in how we deal with really long time scales. People have recently been thinking about hierarchical reinforcement learning, where you have different levels of description of the system — one level with a small time step, and then successively larger time steps — which allows you to plan over much longer horizons. That's currently an active area of research, but I'd say none of these methods are guaranteed to do anything reasonable if you have more than 1/(1 − gamma) time steps between action and reward. Oh yes — in this kind of task, if you introduced terrain, could it cope? I think if you didn't train it to deal with terrain, then it might fail —
it probably would fail. Actually, I'm not sure it would fail, because the funny thing is that these policies are really robust: you train them with a stochastic policy, so there's a lot of noise being generated by the policy itself. In practice the policy is able to deal with the huge noise it introduces, and as a result I've found that if you change the dynamics parameters a little, it can usually still work. But there's no guarantee it will do anything if you give it something you didn't train it for. I think you probably could do the same kind of training with terrain; I didn't have any terrain, so I didn't try it, but that would be nice to try. Okay, I'm going to move on to the next part of the talk — feel free to find me afterwards if you have more questions. Now I'm going to talk about a different type of reinforcement learning algorithm. The previous methods are distinguished by the fact that they explicitly represent a policy — the function that chooses your actions — and try to optimize it with respect to the parameters of the policy. The nice thing about the policy gradient methods we just talked about is that you're optimizing the thing you care about, and you're optimizing it with gradient descent, which makes it kind of easy to understand what's going on: if you're getting a proper gradient estimate and you take small enough steps, then you should be improving. Of course you could still get stuck in a bad local minimum, but at least it's a local minimum, and you can use our understanding of optimization to figure out what's going on. This next class of methods is a little different, because they're not optimizing the policy directly; they're learning something else, called a Q function, which measures
how good state-action pairs are — I'll say that more formally later, but it's just measuring how good the actions are. These methods are able to exactly and efficiently solve MDPs in the setting where you have a finite number of states and actions; they're the preferred methods for exactly solving MDPs in that setting. You can also apply them with continuous states and actions, using expressive function approximators like neural networks, but then it's a little harder to understand what's going on in these methods — when they're going to work and when they're not. Let me define the relevant quantities. The Q function is defined as the expected discounted sum of rewards when we condition on the first state and the first action — s0 = s and a0 = a — while acting under the policy pi. By convention I'm starting at time step zero; I could also have said that we're taking r_t + r_{t+1} + r_{t+2} and so on, but since we're assuming the system is stationary, it's exactly the same, so for ease of notation I'll always use times 0, 1, 2, 3, and so on. The Q function is just telling you how good a state-action pair is under your current policy. The state value function, usually called V, conditions only on the state: it tells you how good that state is, the expected return from that state. Lastly, the advantage function is the difference between the Q function and the state value function: how much better is that action than what the policy would have done. We won't talk much about advantage functions in this section, but this is exactly what the advantage estimator we briefly mentioned in the previous section corresponds to.
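In symbols, the three quantities just defined are (a reconstruction in my own notation, assuming discount factor gamma):

```latex
Q^{\pi}(s, a) = \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r_t \,\Big|\, s_0 = s,\; a_0 = a;\; \pi\Big]
\qquad \text{(state-action value)}

V^{\pi}(s) = \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r_t \,\Big|\, s_0 = s;\; \pi\Big]
\qquad \text{(state value)}

A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
\qquad \text{(advantage)}
```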
Here we're going to consider methods that explicitly store and update the Q function instead of the policy, and that update it using what are called Bellman equations. A Bellman equation, in general, is a consistency equation that should be satisfied by a value function. Here I'm writing down the Bellman equation for Q^pi, and what it says is that the expected sum of rewards equals the first reward plus the expected sum of rewards after the first time step. It's saying something pretty simple: r0 is the first reward, and V^pi(s1) adds up all the rewards after time step zero. In the second equation we write out this relationship purely in terms of the Q function, so we have a consistency equation that the Q function should satisfy. We can slightly generalize this to use k time steps instead of just one: we expand the expected sum of rewards to write out k rewards explicitly, and then cap it off with the value function at the very end, which accounts for all the rewards after that. Okay, so here's the Bellman equation from the previous slide; now I'm going to introduce a very important concept called a Bellman backup. We have an equation that the value function Q^pi should satisfy, but let's assume we don't know Q^pi, and say we have some other Q function. We define a Bellman backup operator that operates on an arbitrary Q function: it maps the Q function to a new Q function, and it's defined by taking the right-hand side of the Bellman equation and plugging in our Q function Q instead of Q^pi. Q^pi is a fixed point of this operator, meaning that if we apply the backup operator to it, we get the same thing
back. And, very nicely, if we keep applying this backup operator repeatedly to any arbitrary initial Q function, the sequence converges to Q^pi, which is the fixed point of the operator. So that gives you an iterative algorithm to estimate Q^pi: take any initial Q function and repeatedly apply the backup operator. Now there's another kind of Q function to introduce, called Q*. The previous Q function, Q^pi, tells you the values under some particular fixed policy pi, so it only makes sense with regard to that policy; Q* instead involves the optimal policy. Q* is simply defined as the Q function of the optimal policy pi*, and it also happens to be the pointwise maximum, over all policies, of the Q function at each state-action pair. The optimal policy is deterministic, and it satisfies the equation that it takes the argmax of the optimal Q function. Recall that the Q function tells you your expected return if you take the given action, so obviously the optimal policy should take the action with the best expected return — that's why this last equation is evident. Now that we know this property of the optimal policy, we can rewrite the Bellman equation. The first equation is just the Bellman equation from the previous slides for a given policy pi; now we take the expectation over actions and replace it by what the optimal policy does, which is to take the argmax of the optimal Q function. (There's a typo on my slide: it should say Q* on the right-hand side.) So now we have a Bellman equation that the optimal
policy should satisfy. Now we can do the same thing with the backup operator: we take that Bellman equation and plug an arbitrary Q function into the right-hand side instead of the optimal Q function Q*. Q* is a fixed point of this Bellman operator — that's just a restatement of the Bellman equation — and again, if we apply the operator repeatedly to an arbitrary initial Q function, it converges to Q*, the optimal Q function. In both cases the Banach fixed-point theorem can be used to prove convergence. Based on these ideas, there are two classic algorithms for exactly solving MDPs, sometimes called dynamic programming algorithms because they're quite related to the dynamic programming algorithms used to solve search problems. The first is value iteration: you initialize your Q function arbitrarily and repeatedly do Bellman backups until it converges. The second is policy iteration: you initialize your policy arbitrarily; at each step you first compute — either exactly or approximately — the Q function of that policy, and then you update your policy to be the greedy policy for the Q function you just computed, meaning the new policy takes the argmax of the Q function: the action that's best according to it. I didn't say anything about how you compute Q^pi. One way is to compute it exactly: it happens that the Bellman equation for Q^pi is a linear system of equations, so often you can just solve it. More commonly, if you have a large-scale problem you might not be able to solve that system, so what people often do is a finite number of Bellman backups, which doesn't exactly converge to Q^pi but gives you something approximately equal to it. So those are algorithms you can implement if you have full access to the MDP — if you know the whole table of probabilities.
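A tabular value iteration sketch, for the setting with full access to the MDP (the array layout is my own assumption):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=500):
    """Tabular value iteration: repeated Bellman optimality backups.

    P[s, a, s'] : transition probabilities, R[s, a] : expected reward.
    Returns an approximation of Q*, the fixed point of the backup operator.
    """
    num_states, num_actions, _ = P.shape
    Q = np.zeros((num_states, num_actions))   # arbitrary initialization
    for _ in range(iters):
        V = Q.max(axis=1)                     # V(s') = max_a' Q(s', a')
        Q = R + gamma * (P @ V)               # backup: r + gamma * E[V(s')]
    return Q
```

The greedy policy is then just `Q[s].argmax()` at each state; policy iteration differs in that it evaluates the current policy's Q function (a linear system, as noted above) before taking the greedy step.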
But in reinforcement learning, the usual assumption is that you don't know any of these probability distributions: you don't know the reward function, you don't know the transition probabilities, so all these things have to be estimated from data — or rather, you're only able to access the system through interaction. It turns out that these algorithms can also be implemented when you only access the system through interaction, which I think is kind of remarkable. Here's how it works. Recall our backup formulas for Q^pi and Q*. In both cases we have a certain quantity inside an expectation, and in both cases we can compute an unbiased estimator of that quantity using a single sample: if we sample some data from the system using any policy, we can get an unbiased estimate of the quantity inside the expectation on the right-hand side. So we can do an approximate, noisy version of the Bellman backup, and even with this noise it can be proven that if you choose your step sizes with the right schedule, you're still going to converge to Q^pi or Q*, depending on which algorithm you're implementing. That's pretty much the fundamental idea, and from it you can come up with algorithms that apply in the reinforcement learning setting, where you're just accessing the system through sampling, and you can also start introducing function approximation. So far I haven't said anything about what the Q function is; I've just told you it's a function of state and action.
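The sampled, noisy version of the Bellman optimality backup is the classic tabular Q-learning update; a minimal sketch (the learning rate `alpha` plays the role of the step size on the schedule mentioned above):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One sampled, noisy Bellman optimality backup (tabular Q-learning).

    Instead of the full expectation over next states, we use the single
    observed transition (s, a, r, s') and move Q(s, a) a small step
    toward the backed-up value.
    """
    target = r + gamma * Q[s_next].max()      # single-sample backup value
    Q[s, a] += alpha * (target - Q[s, a])     # step toward the target
    return Q
```

The SARSA variant differs only in the target: it uses `Q[s_next, a_next]` for the action the policy actually took, which makes it a sampled version of the B^pi backup instead.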
Now we can start having neural network Q functions, for example. We parameterize the Q function with a neural network — call it Q_theta. It no longer makes sense to do the Bellman backup exactly, because we can't just set the output values of the neural network; the best we can do is encourage the network to have certain output values. So instead we set up a least squares problem: we write down a quadratic objective saying that the Q function should be approximately equal to the backed-up values, and we minimize it with SGD. One version of this algorithm, introduced about ten years ago, is called neural fitted Q iteration, and it works exactly the way you'd expect: you sample trajectories using your current policy — which might be determined by the Q function, or could be any policy, as it turns out — and then you solve the least squares problem, minimizing the quadratic error based on the Bellman backup for Q*. One thing I haven't mentioned so far is what you actually use as your policy. If you have a Q function, you can turn it into a policy by taking the action with the highest Q value — that's what you typically do. The Q function measures the goodness of all your actions, so you can easily turn it into a policy by taking your best action, or by taking actions whose log probability is proportional to their goodness, or something like that: typically the probability is the exponential of the Q value over some temperature parameter, which is called Boltzmann exploration, whereas if you just take the argmax, that's called the greedy policy.
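The two ways of turning a Q function into a policy that were just mentioned can be sketched as follows (a minimal version; the temperature value is an illustrative assumption):

```python
import numpy as np

def greedy_action(q_values):
    """Greedy policy: take the action with the highest Q value."""
    return int(np.argmax(q_values))

def boltzmann_action(q_values, temperature=1.0, rng=None):
    """Boltzmann exploration: P(a) proportional to exp(Q(s, a) / temperature)."""
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(q_values) / temperature
    z = z - z.max()                            # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))
```

A high temperature makes the action choice nearly uniform (lots of exploration); as the temperature goes to zero, Boltzmann exploration approaches the greedy policy.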
It turns out that with these Q-learning algorithms you don't have to execute the greedy policy for learning to work; you actually have some freedom in which policy you execute. That's one very nice property of these algorithms: you can use an exploration technique, where your policy is actively trying to reach new states or do something new, and still converge towards Q* or Q^pi, as the case may be. So neural fitted Q iteration is a basic way of doing this. A more recent algorithm that has gotten a lot of attention is the one from Mnih et al. at DeepMind, which is basically an online version of this algorithm with a couple of useful tweaks. When you look at the two tricks, they make a lot of sense if you just think about what value iteration is doing. The first is the replay pool: a rolling history of your past data, which is the data you use to fit your Q function; that makes sure you have a representative sample of data to fit the Q function to. The second idea is the target network: instead of using your current Q function and doing Bellman backups on that, you keep a lagged version of your Q function — the target network, a copy of your Q function from some earlier time — and use that in the backups. If you think about value iteration, you have your old Q function and you're trying to make the new one equal to the backed-up version of the old one, so using a target network is just the natural thing to do if you're trying to implement value iteration in an online way. Many extensions have been proposed since then;
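The two DQN tricks just described — the replay pool and the target network — reduce to very little code in sketch form (this is my own minimal rendering, not the DeepMind implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Rolling history of past transitions; sampling uniformly from it gives
    a roughly representative batch of data to fit the Q function to."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions fall off the end

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def sync_target(q_params, target_params):
    """Target network: a lagged copy of the Q function used on the right-hand
    side of the backup, hard-copied every so often (parameters as flat lists)."""
    target_params[:] = q_params
```

Between syncs, the least squares targets `r + gamma * max_a' Q_target(s', a')` are computed against the frozen copy, which is what makes the online procedure look like one step of value iteration.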
I've got a bunch of citations at the bottom of the slide. This DQN algorithm uses the backup B, the backup for Q*. Remember that I also introduced the other backup B^pi, the backup for Q^pi. There's another, very classic algorithm called SARSA, which is essentially an online way of doing the B^pi backup — sort of an online version of policy iteration. It has actually been found to work as well as, or better than, using the B backup in some settings — not all settings, so I think the jury is still out on exactly how these things compare. But I think it's worth considering both policy iteration and value iteration, and all the different online versions of these algorithms, and taking them seriously, because it's not clear right now how they all compare to each other in the function approximation setting. Okay, that's the overview of all the technical parts; now I just have a couple of conclusion slides. Let me summarize the current state of affairs. I introduced two kinds of algorithms: policy gradient algorithms, which explicitly represent a policy and optimize it, and Q function learning algorithms, which explicitly represent a Q function — the goodness of different actions — and use it to implicitly represent a policy. With policy gradient methods there have been successes with several different variants. There are the vanilla policy gradient methods; there's a recent paper on the A3C method, an asynchronous implementation, which gets very good results; and there are the natural policy gradient methods and trust region methods. The video I showed you was obtained using trust region policy optimization, which is one of the methods in this second category.
I think these trust region methods and natural policy gradient methods are more sample efficient than the vanilla methods, because you end up doing more than one gradient update with each little bit of data you collect: with vanilla policy gradient you compute one gradient estimate and then throw the data away, whereas with natural policy gradient you solve a little optimization problem with it, so you get more juice out of it. That's what we have in the policy gradient world. In the Q function world we have the DQN algorithm and some of its relatives, which are descendants of value iteration — you're approximating the Bellman backup of value iteration — and then SARSA, which is related to policy iteration. These are dealing with different Bellman equations, so it's kind of interesting that both kinds of methods work and, as it turns out, have fairly similar behaviors. Here's how I would compare them — this is anecdotal evidence, but I think it's the consensus right now. The Q function methods are more sample efficient when they work, but they don't work as generally as policy gradient methods, and it's a little harder to figure out what's going on when they don't work. That kind of makes sense, because with the policy gradient methods you're optimizing exactly the thing you care about with gradient descent, whereas with Q function methods you're doing something indirect: you're trying to learn a Q function and then hoping it gives you a good policy. I'd also point out that there are some confounds, so it's hard to draw a good conclusion at this point, because people use different time horizons in the policy gradient methods
versus the Q function methods: they do one-step lookaheads with the Q functions and multi-step lookaheads with the policy gradients. So it's not clear whether the differences come from using different time horizons or from differences in how the algorithms work — doing regression for a Q function versus learning a policy with policy gradients. To summarize, here are some of our core model-free reinforcement learning algorithms. (Oops — I'm missing a word in the first column, which should say something like reliability and robustness; that just means: is it going to work on new problems without parameter tuning, or is it going to mysteriously either work or not work?) This is my slightly sloppy summary of all these different algorithms, and I'd say there's still some room for improvement. There might be improvements to the basic methods, because the Q function methods have some nice properties that we don't have in the policy gradient methods — you can easily explore with a different policy than the one you're learning the Q function for, which is really important and which you can't do so easily with policy gradient methods — whereas the policy gradient methods seem like you can just apply them and they're more likely to work, and it's well understood what's going on. I don't know if it's possible to get the best of both worlds, but that's the hope. And that's it for my talk — thank you. Any questions? Oh yes: in model-based reinforcement learning, which lines of research do I find most interesting? I think the work from my colleagues on guided policy search is very nice — I'd say that's a kind of model-based reinforcement learning. I also like some methods that are using the model for faster
learning, like for variance reduction; there's a paper called Stochastic Value Gradients that I like a lot. I think it's a pretty wide-open area, so I don't think there have been a lot of really compelling results where you're able to learn extremely fast, with much better sample efficiency, using a model. It seems like that should be possible, but I don't think it's been demonstrated yet, so maybe in the next couple of years we'll see that happen. Hello — hi, thanks for the talk. I have a question: is it true or not true that most of these problems require some kind of simulated world to run the experiments, the episodes, in? Oh yeah — are you asking whether this works in the real world, is that the question? I would say it does work if you have a lot of patience and you're willing to execute for a while. The locomotion results I showed add up to about two weeks of real time, so it's actually not that bad, especially when you consider that babies and toddlers take a while to learn how to walk properly, even though evolution already puts in a lot of built-in information. So I'd say it can be run in the real world; some of my colleagues at Berkeley are doing experiments where they're running regular reinforcement learning algorithms in the real world — very brave — and I hope to see some nice results from that soon. Thank you. Hi, thanks for your talk, here on the other side. I was wondering what your intuition is on the loss surface of these deep reinforcement learning optimization problems, and maybe especially how it evolves as the policy learns; I should specify, in the policy gradient case. So I think the situation is a little bit different in reinforcement learning from supervised learning. In reinforcement learning, you have one kind of local minimum
in policy space. For example — I'm going to keep going back to the locomotion example because I spent a lot of time on it — let's say you want your robot to walk. There's one local minimum where it just stands and doesn't bother to walk, because there's too much penalty for falling over, and there's another local minimum where it just dives forward, because it gets a little bit of reward for that before it falls to its doom. So I think the hard part of the optimization problem actually comes from the space of behaviors and has nothing to do with the neural network. I've also found that it matters surprisingly little what kind of neural network architecture you use, because I think most of the hardness and the weirdness of the problem comes from what the behavior space looks like rather than what the actual numerical optimization landscape looks like. Cool, thank you. So there are many problems where the reward is only observed at the end of the task — in the terminal state of each episode — and you don't see rewards in intermediate states. How much harder do these problems become for deep reinforcement learning, in your experience? Thanks. Yeah, if you don't get the reward until the end, then it's probably going to be harder to learn. I don't have anything precise to say about that; I think it's going to be harder if your rewards are further away. So for example, in your video, for the last example of getting up and getting the head above a certain height — that could be one where you only get a plus one if you're above and you don't get anything below. Are you doing something where, if you get your head higher, then you still get
something partial? Yeah, so I think we came up with a reward like distance-from-height squared, which made the problem easier. The problem would have been a lot harder if you got zero reward until you got your head above the height; that would be a problem of exploration, where you have to explore all the different states to figure out where you're going to get good reward. Thanks. Okay, one last question. Okay — so I have a question about how you choose to discretize your space and time, because in your locomotion example you clearly have a continuous system, right? Yeah, so it's actually really important how you discretize time — what time step you use — because the algorithm does care about what the time step is: you have discount factors, and you're also sampling a different action at every time step. So if you choose too small a time step, then the rewards will be delayed by more time steps, which makes the credit assignment harder, and also your exploration will be more like a random walk, because you're changing your mind really frequently. So yeah, the time step is pretty important, and I'd say that's a flaw in current methods. Okay, thank you. [Applause] Thank you. So let's take a short break; we reconvene in 15 minutes.
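The off-policy property mentioned in the summary — exploring with one policy while learning the Q function of the greedy policy — can be sketched with tabular Q-learning on a toy chain environment. Everything here (the 5-state chain, the constants) is made up for illustration and is not from the talk:

```python
import numpy as np

# Toy 5-state chain: action 1 moves right, action 0 moves left;
# reaching the last state gives reward 1 and ends the episode.
n_states, n_actions = 5, 2

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = (s2 == n_states - 1)
    return s2, (1.0 if done else 0.0), done

rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))
gamma, alpha, eps = 0.9, 0.5, 0.3

for episode in range(200):
    s, done = 0, False
    while not done:
        # Behavior policy: epsilon-greedy, so it keeps exploring.
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # Off-policy target: max over next actions (greedy policy),
        # regardless of what the behavior policy will actually do next.
        target = r + (0.0 if done else gamma * np.max(Q[s2]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2

print(np.argmax(Q[:-1], axis=1))  # greedy policy: move right in every non-terminal state
```

The `max` in the target is what makes the update off-policy: the learned Q function is for the greedy policy even though the data is collected by the epsilon-greedy one.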
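Two points from the last questions can be made concrete in a few lines: a shaped reward like the distance-from-height-squared mentioned above versus a sparse indicator reward, and how the per-step discount has to depend on the time step if you want a fixed horizon in real time. The numbers here (target height 1.2, a 0.9 per-second discount) are invented for the example:

```python
# 1) Reward shaping: the sparse reward is zero until the goal is reached,
#    while the shaped reward gives a useful learning signal everywhere.
def sparse_reward(head_height, target=1.2):
    return 1.0 if head_height >= target else 0.0

def shaped_reward(head_height, target=1.2):
    return -(head_height - target) ** 2

# 2) Time-step choice interacts with discounting: to keep the same
#    horizon measured in seconds, the per-step discount must change
#    with the control time step dt.
gamma_per_second = 0.9
for dt in (0.1, 0.02, 0.005):                # seconds per control step
    gamma_per_step = gamma_per_second ** dt  # so gamma_per_step ** (1/dt) == gamma_per_second
    print(f"dt={dt}: gamma per step = {gamma_per_step:.5f}")
```

Note how the per-step discount approaches 1 as dt shrinks: with a very small time step, a reward one second away is spread over many more steps, which is the credit-assignment difficulty described in the answer.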