Today we will talk about deep reinforcement learning. The question we'd like to explore is to what degree we can teach systems to perceive and to act in this world from data.

So let's take a step back and think about the full range of tasks that an artificial intelligence system needs to accomplish. Here's the stack, from top to bottom, from input to output. At the top is the environment, the world that the agent is operating in. It is sensed by sensors, which take in the world outside and convert it into raw data interpretable by machines: sensor data. From that raw sensor data you extract features, you extract structure, such that you can make sense of the data, discriminate, separate, understand it. And, as we discussed, you form higher and higher order representations, a hierarchy of representations, on top of which machine learning techniques can be applied. Once the machine learning techniques convert the data into features, into higher order representations, and into simple, actionable, useful information, we aggregate that information into knowledge: we take the pieces of knowledge extracted from the data through machine learning and build a taxonomy, a library of knowledge. With that knowledge, an agent has to reason — to aggregate, to connect pieces of data it has seen in the recent past or the distant past, to make sense of the world it's operating in — and finally to make a plan for how to act in that world based on its objectives, based on what it wants to accomplish. As I mentioned, a simple but commonly accepted definition of intelligence is the ability to accomplish complex goals. So a system operating in this world must have a goal, an objective function, a reward function, and based on that it forms a plan and takes action. And because in many cases it operates in the physical world, it must have tools, effectors, with which it applies those actions to change something about the world. That's the full stack of an artificial intelligence system that acts in the world.

The question is: what kinds of tasks can such a system take on? What kinds of tasks can an artificial intelligence system learn, as we understand AI today? We will talk about the advancement of deep reinforcement learning approaches and some of the fascinating ways they're able to take much of this stack and treat it as an end-to-end learning problem. But we look at games, we look at simple, formalized worlds. While these are still impressive, beautiful, unprecedented accomplishments, they are nevertheless formal tasks. Can we then move beyond games, into expert tasks of medical diagnosis and design, into natural language, and finally into the human-level tasks of emotion, imagination, consciousness?

Let's once again review the stack in practical terms, in the tools we have. The input, for robots operating in the world — from cars to humanoid robots to drones — is lidar, camera, radar, GPS, stereo cameras, audio from microphones, networking for communication, and the various ways to measure kinematics with an IMU. The raw sensory data is then processed: features are formed, then representations, and multiple higher and higher order representations. That's what deep learning gets us. Before the recent successes of neural networks that go deeper, and are therefore able to form higher order representations of the data, that work was done by human experts. Today, networks are able to do that.
That's the representation piece. On top of the representation piece, the final layers of these networks are able to accomplish the supervised learning tasks, the generative tasks, and the unsupervised clustering tasks — that's what we talked about a little in lecture one, and we'll continue tomorrow and Wednesday. You can think about the output of those networks as simple, clean, useful, valuable information: that's the knowledge. That knowledge can be in the form of single numbers — regression over continuous variables — or a sequence of numbers; it can be images, audio, sentences, text, speech.

Once that knowledge is extracted and aggregated, how do we connect it in a multi-resolution way, form hierarchies of ideas, connect ideas? The trivial, silly example is connecting images, activity recognition, and audio: if it looks like a duck, quacks like a duck, and swims like a duck — we do not currently have approaches that effectively integrate this information to produce a higher-confidence estimate that it is in fact a duck. And the planning piece — the task of taking the sensory information, fusing it, and making action, control, and longer-term plans based on that information — is, as we'll discuss today, more and more amenable to the deep learning approach. But to date the most successful approaches have been non-learning, optimization-based ones, as with several of the guest speakers we have, including the creators of this robot, Atlas, at Boston Dynamics.

So the question is: how much of the stack can be learned, end to end, from the input to the output? We know we can learn the mapping from raw data to representation, and from representation to knowledge — even with kernel methods like SVMs, and certainly with neural networks. Mapping from raw sensory data to knowledge is where the primary success of machine learning over the past three decades has been; the automated representation learning of deep learning goes straight from raw data to knowledge. The open question, for us today and beyond, is whether we can expand the red box there — what can be learned end to end — from sensory data all the way to reasoning: aggregating, forming higher representations of the extracted knowledge, forming plans, and acting in this world from the raw sensory data. We will show the incredible fact that we're able to learn exactly what's shown here, end to end, with deep reinforcement learning, on trivial tasks, in a generalizable way. The question is whether that can then move on to the real-world tasks of autonomous vehicles, of humanoid robotics, and so on. That's the open question.

So today, let's talk about reinforcement learning. There are three types of machine learning. Supervised and unsupervised are the categories at the extremes, relative to the amount of human input that's required: for supervised learning, every piece of data that's used for teaching these systems is first labeled by human beings; for unsupervised learning, on the right, no data is labeled by human beings. In between is some sparse input from humans: semi-supervised learning is when only part of the ground truth is provided by humans, and the rest must be inferred, generalized, by the system. That's where reinforcement learning falls. Reinforcement learning is shown there with the cats — as I said, every successful presentation must include cats. These are supposed to be Pavlov's cats:
every time they ring a bell, they're given food, and they learn this process. The goal of reinforcement learning is to learn from sparse reward data, from sparse supervision, and to take advantage of the fact that, in simulation or in the real world, there is a temporal consistency to the world: there is a temporal dynamics that carries from state to state through time. So you can propagate information even when the supervision, the ground truth you receive, is sparse; you can follow that information back through time to infer something about what happened earlier, even if your reward signals were weak. It uses the fact that the physical world evolves through time in a somewhat predictable way to take sparse information and generalize it over the entirety of the experience being learned.

We apply this to two problems today. First, DeepTraffic, whose methodology is deep reinforcement learning. DeepTraffic is a competition that we ran last year and expanded significantly this year. I'll talk about some of the details, and how the folks in this room — on your smartphone today, or if you have a laptop — can be training an agent while I'm talking: training a neural network in the browser. Some of the things we've added: we've now turned it into a multi-agent deep reinforcement learning problem, where you can control up to ten cars with your network. Perhaps less significant, but pretty cool, is the ability to customize the way the agent looks: you can upload different images instead of the car — and people have already begun doing so, to an absurd degree; shown here is a SpaceX rocket — as long as the image maintains the dimensions. The competition is hosted at selfdrivingcars.mit.edu/deeptraffic — we'll return to this later. The code is on GitHub, with some more information and starter code, and a paper describing some of the fundamental insights that will help you win this competition is on arXiv.

So, from supervised learning in lecture one to today: supervised learning we can think of as memorization of ground truth data, in order to form representations that generalize from that ground truth. Reinforcement learning we can think of as a way to brute-force propagate that sparse information through time — to assign quality, reward, to states that do not directly have a reward — to make sense of this world when the rewards are sparse but connected through time. You can think of that as reasoning.

This through-time process is modeled in most reinforcement learning approaches very simply: there's an agent taking an action in a state and receiving a reward. The agent, operating in an environment, executes an action, observes a new state, and receives a reward, and this process continues over and over. As examples, we can think of any of the video games, some of which we'll talk about today. In Atari Breakout, the game is the environment and the agent is the paddle; each action the agent takes has an influence on the evolution of the environment, and success is measured by some reward mechanism — in this case, points given by the game. Every game has a different point scheme that must be converted, normalized, into a form the system can interpret, and the goal is to maximize those points, maximize the reward.
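That loop is simple to write down. Here's a minimal sketch in JavaScript — the environment and its reset/step interface are hypothetical stand-ins, not any particular library's API — just to make the state-action-reward cycle concrete:

```javascript
// A toy environment: the agent walks along a line; reaching x = +5 gives reward 1,
// falling to x = -5 ends the episode with reward 0. Interface is illustrative only.
function makeEnv() {
  let x = 0;
  return {
    reset() { x = 0; return x; },
    step(action) {                          // action is -1 (left) or +1 (right)
      x += action;
      const done = (x === 5 || x === -5);
      const reward = (x === 5) ? 1 : 0;     // sparse reward: only at the goal
      return { state: x, reward, done };
    }
  };
}

// The agent-environment loop described above: observe state, act, receive reward.
const env = makeEnv();
let state = env.reset();
let done = false;
let ret = 0;
while (!done) {
  const action = Math.random() < 0.5 ? -1 : 1;   // a random policy, for now
  const result = env.step(action);
  state = result.state;
  ret += result.reward;
  done = result.done;
}
console.log('episode return:', ret);
```

The reward here is sparse — it arrives only at the goal — which is exactly the setting reinforcement learning has to cope with.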
Another example is the continuous problem of cart-pole balancing. The goal is to balance a pole on top of a moving cart. The state is the angle of the pole, its angular speed, the cart's position, and its horizontal velocity; the actions are the horizontal forces applied to the cart; and the reward is 1 at each time step that the pole is still upright.

Then there are the first-person shooters and, now, StarCraft, the strategy games. In the case of a first-person shooter like Doom, the environment is the game and the goal is to eliminate all opponents; the state is the raw game pixels coming in; the actions are moving up, down, left, right, and so on; and the reward is positive when eliminating an opponent and negative when the agent is eliminated. In industrial robotics — bin packing with a robotic arm — the goal is to pick up a device from a box and put it into a container. The state is the raw pixels of the real world that the robot observes; the actions are the possible movements of the robot, moving the different actuators through their degrees of freedom to realize a position of the arm; and the reward is positive when placing a device successfully and negative otherwise.

Everything here can be modeled as a Markov decision process: there's a state s0, an action a0, a reward is received, a new state s1 is reached — and again action, reward, state, action, reward, state, until a terminal state is reached. The major components of reinforcement learning are a policy — a plan of what action to perform in every single state; a value function — some sense of how good a state is to be in, or how good an action is to take in a state; and sometimes a model, with which the agent represents the environment — some sense of the dynamics of the environment it's operating in, which is useful for making decisions about actions.

Let's take a trivial example: a grid world of three by four, twelve squares. We start at the bottom left and are tasked with walking about this world to maximize reward. The reward at the top right is +1, the square below that is -1, and every step you take is a punishment, a negative reward, of -0.04. So what is the optimal policy in this world? When everything is deterministic, perhaps this is the policy when you start at the bottom left: because every step hurts, every step has a negative reward, you want to take the shortest path to the square with the maximum reward. When the state space is non-deterministic — as presented before, with probability 0.8 when you choose to go up you go up, but with probability 0.1 you go left and 0.1 you go right; unfair, again, much like life — that would be the optimal policy. What is the key observation here? That every single state in the space must have a plan, because given the non-deterministic aspect of the control you can't control where you're going to end up, so you must have a plan for every place. That's the policy: an optimal action to take in every single state.

Now suppose we change the reward structure, and every step we take has a reward of -2, so it really hurts — there's a high punishment for every single step. Then no matter what, we always take the shortest path to the only spot on the board that doesn't result in punishment. If we decrease the punishment of each step to -0.1, the policy changes: some extra degree of wandering is encouraged. As we lower the punishment further, back to -0.04, more and more wandering is allowed. And when we finally make the step reward positive, so that every step increases the reward, there's a significant incentive to stay on the board without ever reaching the destination — kind of like college, for a lot of people.
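To make this concrete, here's a standard value-iteration sketch in JavaScript for the grid world as described — +1 and -1 terminals on the right, a step reward of -0.04, and the 0.8/0.1/0.1 noisy control. It's a textbook computation under those assumptions (no blocked cells, a discount of 1, a fixed iteration count), not code from the lecture:

```javascript
// Value iteration on the 3x4 grid world described above.
const ROWS = 3, COLS = 4;
const STEP_REWARD = -0.04;
const GAMMA = 1.0;                           // no discounting in this small episodic world
const terminals = { '0,3': +1, '1,3': -1 };  // row 0 is the top row

const actions = {
  up:    { dr: -1, dc: 0, perp: ['left', 'right'] },
  down:  { dr: +1, dc: 0, perp: ['left', 'right'] },
  left:  { dr: 0, dc: -1, perp: ['up', 'down'] },
  right: { dr: 0, dc: +1, perp: ['up', 'down'] },
};

// Moving off the board leaves you where you are.
function move(r, c, a) {
  const nr = r + actions[a].dr, nc = c + actions[a].dc;
  return (nr < 0 || nr >= ROWS || nc < 0 || nc >= COLS) ? [r, c] : [nr, nc];
}

// Expected value of taking action a in (r, c): 0.8 intended, 0.1 each perpendicular.
function qValue(V, r, c, a) {
  const outcomes = [[a, 0.8], [actions[a].perp[0], 0.1], [actions[a].perp[1], 0.1]];
  let q = 0;
  for (const [dir, p] of outcomes) {
    const [nr, nc] = move(r, c, dir);
    q += p * (STEP_REWARD + GAMMA * V[nr][nc]);
  }
  return q;
}

let V = Array.from({ length: ROWS }, () => Array(COLS).fill(0));
for (let iter = 0; iter < 100; iter++) {
  const next = V.map(row => row.slice());
  for (let r = 0; r < ROWS; r++)
    for (let c = 0; c < COLS; c++) {
      const t = terminals[`${r},${c}`];
      if (t !== undefined) { next[r][c] = t; continue; }   // terminals keep their reward
      next[r][c] = Math.max(...Object.keys(actions).map(a => qValue(V, r, c, a)));
    }
  V = next;
}

for (let r = 0; r < ROWS; r++)
  console.log(V[r].map(v => v.toFixed(2)).join('  '));
```

Changing STEP_REWARD to -2, -0.1, or a positive value and re-running is exactly the experiment described above: the argmax action in each cell shifts from shortest-path urgency to wandering to never leaving.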
So, the value function. The way we think about the value of a state, or of anything in the environment, is the reward we're likely to receive in the future — where we discount future rewards, because we can't always count on them. The discount factor gamma, applied further and further out into the future, decreases the importance of the rewards received there. The strategy is to take the sum of these discounted rewards and maximize it. That's what reinforcement learning hopes to achieve.

With Q-learning, we use any policy to estimate the value of taking an action in a state — it's off-policy. We move about the world and use the Bellman equation, here on the bottom, to continuously update our estimate of how good a certain action is in a certain state. This allows us to operate in a much larger state space and a much larger action space: we move about this world, through simulation or in the real world, taking actions and updating our estimates of how good certain actions are over time. On the left is the updated value; the old estimate is the starting value for the equation, and we update that old estimate with the reward received by taking action a in state s, plus the maximum reward that's possible in the following state, discounted. That update is scaled by a learning rate: the higher the learning rate, the faster we learn — the more value we assign to new information. That's it — that's Q-learning. This simple update rule allows us to explore the world and, as we explore, get more and more information about what's good to do in this world.

And there's always a balance, in the various problem spaces we'll discuss, between exploration and exploitation. As you form a better and better estimate of the Q-function — of which actions are good to take — you start to get a sense of the best action to take. But it's not a perfect sense; it's still an approximation, so there's value in exploration. The better your estimate becomes, though, the less benefit exploration has. So usually we want to explore a lot in the beginning and less and less towards the end, and when we finally release the system out into the world to operate at its best, we have it operate as a greedy system, always taking the optimal action according to the Q-value function. Everything I'm talking about now is parametrized, and these are parameters that are very important for winning the DeepTraffic competition, which uses this very algorithm with a neural network at its core.

For a simple table representation of a Q-function — where the rows are states s1 through s4 and the columns are actions a1 through a4 — we can think of this table as randomly initialized, or initialized in any way that's not representative of actual reality. As we move about this world and take actions, we update this table with the Bellman equation shown up top. The slides, which are now online, show a simple pseudocode algorithm for how to run this Bellman equation, and over time the approximation becomes the optimal Q-table.
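As a concrete version of that pseudocode: the update rule is Q(s,a) ← Q(s,a) + α·(r + γ·max_a′ Q(s′,a′) − Q(s,a)), and here it is in JavaScript on a hypothetical toy chain world (walk right to reach a reward), with epsilon-greedy exploration that decays over episodes — a sketch, not the slide's exact code:

```javascript
// Toy chain world: states 0..5, start at 0; action 0 moves left, action 1 moves
// right; reaching state 5 gives reward 1 and ends the episode.
const N_STATES = 6, N_ACTIONS = 2;
const ALPHA = 0.1, GAMMA = 0.9;
let epsilon = 1.0;                                   // exploration rate, decayed below

function step(s, a) {
  const s2 = Math.max(0, Math.min(N_STATES - 1, s + (a === 1 ? 1 : -1)));
  const done = (s2 === N_STATES - 1);
  return { s2, r: done ? 1 : 0, done };
}

// Q-table, initialized to zeros (any non-informative initialization works).
const Q = Array.from({ length: N_STATES }, () => Array(N_ACTIONS).fill(0));
const argmax = row => (row[1] > row[0] ? 1 : 0);

for (let episode = 0; episode < 500; episode++) {
  let s = 0, done = false;
  while (!done) {
    // Epsilon-greedy: explore with probability epsilon, otherwise exploit.
    const a = Math.random() < epsilon
      ? Math.floor(Math.random() * N_ACTIONS)
      : argmax(Q[s]);
    const { s2, r, done: d } = step(s, a);
    // Bellman update: nudge Q(s,a) toward r + gamma * max_a' Q(s', a').
    const target = r + (d ? 0 : GAMMA * Math.max(...Q[s2]));
    Q[s][a] += ALPHA * (target - Q[s][a]);
    s = s2; done = d;
  }
  epsilon = Math.max(0.05, epsilon * 0.99);          // explore a lot early, less later
}
console.log(Q);   // Q[s][1] (go right) should dominate in every state
```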
The problem is that the Q-table becomes exponential in size when we take in raw sensory information, as we do with cameras in DeepCrash, or in DeepTraffic, which takes in the raw cells of the full grid space. When you take the arcade games, they're taking in the raw pixels of the game, and when you take the game of Go, it's taking in the raw state of the board as the input. The potential state space — the number of possible states — is extremely large: larger than we can hold in memory, and larger than we could ever accurately approximate through the simple Bellman update over time, through simulation.

This is where deep reinforcement learning comes in. Neural networks are really good approximators; they're really good at exactly this task of learning this kind of Q-function. So, as we started with supervised learning, where neural networks helped us memorize patterns using supervised ground truth data, we now move to reinforcement learning, which hopes to propagate outcomes into knowledge. Deep learning allows us to do so on much larger state spaces and much larger action spaces, which means it's generalizable: it's much more capable of dealing with the raw stuff of sensory data, and therefore with the broad variation of real-world applications. It does so because it's able to learn representations, as we discussed on Monday: the understanding comes from converting the raw sensory information into simple, useful information, based on which the action in a particular state can be taken.

So instead of the Q-table, instead of this Q-function, we plug in a neural network, where the input is the state, no matter how complex, and the output is a value for each of the actions you could take. Input is the state; output is the value of each action. It's simple. This is the Deep Q-Network, DQN, at the core of the success of DeepMind — a lot of the cool stuff you see about AI playing video games is DQN, or variants of DQN, at play. This was first shown in a Nature paper from DeepMind, where the success came from playing different games, including the Atari games.

How are these things trained? Very similarly to supervised learning. The Bellman equation, up top, takes the reward and the discounted expected reward from future states. The loss function for the neural network — and a neural network learns with a loss function — takes the reward received at the current state, does a forward pass through the network to estimate the value of the best action in the future state, and subtracts from that the forward pass through the network for the current state and action. So you take the difference between what your Q estimator, the neural network, believes the value of the current state is, and what it more likely is to be, based on the value of the future states reachable through the actions you can take. Here's the algorithm: the input is the state and the output is the Q-value for each action — or, in this diagram, the input is the state and an action and the output is that Q-value; they're very similar architectures. Given a transition (s, a, r, s′) — being in state s, taking action a, receiving reward r, and reaching state s′ — the update is: do a feed-forward pass through the network for the current state, then do a feed-forward pass for each of the possible actions in the next state; that's how we compute the two parts of the loss function; then update the weights using backpropagation.
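Here's the shape of that loss in code — a sketch that swaps the deep network for a linear function approximator, plainly so the gradient step fits in a few lines; the squared TD error it minimizes, (r + γ·max_a′ Q(s′,a′) − Q(s,a))², is the one on the slide:

```javascript
// Q(s, a) = w[a] . s  — one weight vector per action, standing in for a deep net.
const N_FEATURES = 4, N_ACTIONS = 3;
const GAMMA = 0.99, LEARNING_RATE = 0.01;
const w = Array.from({ length: N_ACTIONS },
                     () => Array.from({ length: N_FEATURES }, () => Math.random() * 0.01));

const dot = (u, v) => u.reduce((acc, x, i) => acc + x * v[i], 0);
const qValues = s => w.map(wa => dot(wa, s));        // forward pass: one Q per action

// One gradient step on a single transition (s, a, r, s2, done).
function dqnUpdate(s, a, r, s2, done) {
  // Target: reward plus discounted best Q in the next state (the Bellman part).
  const target = r + (done ? 0 : GAMMA * Math.max(...qValues(s2)));
  const tdError = target - qValues(s)[a];            // what we believe vs. what it should be
  // Squared-error gradient for a linear model: move w[a] along tdError * s.
  for (let i = 0; i < N_FEATURES; i++) w[a][i] += LEARNING_RATE * tdError * s[i];
  return tdError * tdError;                          // the loss, if you want to track it
}

// Example transition (all values illustrative):
console.log(dqnUpdate([0.1, 0.5, -0.2, 1.0], 2, 1.0, [0.0, 0.4, -0.1, 0.9], false));
```

With a deep network, only the forward pass and the gradient computation change; the target and the TD error are exactly the same.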
Again: loss function and backpropagation — that's how the network is trained. This has actually been around for much longer than DeepMind; a few tricks made it really work. Experience replay is the biggest one. As the games are played through simulation — or, if it's a physical system, as it acts in the world — the system collects its observations into a library of experiences, and training is performed by randomly sampling batches from that library of past experiences. So you're not always training on the natural, continuous evolution of the system; you're training on randomly picked batches of those experiences. It seems like a subtle trick, but it's a really important one: it keeps the system from overfitting to a particular evolution of the game, of the simulation.

Another important and again subtle trick — in a lot of deep learning approaches, the subtle tricks make all the difference — is fixing the target network used in the loss function. If you notice, you have to use the same neural network, the DQN, to estimate the value of both the current state-action pair and the next one; you're using it multiple times. And as you perform that operation you're updating the network, which means the target inside the loss function is always changing: by its very nature, your loss function is changing all the time as you're learning, and that's a big problem for stability — it can create big problems for the learning process. The trick is to fix the network used to compute the target and only update it every, say, thousand steps. As you train the network, the copy that computes the target inside the loss function stays fixed, which produces a more stable loss computation: the ground doesn't shift under you as you're trying to find a minimum, and the loss function doesn't change in unpredictable, difficult-to-understand ways.

Then there's reward clipping, which comes up whenever a system seeks to operate in a generalized way. Across these various games the points are different — some are low, some are high, some go positive and negative — so they're all normalized, such that positive points become +1 and negative points become -1. That's reward clipping: it simplifies the reward structure. And because a lot of the games run at 30 or 60 FPS, and it's not valuable to take actions at such a high rate — particularly in these Atari games — you only take an action every four frames, while still taking in all the frames as part of the temporal window used to make decisions.

These are tricks, but hopefully this gives you a sense of the kind of things necessary both for seminal papers like this one and for the more important accomplishment of winning DeepTraffic: the tricks make all the difference. In the table on the bottom, a circle means the technique is used and an X means it's not, for the fixed target network and experience replay, across Breakout, River Raid, Seaquest, and Space Invaders — the higher the number, the more points achieved. It gives you a sense that replay and the fixed target together yield significant improvements in the performance of the system: an order of magnitude — two orders of magnitude for Breakout.
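In code, those tricks might look like this — a sketch, not the paper's implementation: a fixed-capacity replay memory sampled uniformly at random, a target-weight copy refreshed only every so many steps, and reward clipping:

```javascript
// Experience replay: store transitions, sample random minibatches for training.
class ReplayMemory {
  constructor(capacity) { this.capacity = capacity; this.data = []; this.next = 0; }
  push(transition) {                     // transition: { s, a, r, s2, done }
    this.data[this.next] = transition;   // overwrite the oldest once full (ring buffer)
    this.next = (this.next + 1) % this.capacity;
  }
  sample(batchSize) {                    // assumes the memory is non-empty
    const batch = [];
    for (let i = 0; i < batchSize; i++)
      batch.push(this.data[Math.floor(Math.random() * this.data.length)]);
    return batch;
  }
}

// Fixed target network: the weights used for the Bellman target are a frozen
// copy, refreshed only every `syncEvery` training steps so the target stays stable.
// Weights here are per-action rows of numbers, as in the linear sketch above.
function maybeSyncTarget(onlineWeights, targetWeights, step, syncEvery) {
  if (step % syncEvery === 0)
    for (let a = 0; a < onlineWeights.length; a++)
      targetWeights[a] = onlineWeights[a].slice();
  return targetWeights;
}

// Reward clipping, as described: all positive rewards become +1, negative -1.
const clipReward = r => Math.sign(r);
```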
Here is pseudocode for implementing DQN learning. The key thing to notice — and you can look at the slides — is that the while loop of playing through the games and selecting actions is not, itself, the training: it saves the observations — state, action, reward, next state — into replay memory, into that library, and then you sample randomly from that replay memory to train the network on the loss function. And up top, with probability epsilon, you select a random action. That epsilon is the probability of exploration, and it decreases over time through the training process — that's something you'll see in DeepTraffic as well: you want to explore a lot at first, and less and less over time.
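Here's that loop as a structural sketch in JavaScript — the network, trainer, and environment are stubs standing in for the pieces sketched earlier; the point is where acting, storing, sampling, and the epsilon decay sit relative to each other:

```javascript
// Structural sketch of the DQN loop: act epsilon-greedily, store the transition,
// then train on a random minibatch from replay memory. Stubs stand in for the
// real network and environment.
const qForward = s => [0, 0, 0, 0, 0];            // stub: one Q-value per action
const trainOnBatch = batch => {};                 // stub: one gradient step on the TD loss
const env = { reset: () => 0, step: a => ({ s2: 0, r: 0, done: Math.random() < 0.01 }) };
const memory = [];                                // stands in for ReplayMemory above

let epsilon = 1.0;
const EPSILON_MIN = 0.05, EPSILON_DECAY = 0.999, BATCH = 32, N_ACTIONS = 5;

for (let episode = 0; episode < 100; episode++) {
  let s = env.reset(), done = false;
  while (!done) {
    // With probability epsilon, a random action; otherwise the greedy one.
    const q = qForward(s);
    const a = Math.random() < epsilon
      ? Math.floor(Math.random() * N_ACTIONS)
      : q.indexOf(Math.max(...q));
    const { s2, r, done: d } = env.step(a);
    memory.push({ s, a, r, s2, done: d });        // playing only *stores* experience...
    if (memory.length >= BATCH) {                 // ...training samples it at random
      const batch = Array.from({ length: BATCH },
        () => memory[Math.floor(Math.random() * memory.length)]);
      trainOnBatch(batch);
    }
    s = s2; done = d;
    epsilon = Math.max(EPSILON_MIN, epsilon * EPSILON_DECAY);  // explore less over time
  }
}
```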
This algorithm has been able to accomplish, in 2015 and since, a lot of incredible things — things that made the AI world think we were onto something, that general AI was within reach: for the first time, raw sensor information was used to create a system that acts in the world and makes enough sense of the physics of that world to succeed in it, from very little information. But these games are trivial, even though there are a lot of them. This DQN approach has been able to outperform human-level performance on a lot of the Atari games — that's what's been reported on — but again, these games are trivial.

What I think — and perhaps I'm biased — is one of the greatest accomplishments of artificial intelligence in the last decade, at least from the philosophical or research perspective, is AlphaGo Zero: first AlphaGo, and then AlphaGo Zero, DeepMind's systems that beat the best in the world at the game of Go. So what's the game of Go? I won't get into the rules, but basically it's played on a 19-by-19 board, shown in the bottom row of the table on the slide. For a board of 19 by 19, the number of legal game positions is about 2 times 10 to the power of 170. That's a very large number of possible positions to consider at any one time, and especially as the game evolves, the number of possible moves is huge — much larger than in chess. That's why the AI community thought this game was not solvable — until 2016, when AlphaGo used human expert positions to seed, in a supervised way, a reinforcement learning approach — I'll describe it in a little bit of detail in a couple of slides — and beat the best in the world. And then AlphaGo Zero — to me, the accomplishment of the decade in AI — was able to play with no training data on human expert games and beat the best in the world at an extremely complex game. This is not Atari: it's a much higher order of difficulty, and the quality of players it's competing against is much higher. It was able, extremely quickly, to achieve a rating better than AlphaGo, better than the different variants of AlphaGo, and certainly better than the best of the human players, in 21 days of self-play.

So how does it work? All of these approaches — much like the previous, traditional ones not based on deep learning — use Monte Carlo tree search, MCTS. When you have such a large state space, you start at a board position and you play, choosing moves with some exploitation-exploration balancing — choosing to explore totally new positions, or to go deep into the positions you know are good — until the bottom of the game is reached, the final state, and then you backpropagate the quality of the choices that led to that position. In that way you learn the value of board positions and of play. That's been used by the most successful Go-playing engines before AlphaGo and since. But you might be able to guess the difference between AlphaGo and the previous approaches: it uses a neural network as the "intuition", quote-unquote, for which next board positions are the good ones to explore.

And the key things — again, the tricks make all the difference — that made AlphaGo Zero work, and work much better than AlphaGo, are these. First, because there was no expert play, instead of human games AlphaGo Zero used that very same Monte Carlo tree search to do an intelligent look-ahead, based on the neural network's prediction of which states are good, and checked how good those states indeed are. It's a simple look-ahead that provides the ground truth, the target correction, that produces the loss function. The second part is what's now called multi-task learning: the network is, quote-unquote, two-headed, in the sense that it outputs both the probability of which move to take — the obvious thing — and a probability of winning. There are a few ways to combine that information and continuously train both parts of the network depending on the choice taken: you want to take the best choice in the short term while reaching the positions with the highest likelihood of winning for the player whose turn it is. And another big step is that they updated from the 2015 architecture to the state of the art — the architecture that won ImageNet: residual networks, ResNet. That's it — and those little changes made all the difference.
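Structurally, that two-headed network can be sketched in a few lines: a shared trunk feeding a policy head (a distribution over moves) and a value head (a single score in [-1, 1] standing for the chance of winning). The sizes and the plain-JavaScript forward pass below are purely illustrative — the real thing is a deep residual network:

```javascript
// A tiny "two-headed" network: shared trunk, then a policy head and a value head.
const N_INPUT = 8, N_HIDDEN = 16, N_MOVES = 4;   // toy sizes, not Go-sized
const rand = () => Math.random() * 0.1 - 0.05;
const mat = (rows, cols) => Array.from({ length: rows },
                                       () => Array.from({ length: cols }, rand));
const Wtrunk = mat(N_HIDDEN, N_INPUT);
const Wpolicy = mat(N_MOVES, N_HIDDEN);
const Wvalue = mat(1, N_HIDDEN);

const dot = (w, x) => w.reduce((acc, wi, i) => acc + wi * x[i], 0);
const relu = v => v.map(x => Math.max(0, x));
const softmax = v => {
  const m = Math.max(...v);
  const e = v.map(x => Math.exp(x - m));
  const z = e.reduce((a, b) => a + b, 0);
  return e.map(x => x / z);
};

function forward(boardFeatures) {
  const h = relu(Wtrunk.map(row => dot(row, boardFeatures)));  // shared representation
  return {
    policy: softmax(Wpolicy.map(row => dot(row, h))),  // head 1: probability of each move
    value: Math.tanh(dot(Wvalue[0], h)),               // head 2: estimated chance of winning
  };
}

console.log(forward(Array(N_INPUT).fill(0.5)));
```

During self-play, the policy head steers which branches MCTS explores, and the value head scores positions without playing them out to the end — the two pieces of "intuition" described above.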
So that takes us to DeepTraffic, and the eight billion hours spent stuck in traffic — America's pastime. We try to simulate the behavior layer of driving — not the immediate control, not the motion planning, but what sits on top of those control decisions: the human-interpretable decisions of changing lanes, of speeding up and slowing down — modeling that in a micro-traffic simulation framework of the kind that's popular in traffic engineering, shown here, and we apply deep reinforcement learning to it. I'll call it DeepTraffic. The goal is to achieve the highest average speed over a long period of time, weaving in and out of traffic. For students here, the requirement is to follow the tutorial and achieve a speed of 65 miles per hour — and, if you really want to win, a speed over 70 miles per hour, and perhaps upload your own image to make sure you look good doing it.

What you should do — clear instructions to compete: read the tutorial. You can change parameters in the code box on the website, selfdrivingcars.mit.edu/deeptraffic. Click the white button that says "apply code", which applies the code you write: these are the parameters you specify for the neural network; it applies those parameters, creates the architecture you specify, and now you have a network, written in JavaScript, living in the browser, ready to be trained. Then you click the blue button that says "run training", and that trains the network much faster than the one being visualized in the browser — a thousand times faster — by evolving the game, making decisions, taking in the grid space, as I'll talk about in a second. The speed limit is 80 miles an hour; based on the various adjustments we've made to the game, reaching an average of 80 miles an hour is certainly impossible, and reaching some of the speeds that were achieved last year is much, much more difficult. Finally, when you're happy and the training is done, submit the model to the competition — for the super eager, dedicated students, you can do so every five minutes — and to visualize your submission you can click "request visualization", specifying a custom image and color.

Okay, so here's the simulation. Speed limit: 80 miles an hour. Twenty cars on the screen; one of them is red — that one is controlled by a neural network. It's allowed the actions of speeding up, slowing down, changing lanes left or right, or staying exactly the same. The other cars are pretty dumb: they speed up, slow down, turn left and right, but they don't have a purpose in their existence — they do so randomly, or at least a purpose has not been discovered.

The road is a grid space, an occupancy grid. When a cell is empty, it's set to 80, meaning the grid value is whatever speed would be achievable if you were inside that cell; when there's a slower car in a cell, the value of that cell is the speed of that car. That's the state space, the state representation, and you choose what slice of that state space you take in — that's the input to the neural network. For visualization purposes, you can choose normal or fast speed for watching the network operate, and there are display options to help you build intuition about what the network takes in and what space the car is operating in. The default adds no extra information; then there's the learning input, which visualizes exactly which part of the road serves as the input to the network; then there's the safety system, which I'll describe in a little bit, showing all the parts of the road the car is not allowed to go into because it would result in a collision — and collisions, in JavaScript, would be very difficult to animate; and the full map.

Here's the safety system. You can think of it as ACC, basic radar, ultrasonic sensors helping you avoid the obvious collisions with obviously detectable objects around you. The task for this red car, for this neural network, is to move about this space under the constraints of the safety system — the red shows all the parts of the grid it's not able to move into. So the goal for the car is to not get stuck in traffic: to make big, sweeping motions to avoid crowds of cars. Like DQN, the input is the state space and the output is the value of the different actions, and based on the epsilon parameter, through the training process and the inference/evaluation process, you choose how much exploration you want to do — these are all parameters. The learning is done in the browser, on your own computer, utilizing only the CPU. The action space is of size five — I'm giving you some of the variables here, and you can go back to the slides to look at them. The "brain", quote-unquote, is the thing that takes in the state and the reward, does a forward pass, and produces the next action; the brain is where the neural network is contained, for both training and evaluation. The learning input can be controlled in width and in forward and backward length: lanesSide is the number of lanes to the side that you see, patchesAhead is the patches ahead that you see, and patchesBehind is the patches behind that you see. And new this year, you can control the number of agents controlled by the neural network, anywhere from one to ten.
The evaluation is performed exactly the same way: you have to achieve the highest average speed over the agents. The critical thing here is that the agents are not aware of each other, so they're not jointly planning: the network is trained under the joint objective of achieving the best average speed for all of them, but the actions are taken in a greedy way by each. It's very interesting what can be learned in this way, and these kinds of approaches are scalable to an arbitrary number of cars — you could imagine us plopping down the best cars from this class together, the best neural networks, and having them compete — because, given their greedy operation, the number of networks that can concurrently operate is fully scalable.

There are a lot of parameters: the temporal window; the layers — the many layer types that can be added; here's a fully connected layer with ten neurons; the activation functions — all of these can be customized, as specified in the tutorial. The final layer is a fully connected layer whose output is a five-way regression, giving the value of each of the five actions. And there are a lot of more specific parameters, some of which we've just discussed: from gamma, to epsilon, to the experience replay size, to the temporal window, to the optimizer — the learning rate, momentum, batch size, and the l2 and l1 decay for regularization — and so on.
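Putting those together, an entry in the code box might look like the following. It follows the ConvNetJS-style deep Q-learning setup the site builds on, but treat the exact field names and values as illustrative, not as the official starter code — the tutorial is the authority:

```javascript
// Illustrative DeepTraffic code-box contents (ConvNetJS-style deepqlearn setup;
// field names and values are an approximation, not the official starter code).
lanesSide = 2;              // lanes visible to each side
patchesAhead = 30;          // grid patches ahead of the car
patchesBehind = 10;         // grid patches behind
trainIterations = 100000;   // how long to train in the background

var num_inputs = (lanesSide * 2 + 1) * (patchesAhead + patchesBehind);
var num_actions = 5;        // no action / accelerate / decelerate / go left / go right
var temporal_window = 3;    // how many past frames the network also sees

var layer_defs = [];
layer_defs.push({ type: 'input', out_sx: 1, out_sy: 1,
                  out_depth: num_inputs * temporal_window
                           + num_actions * temporal_window + num_inputs });
layer_defs.push({ type: 'fc', num_neurons: 32, activation: 'relu' });
layer_defs.push({ type: 'fc', num_neurons: 32, activation: 'relu' });
layer_defs.push({ type: 'regression', num_neurons: num_actions });  // one Q-value per action

var opt = {
  temporal_window: temporal_window,
  experience_size: 10000,        // replay memory size
  gamma: 0.9,                    // discount factor
  learning_steps_burnin: 1000,   // random actions before learning kicks in
  epsilon_min: 0.05,             // floor on exploration during training
  epsilon_test_time: 0.0,        // fully greedy at evaluation time
  layer_defs: layer_defs,
  tdtrainer_options: { learning_rate: 0.001, momentum: 0.0,
                       batch_size: 64, l2_decay: 0.01 },
};

brain = new deepqlearn.Brain(num_inputs, num_actions, opt);
```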
There's a big white button that says "apply code". Pressing it kills all the work you've done up to that point, so be careful: you should press it only at the very beginning — especially if you happen to leave your computer running in training for several days, as folks have done. The blue training button you press, and it trains based on the parameters you specify, and the network state gets shipped to the main simulation from time to time. So the thing you see in the browser when you open up the website is running the same network that's being trained, and it regularly updates that network, so it's getting better and better: even if the training takes weeks, it's constantly updating the network you see on the left. If the car for the network you're training is just standing in place and not moving, it's probably time to restart and change the parameters — maybe add a few layers to your network. The number of training iterations is certainly an important parameter to control.

The evaluation is something we've done a lot of work on since last year, to reduce the degree of randomness — to remove the incentive to submit the same code over and over again in the hope of producing a higher evaluation score. The method for evaluation: we collect the average speed over runs of about 45 seconds of game each — not minutes, 45 simulated seconds — there are 500 of those, and we take the median speed of the 500 runs. It's done server-side, so it's extremely difficult to cheat — I urge you to try. You can also try it locally: there's a "start evaluation run", but that one doesn't count; that's just for you to feel better about your network and to build your own intuition — it should produce a result very similar to the one produced on the server. And, as I said, we've significantly reduced the influence of randomness, so the score, the speed you get for the network you design, should be very similar with every evaluation.

Loading and saving: if the network is huge and you want to switch computers, you can save the network — it saves both the architecture of the network and the weights — and you can load it back in. Obviously, it doesn't save any of the training experience you've already accumulated; you can't do transfer learning with JavaScript in the browser, yet. Submitting your network: "submit model to competition" — and make sure you run training first, otherwise it'll be the randomly initialized network and will not do so well. You can resubmit as often as you like, and the highest score is what counts. The coolest part is that you can load your custom image, specify colors, and request the visualization — we have not yet shown the visualization, but I promise you it's going to be awesome. Again: read the tutorial, change the parameters in the code box, click "apply code", run training. Everybody in this room, on the way home, on the train — hopefully not in your car — should be able to do this in the browser, and then you can request a visualization; because it's an expensive process that we have to run server-side, you have to actually want it. The competition link is there, the GitHub starter code is there, and the details, for those who truly want to win, are in the arXiv paper.

So, a question that will come up throughout is whether these reinforcement learning approaches — or rather, whether action, planning, and control — are amenable to learning at all. Certainly in the case of driving, we can't do what AlphaGo Zero did — learn from scratch, from self-play — because that would result in millions of crashes on the way to learning to avoid crashes, unless we're working, like we are with DeepCrash, on an RC car, or working in simulation. So we can look at expert data, at driver data, which we have a lot of, and learn from it. It's an open question whether this is applicable. To date — and I'll bring up two companies because they're both guest speakers — deep RL is not involved in the most successful robots operating in the real world. In the case of Boston Dynamics, most of the perception, control, and planning, as in this robot, does not involve learning approaches, except for minimal additions on the perception side, to the best of our knowledge. And certainly the same is true with Waymo, as the speaker on Friday will discuss: deep learning is used a little bit in perception, on top, but most of the work is done from the sensors with optimization-based, model-based approaches — trajectory generation, and optimizing which trajectory is best to avoid collisions. Deep RL is not involved.

And, coming back again and again: the unexpected local pockets of high reward that arise in all of these situations when applied in the real world. For the cat video — it's pretty short — where the cats are ringing the bell and learning that ringing the bell maps to food, I urge you to think about how that behavior can evolve over time in unexpected ways that may not have a desirable effect, where the final reward is in the form of food and the intended effect is to ring the bell. That's where AI safety comes in, in the Artificial General Intelligence course in two weeks — something we'll explore extensively: how these reinforcement learning and planning algorithms evolve in ways that are not expected, how we can constrain them, and how we can design reward functions that result in safe operation. So I encourage you to come to the talk on Friday at 1:00 p.m. — as a reminder, 1:00 p.m., not 7:00 p.m. —
in Stata 32-123, and to the awesome talks in two weeks, from Boston Dynamics to Ray Kurzweil and so on, for AGI. Now, tomorrow we'll talk about computer vision and SegFuse. Thank you, everybody. [Applause]