MIT 6.S094: Introduction to Deep Learning and Self-Driving Cars
1L0TKZQcUtA • 2017-01-16
All right, hello everybody. Hopefully you can hear me well. Great. So welcome to course 6.S094, Deep Learning for Self-Driving Cars. We will introduce to you the methods of deep learning, of deep neural networks, using the guiding case study of building self-driving cars. My name is Lex Fridman; you get to listen to me for the majority of these lectures, and I am part of an amazing team with some brilliant TAs — would you say brilliant? — Dan Brown (you guys want to stand up? Okay, they're in the front row), William Angell, Spencer Dodd, and, all the way in the back, the smartest and the tallest person I know, Benedikt Jenik.

What you see there on the left of the slide is a visualization of one of the two projects, one of the two simulation games, that we'll get to go through. We use it as a way to teach you about deep reinforcement learning, but also as a way to excite you, by challenging you to compete against others if you wish, to win a special prize — yet to be announced, a super secret prize. You can reach me and the TAs at deepcars@mit.edu if you have any questions about the tutorials, about the lectures, about anything at all. The website, cars.mit.edu, has the lecture content, code, tutorials. Like today: the lecture slides for today are already up in PDF form. The slides themselves, if you want to see them, just email me, but they're over a gigabyte in size because they're very heavy in videos, so I'm just posting the PDFs. There will be lecture videos available a few days after each lecture is given. Speaking of which, there is a camera in the back: this is being videotaped and recorded, but for the most part the camera is just on the speaker, so you shouldn't have to worry. If that kind of thing worries you, you could sit on the periphery of the classroom, or maybe I suggest sunglasses and a fake mustache; that would be a good idea.

There is a competition for the game that you see on the left, and I'll describe exactly what's involved. In order to get credit for the course, you have to design a neural network that drives the car just above the speed limit, 65 miles per hour; but if you want to win, you need to go a little faster than that. So who is this class for? You may be new to programming, new to machine learning, new to robotics — or you're an expert in those fields but want to go back to the basics. What you will learn is an overview of deep reinforcement learning, convolutional neural networks, recurrent neural networks, and how these methods can help improve each of the components of autonomous driving: perception (visual perception), localization, mapping, control, planning, and the detection of driver state.

Okay, two projects. Code name DeepTraffic is the first one. In this particular formulation of it there are seven lanes; it's a top view. It looks like a game, but I assure you it's very serious. The agent in red, the car in red, is being controlled by a neural network, and we'll explain how you can control and design the various aspects, the various parameters, of this neural network. And it learns in the browser: we're using ConvNetJS, which is
a library written in JavaScript by Andrej Karpathy. So, amazingly, we live in a world where you can train a neural network in your browser in a matter of minutes, and we'll talk about how to do that. The reason we did this is so that there are very few requirements to get you up and started with your own networks: in order to complete this project for the course, you don't need anything except a Chrome browser, and to win the competition, you don't need anything except the Chrome browser.

The second project, code named DeepTesla, uses data from a Tesla vehicle of the forward roadway, and uses end-to-end learning: taking the image and putting it into a convolutional neural network, a regressor that maps directly to a steering angle. So all it takes is a single image, and it predicts a steering angle for the car. We have the data for what the car itself did, and you get to build a neural network that tries to do better — tries to steer better, or at least as well as the car.

Okay, let's get started with the question, with the thing, that we understand so poorly at this time because it's so shrouded in mystery, but that fascinates many of us, and that's the question of what is intelligence. This is from a March 1996 Time magazine, and the question "Can machines think?" is answered below: they already do. So what, if anything, is special about the human mind? It's a good question for 1996, a good question for 2016, '17, now, and the future. And there are two ways to ask that question. One is the special purpose version: can an artificial intelligence system achieve a well-defined, formally defined, finite set of goals? This little diagram, from a book that got me into artificial intelligence as a bright-eyed high school student, Artificial Intelligence: A Modern Approach, is a beautifully simple diagram of an intelligent system: it exists in an environment, it has a set of sensors that do the perception, it takes those sensors in, does something magical — there's a question mark there — and with a set of effectors acts in the world, manipulates objects in that world.

So, special purpose: under this formulation, as long as the environment is formally defined, well defined; as long as the set of goals is well defined; as long as the set of actions, the sensors, and the way the perception carries itself out are well defined — we have good algorithms, which we'll talk about, that can optimize for those goals. The question is, if we inch along this path, will we get closer to the general formulation, the general purpose version of what artificial intelligence is: can it achieve a poorly defined, unconstrained set of goals, with an unconstrained, poorly defined set of actions, and unconstrained, poorly defined utility functions, rewards? This is what human life is about; this is what we do pretty well most days: exist in an undefined world full of uncertainty.

So, okay, we can separate tasks into three different categories. First, formal tasks. This is the easiest — it didn't seem so at the birth of artificial intelligence, but it's in fact true if you think about it: the easiest is the formal tasks, playing board games, theorem proving, all the kinds of mathematical logic problems that can be formally defined. Then there are the expert tasks. This is where a lot of the exciting breakthroughs have been happening, where machine learning methods, data-driven methods, can help aid or improve on the performance of human experts: medical diagnosis, hardware design, scheduling. And then there is the thing we take for granted, the "trivial" thing, the thing we do so easily every day when we wake up in the morning: the mundane tasks of everyday speech, of written language, of visual perception, of walking — which, as we'll talk about in today's lecture, is a fascinatingly difficult task — and of object manipulation.

So the question we're asking here, before we
talk about deep learning, before we talk about the specific methods, is this: we really want to dig in and try to see what it is about driving. How difficult is driving? Is it more like chess, which you see on the left there, where we can formally define a set of lanes and a set of actions, and formulate it so there are, you know, five actions — you can change a lane, you can avoid obstacles; you can formally define an obstacle, you can formally define the rules of the road? Or is there something about driving that's like natural language, something similar to everyday conversation, that requires a much higher degree of reasoning, of communication, of learning, of existing in this underactuated space? Is it a lot more than just left lane, right lane, speed up, slow down?

So let's look at it as a chess game. Here are the chess pieces: what are the sensors we get to work with on an autonomous vehicle? We'll get a lot more in depth on this, especially with the guest speakers who built many of these. There are the range sensors, radar and lidar, that give you information about obstacles in the environment and help localize those obstacles. There is the visible-light camera and stereo vision, which give you texture information that helps you figure out not just where the obstacles are but what they are — helps to classify them, helps to understand their subtle movements. Then there is the information about the vehicle itself, about its trajectory and movement, which comes from the GPS and IMU sensors. There is the rich state of the vehicle itself — what is it doing, what are all the individual systems doing — which comes from the CAN network. And there is one of the less studied sensors, but fascinating to us on the research side: audio. The sounds of the road provide rich context — the sound of a wet road, the sound a road makes after it stops raining but is still wet, the screeching tire, the honking; these are all fascinating signals as well.

And the focus of the research in our group — the thing that's really under-investigated — is the internal-facing sensors: driver sensing, the state of the driver. Where are they looking? Are they sleepy? What is their emotional state? Are they in the seat at all? That comes from the visual information and the audio information.

More than that, here are the tasks, if you were to break the task of building a self-driving vehicle into modules. First, you want to know where you are: where am I? Localization and mapping: you map the external environment, figure out where all the different obstacles, all the entities, are, and use that estimate of the environment to then figure out where the robot is. Then there's scene understanding: understanding not just the positional aspects of the external environment and its dynamics, but also what those entities are. Is it a car? A pedestrian? A bird? Then there's movement planning: once you have figured out, to the best of your abilities, your position and the position of other entities in this world, there's figuring out a trajectory through that world. And finally, once you've figured out how to move safely and effectively through that world, there is figuring out what the human on board is doing — because, as I will talk about, the path to a self-driving vehicle (and hence our focus on Tesla) may go through semi-autonomous vehicles, where the vehicle must not only drive itself but effectively hand over control from the car to the human and back.

Okay, a quick history. There's a lot of fun stuff from the '80s and '90s, but the big breakthroughs came in the second DARPA Grand Challenge, with Stanford's Stanley, when they won the competition, one of five cars that finished. This was an incredible accomplishment: in a desert race, a fully autonomous vehicle was able to
complete the race in record time. Then came the DARPA Urban Challenge in 2007, where the task was no longer to race through the desert but through an urban environment, and CMU's Boss, with GM, won that race. A lot of that work led directly to acceptance by large, major industry players taking on the challenge of building these vehicles: Google's — now Waymo's — self-driving car; Tesla with its Autopilot system and now the Autopilot 2 system; Uber with its testing in Pittsburgh. And there are many other companies, including one of the speakers for this course, from nuTonomy, that are driving the wonderful streets of Boston.

Okay, so let's take a step back. If we think about the accomplishments in the DARPA challenges, and if we look at the accomplishments of the Google self-driving car, it essentially boils the world down into a chess game: it uses incredibly accurate sensors to build a three-dimensional map of the world, localizes itself effectively in that world, and moves about that world in a very well-defined way. Now, the open question is: if driving is more like a conversation, a natural language conversation, how hard is it to pass the Turing test? The Turing test, in the popular current formulation: can a computer be mistaken for a human being more than 30% of the time, when a human talking behind a veil, having a conversation with either a computer or a human, mistakes the other side of that conversation for a human when it's in fact a computer? The way you would build a system that successfully passes the Turing test is: there's the natural language processing part, to enable it to communicate successfully — generate language and interpret language; then you represent knowledge, the state of the conversation carried over time; and the last piece, the hard piece, is automated reasoning. Can we teach machine learning methods to reason? That is something that
artificial and human brains, are interconnected. In the human brain there are, I believe, about 10,000 outgoing connections from every neuron on average. The largest current artificial neural network, as far as I'm aware, has about 10 billion of those connections, synapses. The human brain, by the best estimate I'm aware of, has about 10,000 times that: 100 to 1,000 trillion synapses.

Now, what is an artificial neuron, this building block of a neural network? It takes a set of inputs, puts a weight on each of those inputs, sums them together, adds a bias value that sits on each neuron, and then an activation function takes that sum plus the bias and squishes it to produce a zero-to-one signal. This allows a single neuron to take a few inputs and produce an output — a classification, for example, a 0 or 1. And, as we'll talk about, it can serve as a linear classifier: it can learn to draw a line, like what's seen here, between the blue dots and the yellow dots, and that's exactly what we'll do in the IPython notebook that I'll talk about. The basic algorithm is: you initialize the weights on the inputs, you compute the output, and if the output does not match the ground truth — the expected output, the output it should produce — the weights are punished, adjusted, accordingly (we'll talk through a little bit of the math of that), and this process is repeated until the perceptron does not make any more mistakes.

Now, here's one of the amazing things about neural networks — there are several, and I'll talk about them. On the mathematical side, it's the universality of neural networks: with just a single hidden layer — the inputs on the left, the outputs on the right, and in the middle a single hidden layer — it can closely approximate any function.
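That perceptron loop — initialize the weights, compute the output, nudge the weights whenever the prediction misses the ground truth, repeat until there are no more mistakes — can be sketched in a few lines of Python. This is a toy, linearly separable dataset of my own, not the course notebook:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    """Classic perceptron: weighted sum + bias, step activation, error-driven updates."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0   # step activation on the weighted sum
            err = yi - pred
            if err != 0:                         # "punish" the weights on a mistake
                w += lr * err * xi
                b += lr * err
                mistakes += 1
        if mistakes == 0:                        # stop once no more mistakes are made
            break
    return w, b

# Two linearly separable clusters ("blue dots" vs "yellow dots")
X = np.array([[0., 0.], [0., 1.], [2., 2.], [2., 3.]])
y = np.array([0, 0, 1, 1])
w, b = train_perceptron(X, y)
preds = [(1 if xi @ w + b > 0 else 0) for xi in X]
print(preds)  # all four points classified correctly
```

On separable data like this the loop converges in a few epochs; on data no line can split, it never stops making mistakes, which is exactly why deeper representations matter later in the lecture.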
This is an incredible property: with a single hidden layer, any function you could think of — and you can think of driving as a function: it takes as input the world outside and produces as output the control of the vehicle — there exists a neural network out there that can drive perfectly. It's a fascinating mathematical fact.

So we can think of these functions as special purpose functions, special purpose intelligence. You can take as input, say, the number of bedrooms, the square feet, and the type of neighborhood — those are the three inputs — pass those values through to the hidden layer, and then, one more step, it produces the final price estimate for the house, for the residence. And we can teach a network to do this pretty well in a supervised way. This is supervised learning: you provide a lot of examples where you know the number of bedrooms, the square feet, the type of neighborhood, and you also know the final price of the house, and then, as I'll talk about, through a process of back-propagation, you teach the network to make this prediction pretty well.

Now, some of the exciting recent breakthroughs have been in general purpose intelligence. This is from Andrej Karpathy, who's now at OpenAI. I would like to take a moment here to try to explain how amazing this is. This is a game of Pong. If you're not familiar with Pong, there are two paddles and you're trying to bounce the ball back in such a way that prevents the other player from bouncing it back at you. The artificial intelligence agent is on the right, in green, and up top is the score, 8 to 1. Now, this network takes about three days to train on a regular computer. What is this network doing? It's called a policy network. The input is the raw pixels — they're slightly processed, and you take the difference between two frames, but it's basically the raw pixel information. There are a few hidden layers, and the output is a single probability of moving up. That's it; that's the whole system.

And what it's doing is learning without knowing, at any one moment, what the right thing to do is. Is it to move up? Is it to move down? You only know what the right thing to do was by the fact that eventually you win or lose the game. So this is the amazing thing here: there's no supervised learning, no universal fact about any one state being good or bad, or any one action being good or bad in any state. Instead, you punish or reward every single action you took, for the entire game, based on the result. No matter what you did, if you won the game — the end justifies the means — every action you took, every state–action pair, gets rewarded; if you lost the game, it gets punished. And with this process, with only 200,000 games, where the system just simulates the games, it can learn to beat the computer. This system knows nothing about Pong, nothing about games. This is general intelligence — except for the fact that it's just a game of Pong, and I will talk about how this can be extended further, why this is so promising, and why we should also proceed with caution.

So again: there is a set of actions you take, up or down, based on the output of the network — there's a threshold: given the probability of moving up, you move up or down — and you have a set of states, and every single state–action pair is rewarded if there's a win and punished if there's a loss. When you go home, think about how amazing that is, and if you don't understand why that's amazing, spend some time on it. It's incredible.

Sure, sure thing. The question was: what is supervised learning, what is unsupervised learning, what's the difference? So, supervised learning: when people talk about machine learning, they mean supervised learning most of
the time. Supervised learning is learning from data, learning from example: you have a set of inputs and a set of outputs that you know are correct, what's called ground truth. You need a large number of those examples to train any of the machine learning algorithms to learn and then generalize to future examples. There's actually a third kind, called reinforcement learning, where the ground truth is sparse: the information about whether something was good or not only arrives every once in a while — at the end of the game, not every single frame. And unsupervised learning is when you have no information about which outputs are correct or incorrect. The excitement of the deep learning community is around unsupervised learning, but it has achieved no major breakthroughs at this point. I'll talk about what the future of deep learning may hold, and a lot of the people working in the field are excited by unsupervised learning, but right now every interesting accomplishment has to do with supervised learning.

So, I guess the green one, right? Yeah — and the brown one is just a heuristic solution; it looks at the velocity. So the question was: the green paddle learns to play this game successfully against this specific brown paddle, which operates under specific kinds of rules; how can it be guaranteed that this generalizes to another opponent, for example? And it can't. But the mechanism by which it learns generalizes. So as long as you let it play in whatever world you want it to succeed in, long enough, it will use the same approach to learn to succeed in that world. The problem is, this works for worlds you can simulate well.
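The outcome-based credit assignment described above — every state–action pair of a won game gets rewarded, every one of a lost game gets punished, typically with a discount so actions closer to the outcome get more of the credit — can be sketched as follows. This is a toy five-step episode with a hypothetical helper name, not Karpathy's actual code:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Spread a sparse end-of-game reward back over every action that led to it."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # later outcomes flow back, discounted
        returns[t] = running
    return returns

# A 5-step episode: no signal at all until the last frame, where we win (+1)
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)  # every earlier action gets a (discounted) share of the win

# A lost game (-1 at the end): the very same actions would all be pushed down instead
lost = discounted_returns([0.0, 0.0, 0.0, 0.0, -1.0], gamma=0.5)
```

In a policy-gradient setup these per-step returns are what scale the update for each action the network took, which is exactly the "end justifies the means" behavior described in the lecture.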
Unfortunately, one of the big challenges of neural networks is that they're not currently efficient learners: we need a lot of data to learn anything. Human beings often need just one example, and they learn very efficiently from that one example. Again, I'll talk about that as well; it's a good question.

So, the drawbacks of neural networks. If you think about the way a human being would approach this game of Pong, they would only need a simple set of instructions: you're in control of a paddle, you can move it up and down, and your task is to bounce the ball past the other player, who is controlled by an AI. Now, the human being — they may not win the game, but they would immediately understand it, and would be able to play it well enough to pretty quickly learn to beat it. But they need to have a concept of control, of what it means to control a paddle; they need a concept of a paddle, of moving up and down, of a ball, of bouncing. They have to have at least a loose concept of real-world physics that they can then project onto this two-dimensional world. All of these are concepts you come to the table with. That's knowledge, and the way you transfer that knowledge from your previous experience, from childhood to now, when you come to this game — that is something called reasoning, whatever reasoning means. And the question is whether, through this same kind of process, you can see the entire world as a game of Pong, and reasoning is simply the ability to simulate that game in your mind and learn very efficiently — much more efficiently than 200,000 iterations.

The other challenge of deep neural networks, and of machine learning broadly, is that you need big data. They're inefficient learners, as I said, and that data also needs to be supervised: you need ground truth, which is very costly. Annotation — a human being looking at a particular image and labeling it as a cat or a dog or whatever object is in the image — is very expensive. And particularly for neural networks, there are a lot of parameters to tune, a lot of hyperparameters. You need to figure out the network structure first: how does this network look, how many layers, how many hidden nodes, what type of activation function on each node. And then, once you've built your network, there are parameters for how you teach it: learning rate, loss function, mini-batch size, number of training iterations, gradient update smoothing, even the selection of the optimizer. It's a topic of many research papers — certainly it's rich enough for research papers — but it's also really challenging. It means that you can't just plop a network down and have it solve the problem in general. And defining a good loss function — or, in the case of Pong or other games, a good reward function — is difficult.

Here's a game. This is a recent result from OpenAI: teaching a network to play the game Coast Runners. The goal of Coast Runners is — you're in a boat, and the task is to go around a track and successfully complete a race against the other players. Now, this network, which is a near-optimal one, has figured out that in the game you actually get a lot of points for collecting certain objects along the path. So what you see is that it's figured out to go in a circle and collect those green turbo things — it's figured out that you don't need to complete the race to earn the reward. Despite being on fire, hitting the wall, and going through this whole process, it has actually achieved at least a local optimum, given the reward function of maximizing the number of points. It's figured out a way to earn a
higher reward while ignoring the implied bigger-picture goal of finishing the race, which we as humans understand much better. For self-driving cars this raises ethical questions, among others. You can watch this for hours — and it will do that for hours — and that's the point: it's hard to teach, hard to encode, hard to formally define a utility function under which an intelligent system needs to operate, and that's made obvious even in a simple game.

Yep, question. So the question was: what would be an example of a local optimum, similar to the Coast Runners one, for an autonomous vehicle in the real world? It's a touchy subject, but it would certainly have to involve the choices we make under near-crashes and crashes: the choices a car makes about what to avoid. For example, if a crash is imminent and there's no way to stop and prevent it, do you keep the driver safe, or do you keep the other people safe? Even if you don't choose to acknowledge it, even if it's only implicit in the data and the learning that you do, there's an implied reward function there, and we need to be aware of what that reward function is — because the system may find something, and until we actually see it, we won't know. Once we see it, we'll realize, oh, that was a bad design. And that's the scary thing: it's hard to know ahead of time what that is.

So, the recent breakthroughs in deep learning came from several factors. First is compute: Moore's law, CPUs getting 100 times faster every decade. Then there are GPUs — the ability to train neural networks on GPUs — and now ASICs, which have created a lot of capability in terms of energy efficiency and training larger networks more efficiently. There is larger data: in the 21st century there's digitized data, larger data sets of digital data, and now that data is becoming more organized — not just vaguely available data out there on the internet, but actual organized data sets like ImageNet, and certainly for natural language there are large data sets. There are the algorithmic innovations: back-propagation, convolutional neural networks, LSTMs — all these different architectures for dealing with specific types of domains and tasks. A huge one is infrastructure, on both the software and the hardware side: there's Git, the ability to share software in an open-source way; there are pieces of software that make robotics and machine learning easier, ROS, TensorFlow; there's Amazon Mechanical Turk, which allows efficient, cheap annotation of large-scale data sets; there's AWS, hosting the data and the compute for machine learning in the cloud. And then there's the financial backing of large companies: Google, Facebook, Amazon.

But really, nothing fundamental has changed; there have not been significant methodological breakthroughs. Convolutional neural networks have been around since the '90s; neural networks have been around since the '60s. There have been a few improvements, but in terms of methodology, compute has really been the workhorse. The ability to get a hundredfold improvement every decade holds promise, and the question is whether that reasoning thing I talked about is achievable with nothing more than a larger network. That is the open question.

So, some terms for deep learning. First of all, deep learning is a PR term for neural networks: a term for utilizing deep neural networks, neural networks that have many layers; a symbolic term for the newly gained capabilities that compute, that training on GPUs, has brought us. Deep learning is a subset of machine learning, and there are many other methods that are still effective. The terms that will come up in this class: first of all, multi-layer perceptrons, deep neural networks, recurrent
neural networks, LSTMs (long short-term memory networks), CNNs or ConvNets (convolutional neural networks), and deep belief networks. And the operations that will come up are convolution, pooling, activation functions, and back-propagation.

Yep — cool question. The question was: what is the purpose of the different layers in a neural network? What does it mean to have one configuration versus another? So, for a neural network with several layers, the only thing you have an understanding of is the inputs and the outputs; you don't have a good understanding of what each layer does. They're mysterious things, neural networks. I'll talk about how, with every layer, the network forms a higher-level, higher-order representation of the input. So it's not like the first layer does localization, the second layer does path planning, the third layer does navigation, how you get from here to Florida — or maybe it does, but we don't know. We're beginning to visualize neural networks for simple tasks, like classifying cats versus dogs on ImageNet; there we can tell what the first layer does, and the second, and the third, and we'll look at that. But for driving, when as input you provide just the images and as output the steering, it's still unclear what is learned — partially because we don't have neural networks that drive successfully yet.

Do you feed the network its layers, or does it eventually generate them on its own over time? So the question was: does a neural network generate layers over time — does it grow? That's one of the challenges: a neural network is predefined. The architecture, the number of nodes, the number of layers — that's all fixed. Unlike the human brain, where neurons die and are born all the time, a neural network is pre-specified. That's it; that's all you get, and if you want to change it, you have to change the architecture and retrain everything. So it's fixed.

So what I encourage you to do is proceed with caution.
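Two of the operations in that list — convolution and (max) pooling — can be sketched in plain NumPy. This is a toy illustration of the ideas, not how a real framework implements them:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode 2D convolution (really cross-correlation, as deep learning libraries do)."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: keep the strongest activation in each patch."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

# A vertical-edge detector on an image that is dark on the left, bright on the right
img = np.zeros((6, 6))
img[:, 3:] = 1.0
edge_kernel = np.array([[-1., 1.]])                    # fires where brightness jumps
activation = np.maximum(conv2d(img, edge_kernel), 0)   # ReLU activation function
pooled = max_pool(activation, size=2)                  # downsample, keep the edge response
```

The kernel here is hand-picked; the point of a convolutional network is that the first layers learn kernels like this one from data, which is the "edges, then corners, then object parts" progression discussed later.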
What I encourage, though, is to proceed with caution, because there's this feeling when you first teach a network, with very little effort, how to do some amazing task, like classifying a face versus a non-face, or your face versus other faces, or cats versus dogs. It's an incredible feeling, and there's definitely this sense of "I'm an expert now." But what you realize is that you don't actually understand how it works, and getting it to perform well on more general tasks, on larger-scale datasets, in more useful applications, requires a lot of hyperparameter tuning, figuring out how to tweak little things here and there, and still, in the end, you don't understand why it works so damn well.

So deep learning, these deep neural network architectures, is representation learning, and this is the difference from traditional machine learning methods. For example, take the task where the input is an image; here the input to the network is on the bottom and the output is up at the top. The input is a single image of a person, specifically all of the pixels in that image, the RGB values, the different colors of the pixels. What the network does is build a multi-resolution representation of this data. The first layer learns the concept of edges, for example. The second layer starts to learn compositions of those edges: corners, contours. Then it starts to learn about object parts, and finally it provides a label for the entities in the input. This is the difference from traditional machine learning methods, where concepts like edges, corners, and contours are manually pre-specified by human beings, human experts in the particular domain.

And representation matters. Finding a separating line in Cartesian coordinates for this particular dataset, where you want a machine learning system that tells the difference between green triangles and blue circles, is difficult; there's no line that separates them cleanly. If you asked a human being, a human expert in the field, to try to draw that line, they would probably do a PhD on it and still not succeed. But a neural network can automatically figure out how to remap that input into polar coordinates, where the representation is such that the dataset becomes easily linearly separable.

So deep learning is a subset of representation learning, which is a subset of machine learning, and a key subset of artificial intelligence, because of its ability to compute an arbitrary number of features that sit at the core of the representation. If you're trying to detect a cat in an image, you're not specifying 215 specific features of cat ears and whiskers and so on that a human expert would specify; you allow a neural network to discover tens of thousands of such features. Maybe for cats you are an expert, but for a lot of objects you may never be able to sufficiently provide the features that would successfully identify the object. So this kind of representation learning is, one, easy, in the sense that all you have to provide is inputs and outputs, a dataset you care about, without hand-engineering features. And two, because of their ability to construct arbitrarily sized representations, deep neural networks are hungry for data: the more data we give them, the more they're able to learn about that particular dataset.

So let's look at some applications first, some cool things that deep neural networks have been able to accomplish up to this point. The basic one is AlexNet, for ImageNet. ImageNet is a famous dataset, and there's a competition of classification and localization where the task is, given an image, identify the five most likely things in that image and which is the most likely, and you have to do so correctly. On the right there's an image of a leopard, and you have to correctly classify that it is in fact a leopard. So these networks are able to do this pretty well: given a specific image, determine that it's a leopard.
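The polar-coordinates point above can be sketched in a few lines. Two classes laid out as an inner disk and an outer ring have no separating line in Cartesian (x, y) coordinates, but after remapping to the radius, a single threshold (a line in the new representation) separates them perfectly. The data here is synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Class 0: inner disk (radius 0..1); class 1: outer ring (radius 2..3)
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = np.concatenate([rng.uniform(0.0, 1.0, 100),
                        rng.uniform(2.0, 3.0, 100)])
labels = np.concatenate([np.zeros(100), np.ones(100)])

# Cartesian coordinates: no straight line separates the classes
x, y = radii * np.cos(angles), radii * np.sin(angles)

# Remap (x, y) -> radius; now one threshold separates the classes
r = np.hypot(x, y)
predictions = (r > 1.5).astype(float)
print((predictions == labels).mean())  # 1.0: perfectly separated
```

A deep network learns a remapping like this on its own; here we hand-coded it to show why the representation, not the classifier, is the hard part.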
What's shown here on the x-axis is years, and on the y-axis is classification error. Starting from 2012 on the left with AlexNet, the error has decreased from around 16% (and higher still with the traditional methods before then) to below 4% today. That's human-level performance: if I gave you this picture of a leopard, for about 4% of such pictures of leopards you would not say it's a leopard. So for the first time, in 2015, convolutional neural networks outperformed human beings. That in itself is incredible, something that seemed impossible; now that it's done it seems less impressive, but I want to get at why it's so impressive, because computer vision is hard. We as human beings have evolved visual perception over millions of years, hundreds of millions of years, so we take it for granted, but computer vision is really hard; visual perception is really hard. There is illumination variability: the only information we get about an object is the reflection of light from its surface, so the same object can look, in terms of pixels, drastically different, and we still know it's the same object. There is pose variability and occlusion. Probably my favorite caption for a figure in an academic paper is "deformable and truncated cat." Cats are famously deformable; they take a lot of different shapes; arbitrary poses are possible. So computer vision needs to see the same object, the same class of objects, given all the variability in pose. And occlusions are a huge problem: we still know it's an object, we still know it's a cat, even when parts of it, sometimes large parts of it, are not visible. And then there's all the intra-class variability. All of these on the top two rows are cats, and many of them look drastically different. The bottom two rows are dogs, which also look drastically different, and yet some of the dogs look like cats and some of the cats look like dogs. As human beings we're pretty good at telling the difference, and we want computer vision to do even better than that. It's hard.

So how is this done? This is done with convolutional neural networks, the input to which is a raw image. Here the input on the left is an image of the number three. As I'll discuss, that image is passed through convolutional layers, which maintain spatial information, and on the output the network predicts which number is shown in the image: 0, 1, 2, through 9. Everybody is using the same kind of network to determine exactly that: the input is an image, the output is a number, and in the leopard case, the probability that it's a leopard.

Then there's segmentation, built on top of these convolutional networks, where you chop off the end of the network so that the output is a heat map. So instead of a detector for a cat, you can have a cat heat map, where the output neurons get excited spatially in the parts of the image that contain a tabby cat. This kind of process can be used to segment the image into different objects: the original input on the left is a woman on a horse, and the output is a fully segmented image, knowing where's the woman, where's the horse. And this kind of process can be used for object detection, the task of detecting an object in an image. The traditional method, with convolutional neural networks and in computer vision generally, is the sliding-window approach, where you have a detector, like the leopard detector, that you slide across the image to find where in that image the leopard is.
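The core operation those layers apply, convolution, can be sketched directly. Here is a minimal "valid" 2D convolution (as in deep learning libraries, really a cross-correlation): slide a small kernel over the image and take dot products, which is what preserves the spatial layout of the input. The kernel and image are toy examples:

```python
import numpy as np

def conv2d(image, kernel):
    # Valid convolution: output shrinks by kernel size minus one
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge kernel responds where intensity changes left to right
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge_kernel = np.array([[-1.0, 1.0]])
response = conv2d(image, edge_kernel)
print(response)  # nonzero only in the middle column, where the edge is
```

A convolutional layer is just many such kernels, learned from data rather than hand-designed, with an activation function applied to each response map.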
The segmentation-based approach, the R-CNN approach, instead efficiently segments the image in such a way that it can propose the parts of the image that are likely to contain a leopard, or in this case a cowboy, and that drastically reduces the computational requirements of the object detection task. Currently one of the best networks for the ImageNet localization task is the deep residual network. These networks are deep: VGG-19, VGGNet, is one of the famous ones; networks are getting above 20 layers, 34 layers in the ResNet case. The lesson there is that the deeper you go, the more representational power you have and the higher the accuracy, but you need more data.

Other applications: colorization of images. Again, the input is a single image and the output is a single image, so you can take black-and-white video from an old film and recolor it, and all you need to do to train that network in a supervised way is take modern films and convert them to grayscale. Now you have arbitrarily sized datasets of grayscale-to-color pairs, and with very little effort on top of that you're able to successfully, well, somewhat successfully, recolor images.

Google Translate does image translation in this way, image to image. It first perceives, here in German I believe ("dark chocolate" written in German on a box; it was German, correct me if I'm wrong), so it takes the image, detects the different letters, converts them to text, translates the text, and then, using the image-to-image mapping, maps the translated letters back onto the box. And you can do this in real time on video.

What we've talked about up to this point, on the left, are vanilla neural networks, convolutional neural networks, that map a single input to a single output: a single image to a number, a single image to another image. Then there are recurrent neural networks, the more general formulation, that map a sequence of images, or a sequence of words, or a sequence of any kind, to another sequence. These networks are able to do incredible things with natural language, with video, and with any time-series data. For example, we can convert typed text to handwritten text. Here you type in, and you can do this online, "deep learning for self-driving cars," and it will use an arbitrary handwriting style to generate the words "deep learning for self-driving cars." This is done using recurrent neural networks.

We can also take char-RNNs, as they're called, character-level recurrent neural networks, that train on an arbitrary text dataset and learn to generate text one character at a time. There is no preconceived syntactic or semantic structure provided to the network; it learns that structure. For example, you can train it on Wikipedia articles, like in this case, and it's able to generate text that not only makes some kind of grammatical sense, at least, but also keeps perfect syntactic structure for Wikipedia, for Markdown editing, for LaTeX editing, and so on. This text says: "Naturalism and decision for the majority of Arab countries' capitalide" (whatever that means) "was grounded by the Irish language by John Clair," and so on. These are sentences that, if you didn't know better, might sound correct. And it does this, let me pause on that, one character at a time. These aren't words being generated; it's one character. You start with the beginning three letters, "nat," and you generate, completely without knowing the word, "naturalism." This is incredible. You can start a sentence and let the network complete it. For example, if you start the sentence with "life is" or "life is about," it will complete it with a lot of fun things: "life is about the weather," "life is about kids," "life is about the true love of Mr. Mom," "life is about the truth now." And these are from Geoffrey Hinton.
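The one-character-at-a-time generation loop can be sketched without a neural network at all. This toy version stands in a bigram count table for the learned char-RNN; a real char-RNN would instead predict a probability distribution over the next character with an LSTM, but the sampling loop is the same. The corpus and sampling scheme here are illustrative:

```python
import random
from collections import Counter, defaultdict

# "Train": count which character follows which in a tiny corpus
corpus = "deep learning for self driving cars "
follows = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    follows[a][b] += 1

rng = random.Random(0)

def generate(seed, length):
    # Generate one character at a time, each conditioned on the last
    text = seed
    for _ in range(length):
        counts = follows[text[-1]]
        if not counts:  # character never seen with a successor: stop
            break
        chars, weights = zip(*counts.items())
        text += rng.choices(chars, weights=weights)[0]
    return text

out = generate("d", 20)
print(out)  # a plausible-looking character stream from the corpus alphabet
```

Because each step conditions only on recent context, the output looks locally right and globally meaningless, exactly the flavor of the Wikipedia sample above.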
The last two: if you start with "the meaning of life," it can complete that with "the meaning of life is literary recognition" (maybe true for some of us here, publish or perish) and "the meaning of life is the tradition of ancient human reproduction" (also true for some of us here, I'm sure).

Okay, so what else can you do? This has been very exciting recently: image caption generation. Image caption generation is important for large datasets of images where we want to be able to determine what's going on inside those images, especially for search. If you want to find a man sitting on a couch with a dog, you type it into Google and it's able to find that. Here, shown in black text, "a man sitting on a couch with a dog" is generated by the system, while "a man sitting in a chair with a dog in his lap" is generated by a human observer. These annotations are produced by detecting the different objects in the scene, segmenting the scene: on the right, "woman," "crowd," "cat," "camera," "holding," "purple," all of these words are being detected. Then many syntactically correct sentences are generated, and you rank which sentence is the most likely; in this way you can generate very accurate captions for the images.

You can do the same kind of process for image question answering. You can ask about quantity: how many chairs are there? You can ask about location: where are the ripe bananas? You can ask about the type of object: what is the object on the chair? It's a pillow. These again use recurrent neural networks. You can do the same thing with video caption generation, video description generation: looking at a sequence of images, as opposed to just a single image, what is the action going on in this situation? This is a difficult task, and there's a lot of work in this area now. On the left are correct descriptions: "a man is doing stunts on his bike," "a herd of zebras are walking in a field." And on the right there's "a small bus is running into a building": it's talking about relevant entities, but the description is incorrect. Or "a man is cutting a piece of a pair of a paper": the words are correct, perhaps, but close, no cigar.

One of the interesting things you can do with recurrent networks: think about the way human beings look at images. We only have a small fovea with which we focus on the scene, so right now your periphery is very distorted; if you're looking at the slides, or you're looking at me, that's the only thing in focus, and the majority of everything else is out of focus. We can use the same concept to teach a neural network to steer its attention around the image, both for perception and for generation of images. This is important, first, on the general artificial intelligence side, because it's just fascinating that we can selectively steer our attention, but it's also important for things like drones that have to fly at high speed in environments where, at 300-plus frames a second, you have to make decisions; you can't possibly localize yourself or perceive the world around you if you have to interpret the entire scene. So you can steer: shown here, for example, is reading house numbers by steering around an image. You can do the same task for reading and for writing. Reading numbers, here on the MNIST dataset on the left, is one task, but you can also selectively steer a network around an image to generate that image, starting with a blurred image and producing higher and higher resolution as the steering goes on.

Work here at MIT is able to map video to audio: hit stuff with a drumstick in a silent video, and the network is able to generate the sound that a drumstick hitting that particular object would make, so you can get texture information from that impact.
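The fovea-plus-periphery idea above can be sketched as a "glimpse" operation: extract a small high-resolution patch at the attention location, plus a blurry downsampled view of the whole frame. The function name, patch sizes, and the simple block-average downsampling are illustrative choices, not from the lecture:

```python
import numpy as np

def glimpse(image, center, patch=4):
    # Fovea: a sharp, small crop centered on the attention location
    r, c = center
    h = patch // 2
    fovea = image[r - h:r + h, c - h:c + h]
    # Periphery: the whole frame, block-averaged down by a factor of 4
    periphery = image.reshape(image.shape[0] // 4, 4,
                              image.shape[1] // 4, 4).mean(axis=(1, 3))
    return fovea, periphery

image = np.arange(16 * 16, dtype=float).reshape(16, 16)
fovea, periphery = glimpse(image, center=(8, 8))
print(fovea.shape, periphery.shape)  # (4, 4) (4, 4)
```

The attention network processes these two small arrays instead of the full image, and then chooses the next center to look at; that is what makes high-frame-rate perception tractable.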
Here is a video of a human soccer player playing soccer, and a state-of-the-art machine playing soccer. Well, let me give him some time to build up. [Laughter] Okay, so, soccer. We take this for granted, but walking is hard, object manipulation is hard; soccer is much harder than chess for us to do. On your phone now you can have a chess engine that beats the best players in the world, and you have to internalize that, because the question is: where does driving fall? Is it closer to chess, or is it closer to soccer? For the incredible, brilliant engineers who worked on the most recent DARPA challenge, this is a very painful video to watch, I apologize.