All right, welcome back, everyone. Sound okay? So we started to talk about neural networks yesterday; today we'll continue with neural networks that work with images, convolutional networks, and see how those types of networks can help us drive a car. If we have time, we'll cover a simple illustrative case study of detecting traffic lights: the problem of detecting green, yellow, red. If we can't teach our neural networks to do that, we're in trouble, but it's a good, clear, illustrative case study of a three-class classification problem.

Next, there's DeepTesla, here looped over and over in a very short GIF. This is actually running live on a website right now; we'll show it towards the end of the lecture. Once again, just like DeepTraffic, this is a neural network that learns to steer a vehicle based on video of the forward roadway, and once again it does all of that in the browser using JavaScript, so you'll be able to train your very own network to drive using real-world data. I'll explain how. We will also have a tutorial and code, and I'll briefly describe today, at the end of lecture if there's time, how to do the same thing in TensorFlow. If you want to build a network that's bigger and deeper, and you want to utilize GPUs to train that network, you don't want to do it in your browser; you want to do it offline using TensorFlow and a powerful GPU on your computer, and we'll explain how to do that.

Computer vision. Yesterday we talked about vanilla machine learning, where the size of the input is small; for the most part, the number of neurons in the case of the neural network is on the order of 10, 100, 1,000. When you think of images, images are a collection of pixels. One of the most iconic images from computer vision, in the bottom left there, is Lena. I encourage you to Google it and figure out the story behind that image; it is quite shocking, which I
found out only recently. So once again, computer vision these days is dominated by data-driven approaches, by machine learning, where all of the same methods that are used on other types of data are used on images. The input is just a collection of pixels, and pixels are numbers: discrete values from 0 to 255. So we can think of images in exactly the same way as what we've talked about previously. It's just numbers, and we can do the same kinds of things. We can do supervised learning, where you have an input image and an output label; the input image here is a picture of a woman, and the label might be "woman". Unsupervised learning, which we'll look at briefly as well, is clustering images into categories. And again, there's semi-supervised and reinforcement learning: in fact, the Atari games I talked about yesterday do some pre-processing on the images. They're doing computer vision, using convolutional neural networks, as we'll discuss today.

The pipeline for supervised learning is again the same. There's raw data in the form of images, and there are labels on those images. A machine learning algorithm performs feature extraction, trains given the inputs and outputs (the images and the labels of those images), constructs a model, and then we test that model and get a metric: an accuracy. Accuracy is the term most often used to describe how well a model performs; it's a percentage. I apologize for the constant presence of cats throughout this course. I assure you this course is about driving, not cats.

But images are numbers. We take it for granted; as human beings, we're really good at converting visual perception into semantics. We see this image and we know it's a cat, but a computer only sees numbers: RGB values for a colored image, three values for every single pixel, each from 0 to 255. And so given such an image, we can think of two problems: one is regression and the other is classification.
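To make "images are just numbers" concrete, here is a minimal NumPy sketch. The "image" is synthetic random data standing in for a real photo, but it has exactly the structure described above: height by width by three color channels, with discrete values from 0 to 255.

```python
import numpy as np

# A synthetic 4x4 RGB "image" standing in for a real photo:
# height x width x channels, each value a discrete intensity 0..255.
image = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

print(image.shape)          # (4, 4, 3): 4 rows, 4 columns, 3 channels (R, G, B)
print(int(image[0, 0, 0]))  # the red value of the top-left pixel, in 0..255
```

Everything the algorithms in this lecture do starts from an array like this one.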
Regression is when, given an image, we want to produce a real-valued output. So if we have an image of the forward roadway, we want to produce a value for the steering wheel angle, and if you have an algorithm that's really smart, it can take any image of the forward roadway and produce the perfectly correct steering angle that drives the car safely across the United States. We'll talk about how to do that, and where that fails. Classification is when the input is again an image and the output is a class label, a discrete class label. Underneath it, though, is often still a regression problem: what's produced is a probability that this particular image belongs to a particular category, and we use a threshold to chop off the outputs associated with low probabilities, take the labels associated with the high probabilities, and convert it into a discrete classification.

I mentioned this yesterday, but it bears saying again: computer vision is hard. We take it for granted as human beings; we're really good at dealing with all of these problems. There's viewpoint variation: the object looks totally different in terms of the numbers behind the image, in terms of the pixels, when viewed from a different angle. There's scale: objects look totally different when you're standing far away from them or up close. We're good at recognizing that they're different sizes but still the same object, yet that's a really hard problem, because those sizes can vary drastically. We talked about occlusions and deformations; with cats, a well-understood problem. There's background clutter: you have to separate the object of interest from the background, and given the three-dimensional structure of our world there's often a lot going on in the background. There's intra-class variation that's often greater than inter-class variation, meaning objects of the same type often have more variation among themselves than against the objects you're trying to separate them from.
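The classification-via-regression idea above can be sketched in a few lines of Python. This is an illustration, not the lecture's actual code: the class names and logit values are made up, and the 0.5 threshold is arbitrary. It shows a network's raw scores being turned into probabilities (with a softmax) and then chopped by a threshold into a discrete label.

```python
import numpy as np

def softmax(scores):
    # Shift by the max for numerical stability, exponentiate, normalize to sum to 1.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

classes = ["red", "yellow", "green"]
# Hypothetical raw network outputs (logits) for one traffic-light image.
logits = np.array([0.5, 0.1, 3.0])

probs = softmax(logits)
# Threshold: accept the top label only if its probability is high enough.
label = classes[int(np.argmax(probs))] if probs.max() >= 0.5 else "uncertain"
print(label)  # "green": the third logit dominates
```

The regression happens inside `softmax`; the classification is just the `argmax` plus the threshold.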
Then there's the hard one for driving: illumination. Light is how we perceive things, the reflection of light off a surface, and the source of that light changes the way an object appears. We have to be robust to all of that.

The image classification pipeline is the same as I mentioned. It's a classification problem, so there are categories (cat, dog, mug, hat), and you have a bunch of image examples of each of those categories. The input is just those images paired with the categories, and you train to estimate a function that maps from the images to the categories. For all of that you need data, a lot of it. Unfortunately, while there's a growing number of data sets, they're still relatively small. We get excited that there are millions of images, but there are not billions or trillions of images. These are the data sets you will see most often if you read the academic literature. MNIST, the one that's been beaten to death and that we'll use as well in this course, is a data set of handwritten digits where the categories are zero to nine. ImageNet, one of the largest fully labeled image data sets in the world, has images with a hierarchy of categories from WordNet; what you see there is a labeling of which images, associated with which words, are present in the data set. CIFAR-10 and CIFAR-100 are tiny images used to prove, in a very efficient and quick way, offhand, that the algorithm you're trying to publish on or impress the world with works well. It's a small data set; CIFAR-10 means there are 10 categories. And Places is a data set of natural scenes: woods, nature, city, and so on.

So let's look at CIFAR-10. It's a data set of 10 categories (airplane, automobile, bird, cat, and so on), shown there with sample images as the rows. Let's build a classifier that's able to take images from one of these 10 categories and tell us what is shown in the image. How do we do that? Once again, all the algorithm sees is numbers, so
we have to have, at the very core, an operator for comparing two images. If I'm given an image and I want to say whether it's a cat or a dog, I want to compare it to images of cats and to images of dogs and see which one matches better, so there has to be a comparative operator. One way to do that is to take the absolute difference between the two images pixel by pixel, the difference between each individual pixel (shown on the bottom of the slide for a 4x4 image), and then sum that pixel-wise absolute difference into a single number. If the images are totally different pixel-wise, that will be a high number; if it's the same image, the number will be zero. It's the absolute value of the difference. That's called the L1 distance. (When we speak of distance we usually mean L2 distance, but it doesn't matter here.)

So we can build a classifier that just uses this operator to compare a query image to every single image in the data set, and picks the category that's closest under this comparative operator: I have a picture of a cat, I look through the data set, find the image that's closest to this picture, and say that image's category is the category this picture belongs to. If we just flipped a coin and randomly picked which category an image belongs to, the accuracy would be on average 10%; it's random. The accuracy we achieve with our brilliant image-difference algorithm, which just goes through the data set and finds the closest image, is 38%, which is pretty good. It's way above 10%. You can think of this operation of looking through the data set and finding the closest image as what's called k-nearest neighbors, where k in this case is one: you find the one closest neighbor to the image you're asking a question about and accept the label of that neighbor. You could do the same thing, increasing k.
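That pixel-wise comparison can be written directly in NumPy. A minimal sketch of the L1 distance described above, on a made-up pair of 2x2 grayscale images:

```python
import numpy as np

def l1_distance(a, b):
    # Pixel-wise absolute difference, summed into a single number.
    # Cast to int so uint8 images don't wrap around on subtraction.
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

a = np.array([[10, 20], [30, 40]])
b = np.array([[10, 25], [20, 40]])
print(l1_distance(a, b))  # |0| + |5| + |10| + |0| = 15
print(l1_distance(a, a))  # identical images give distance 0
```

Identical images give zero; the more the pixels disagree, the larger the number.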
Increasing k to two means you take the two nearest neighbors: you find the two images closest, in terms of pixel-wise image difference, to this particular query image, and look at which categories those belong to. What's shown up top on the left is the data set we're working with: red, green, blue. What's shown in the middle is the one-nearest-neighbor classifier. This is how you segment the entire space, and if a point falls into any of these regions, it is immediately assigned by the nearest-neighbor algorithm to that region's category. With five nearest neighbors, there's immediately an issue: there are white regions, tie-breaker regions, where your five closest neighbors come from various categories, so it's unclear which one you belong to.

This is a good example of parameter tuning. You have one parameter, k, and your task as a teacher of machine learning, teaching this algorithm to do your learning for you, is to figure out that parameter. That's called parameter tuning, or hyperparameter tuning as it's called for neural networks. On the bottom right of the slide, the x-axis is k, as we increase it from zero to 100, and the y-axis is classification accuracy. It turns out that the best k for this data set is seven; with seven nearest neighbors, we get a performance of about 30%. And I should say that the way we get that number, as with a lot of the machine learning pipeline, is that you separate the data into parts: a part of the data set you use for training and another part you use for testing. You're not allowed to touch the testing part; that's cheating. You construct your model of the world on the training data set, and you use what's called cross-validation, where you take a small part of the training data (shown as fold five there, in yellow), leave that part out of the training, and then use it as part of the hyperparameter tuning.
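The k-nearest-neighbors classifier described above can be sketched as follows. The tiny "data set" is made up purely for illustration: dark 2x2 images labeled 0, bright ones labeled 1.

```python
import numpy as np
from collections import Counter

def knn_predict(train_images, train_labels, query, k=1):
    # L1 distance from the query to every training image.
    dists = np.abs(train_images - query).reshape(len(train_images), -1).sum(axis=1)
    nearest = np.argsort(dists)[:k]
    # Majority vote among the k nearest neighbors.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up dataset: dark images -> class 0, bright images -> class 1.
train = np.array([[[10, 10], [10, 10]],
                  [[20, 20], [20, 20]],
                  [[200, 200], [200, 200]],
                  [[220, 220], [220, 220]]], dtype=float)
labels = np.array([0, 0, 1, 1])

query = np.full((2, 2), 210.0)
print(knn_predict(train, labels, query, k=3))  # two of the three nearest are bright: 1
```

With k=1 this is exactly the "find the closest image" classifier; larger k takes a vote.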
As you train, you figure out, with that yellow part, fold five, how well you're doing; then you choose a different fold and see how well you're doing, and keep playing with parameters, never touching the test part. When you're ready, you run the algorithm on the test data to see how well you really do, how well it really generalizes. Yes, question? The question was: is there any good intuition behind what a good k is? There are general rules for different data sets, but usually you just have to run through it: grid search, brute force. Yes, question? Good question. The question was: is each pixel one number or three numbers? For the majority of computer vision throughout its history, grayscale images were used, so it was one number. But RGB is three numbers, and there's sometimes a depth value too, so it's four numbers: if you have a stereo-vision camera that gives you depth information for the pixels, that's a fourth. And if you stack two images together it could be six. In general, everything we work with will be three numbers per pixel. Yes? The question was: for the absolute-value example, is it just one number per pixel? Exactly right. In that case it was grayscale images, not RGB images.

So this algorithm is pretty good. If we optimize the hyperparameters of this algorithm, choosing k of 7, which seems to work well for this particular CIFAR-10 data set, we get about 30% accuracy. It's impressive, higher than 10%. Human beings perform at slightly above 94% accuracy for CIFAR-10: given one of these tiny images (it's like a little icon), a human being is able to accurately determine one of the 10 categories 94% of the time. And the current state-of-the-art convolutional neural network is at 95.4% accuracy. Believe it or not, it's a heated battle, but the critical fact here is that it has recently surpassed humans.
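The fold-based tuning procedure just described can be sketched like this. The data here is synthetic and the candidate k values are arbitrary, so this illustrates the mechanics of cross-validation rather than the lecture's actual pipeline; a real run would use the CIFAR-10 training set.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    # Same L1-distance k-nearest-neighbors classifier as before.
    dists = np.abs(train_x - query).reshape(len(train_x), -1).sum(axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def cross_validate(train_x, train_y, k, folds=5):
    # Hold out one fold at a time, "train" on the rest, average the accuracy.
    idx = np.arange(len(train_x))
    scores = []
    for fold in np.array_split(idx, folds):
        mask = np.ones(len(train_x), dtype=bool)
        mask[fold] = False  # leave this fold out of training
        correct = sum(knn_predict(train_x[mask], train_y[mask], train_x[i], k) == train_y[i]
                      for i in fold)
        scores.append(correct / len(fold))
    return float(np.mean(scores))

# Synthetic data: class 0 near intensity 10, class 1 near intensity 200.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10, 5, (50, 4)), rng.normal(200, 5, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
shuffle = rng.permutation(100)
x, y = x[shuffle], y[shuffle]

# Grid search over candidate k values, never touching any test set.
best_k = max([1, 3, 5, 7], key=lambda k: cross_validate(x, y, k))
print(best_k)
```

The test set plays no part in this loop; it is only used once, at the very end, to report generalization.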
And it has certainly surpassed the k-nearest-neighbors algorithm. So how does this work? Let's briefly look back. It all still boils down to this little guy, the neuron, which sums the weights of its inputs, adds a bias, and produces an output based on a smooth activation function. Yes, question? Sorry, the question was: you take a picture of a cat, so you know it's a cat, but that's not encoded anywhere; you have to write that down somewhere. Yes, you have to write, as a caption, "this is my cat." And the unfortunate thing, given the internet and how witty it is, is that you can't trust the captions on images, because maybe you're just being clever and it's not a cat at all; it's a dog dressed as a cat. Yes, question? Sorry, CNNs do better than what? The question was: do convolutional neural networks generally do better than nearest neighbors? There are very few problems on which neural networks don't do better. Yes, they almost always do better, except when you have almost no data; you need data. And a convolutional neural network isn't some special magical thing. It's just a neural network with some cheating up front, some tricks that I'll explain, to try to reduce the size and make it capable of dealing with images.

So again, in the case we looked at, classifying an image of a number, as opposed to doing fancy convolutional tricks, we just take the entire 28x28-pixel image, that's 784 pixels, as the input: 784 input neurons, 15 neurons in the hidden layer, and 10 neurons in the output. Everything we'll talk about has the same exact structure, nothing fancy. There is a forward pass through the network, where you take an input image and produce an output classification, and there is a backward pass through the network, through backpropagation, where you adjust the weights when your prediction doesn't match the ground-truth output. And learning just boils down to optimization: optimizing a smooth, differentiable function, the loss function, which is usually as simple as a squared error between the true output and the one you actually got.
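That little guy can be written in a couple of lines. This sketch assumes a sigmoid activation, one common choice of smooth, differentiable activation function; the inputs, weights, and bias are made up.

```python
import numpy as np

def neuron(x, w, b):
    # A single neuron: weighted sum of inputs plus a bias, passed through
    # a smooth activation function (a sigmoid here).
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

def squared_error(y_true, y_pred):
    # The loss we would minimize during training.
    return (y_true - y_pred) ** 2

x = np.array([0.5, -1.0, 2.0])  # made-up inputs
w = np.array([0.1, 0.2, 0.3])   # made-up weights
b = 0.05                         # bias
y = neuron(x, w, b)
print(0.0 < y < 1.0)             # True: a sigmoid output is always in (0, 1)
print(squared_error(1.0, y))
```

Because both the activation and the loss are smooth, backpropagation can compute gradients through them and nudge `w` and `b` toward lower error.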
So what's the difference? What are convolutional neural networks? Convolutional neural networks take inputs that have some spatial consistency, some spatial meaning in them, like images. There are other things you can think of: there's the dimension of time, and you can input an audio signal into a convolutional neural network. For every single layer that's a convolutional layer, the input is a 3D volume and the output is a 3D volume. (I'm simplifying, because you could call it 4D too, but it's 3D: there's height, width, and depth.) So that's an image: the height and width are the height and width of the image, and the depth for a grayscale image is one, for an RGB image is three, and for a 10-frame video of grayscale images the depth is 10. It's just a volume, a three-dimensional matrix of numbers. The only thing a convolutional layer does is take a 3D volume as input, produce a 3D volume as output, and apply some smooth function to the sum of its inputs, with parameters that you tune, that you try to optimize. That's it. These are Lego pieces that you stack together, in the same way as we talked about before.

So what types of layers does a convolutional neural network have? There's the input: for example, a color image of 32x32 will be a volume of 32x32x3. The convolutional layer takes advantage of the spatial relationships of the input neurons. A neuron in a convolutional layer is the same exact neuron as in a fully connected network, the regular network we talked about before, but it has a narrower receptive field; it's more focused. The inputs to a neuron in the convolutional layer come from a specific region of the previous layer, and the parameters on each filter (you can think of it as a filter, because you slide it across the entire
image) are shared. So, thinking about two layers: as opposed to connecting every single pixel in the first layer to every single neuron in the following layer, you only connect the neurons in the input layer that are close to each other to the output layer, and then you enforce the weights to be tied together spatially. What that results in is a filter. Every single channel of the output you can think of as a filter that gets excited, for example, for an edge: when it sees this particular kind of edge in the image, it will get excited, whether in the top left of the image, top right, bottom left, or bottom right. The assumption there is that a powerful feature for detecting a cat is just as important no matter where in the image it is. This allows you to cut away a huge number of connections between neurons, but it still boils down, on the right, to a neuron that sums a collection of inputs and applies weights to them.

The spatial arrangement of the output volume relative to the input volume is controlled by three things. First, the number of filters: for every single "filter" you get an extra channel in the output. So for the very first layer, if the input is 32x32x3 (an RGB image of 32x32) and the number of filters is 10, then the resulting depth, the resulting number of stacked channels in the output, will be 10. Stride is the step size of the filter that you slide along the image; often that's just one or three, and it directly reduces the spatial size, the width and height, of the output image. And then there's a convenient thing that's often done: padding the image on the outsides with zeros, so that the input and the output have the same height and width.
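Those three knobs (filter count, stride, zero-padding) determine the output volume through a standard formula: out = (in - filter + 2 * padding) / stride + 1, applied to width and height, with the depth set by the number of filters. A small sketch:

```python
def conv_output_shape(width, height, num_filters, filter_size, stride, padding):
    # Standard formula for a convolutional layer's output size:
    # out = (in - filter + 2 * padding) / stride + 1, per spatial dimension.
    out_w = (width - filter_size + 2 * padding) // stride + 1
    out_h = (height - filter_size + 2 * padding) // stride + 1
    # Output depth is simply the number of filters.
    return out_w, out_h, num_filters

# 32x32 RGB input, 10 filters of size 5x5, stride 1, zero-padding of 2:
print(conv_output_shape(32, 32, 10, 5, 1, 2))  # (32, 32, 10): padding preserves width/height
```

With padding of 2 and a 5x5 filter at stride 1, the spatial size is preserved exactly as described above, and the depth becomes 10.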
This is a visualization of convolution, and I encourage you to think offline about what's happening; it's crudely similar to the way human vision works, if there are any experts in the audience. The input here on the left is a collection of numbers (0, 1, 2), and there are two filters, shown as w0 and w1. The different weights applied by those filters are shown in red, and each of the filters has a depth, just like the input: a depth of three, so there are three of them in each column. You slide each filter along the image, keeping the weights the same; this is the sharing of the weights. For your first filter, you pick the weights (this is an optimization problem) in such a way that it fires, it gets excited, for useful features and doesn't fire for features that aren't useful. Then there's a second filter that fires for its own useful features and produces a signal on the output: a positive number meaning there's a strong feature in that region, and a negative number if there isn't. But each filter stays the same as it slides. This allows for a drastic reduction in parameters, so you can deal with inputs that are, for example, a thousand-by-thousand-pixel image, or video. There's a really powerful concept there: the spatial sharing of weights means there's a spatial invariance to the features you're detecting, and it allows you to learn from arbitrary images. You don't have to be concerned about pre-processing the images in some clever way; you just give it the raw image.

There's another operation: pooling. It's a way to reduce the size of the layers, in this case by max pooling: taking a collection of outputs and summarizing that collection of pixels, such that the output of the pooling operation is much smaller than the input.
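Both the sliding-filter operation and max pooling can be sketched in plain NumPy. The 4x4 "image" and the edge-like 2x2 filter here are made up for illustration, and this single-channel version omits the depth dimension of the real layers.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide one shared filter across the image; each output value is the
    # sum of the element-wise product over the filter's receptive field.
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def max_pool(feature_map, size=2):
    # Keep only the largest activation in each size x size block.
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    return feature_map[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])  # responds to left-right contrast
features = convolve2d(image, edge_filter)
print(features.shape)            # (3, 3): the filter fits 3x3 positions at stride 1
print(max_pool(features).shape)  # (1, 1): 2x2 pooling shrinks the map further
```

The same `edge_filter` weights are reused at every position, which is exactly the weight sharing described above; pooling then throws away the precise location while keeping the strongest response.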
The justification is that you don't need a high-resolution localization of exactly which pixel is important in the image. You don't need to know exactly which pixel is associated with the cat ear or the cat face, as long as you know roughly that it's around that part, and that removes a lot of complexity from the operations. Yes, question? The question was: when is too much pooling, when do you stop pooling? Pooling is a very crude operation, and one thing you need to know is that it doesn't have any learnable parameters, so you can't learn anything clever about pooling. You're just picking, in this case with max pooling, the largest number, so you're reducing the resolution; you're losing information. There's an argument that you're not losing that much information, as long as you're not pooling the entire image into a single value, and you're gaining training efficiency and memory; you're reducing the size of the network. It's definitely something people debate, and it's a parameter you play with to see what works for you.

Okay, so what does this thing look like as a whole, a convolutional neural network? The input is an image. There's usually a convolutional layer, a pooling operation, another convolutional layer, another pooling operation, and so on. At the very end, if the task is classification, after the stack of convolutional layers and pooling layers there are several fully connected layers: you go from the spatial convolutional operations to fully connecting every single neuron in a layer to the following layer. You do this so that, by the end, you have a collection of neurons, each one associated with a particular class. So for what we looked at yesterday, where the input is an image of a number 0 through 9, the output would be 10 neurons. You boil down that image with a stack of convolutional layers, with one or two or three fully connected layers at the end, all leading to 10 neurons, and each of those neurons' job is to get fired up when it sees a particular
number, and for the other ones to produce a low probability. This kind of process is how you get the 95% accuracy on the CIFAR-10 problem. And this here is the ImageNet data set I mentioned: it's how you take an image of a leopard or a container ship and produce a probability that it is a container ship or a leopard. Also shown there are the other top candidate labels, in terms of their confidence.

Now, you can use the same exact operation by chopping off the fully connected layers at the end, so that, as opposed to mapping from an image to a prediction of what's contained in the image, you map from the image to another image. You can train that output image to be one that gets excited spatially, meaning it gives you a high, close-to-one value for areas of the image that contain the object of interest, and a low number for areas that are unlikely to contain it. From this you can go from, on the left, an original image of a woman on a horse, to a segmented image: knowing where the woman is, where the horse is, and where the background is. The same process can be done for detecting objects. You can segment the scene into a bunch of candidates for interesting objects, and then go through those candidates one by one and perform the same kind of classification as in the previous step, where the input is an image and the output is a classification. Through this process of hopping around an image, you can figure out exactly the best way to segment, say, the cow out of the image. That's called object detection.

Okay, so how can these magical convolutional neural networks help us in driving? This is a video of the forward roadway from a data set that we'll look at, which we've collected from a Tesla. But first, let me look at driving briefly: the general driving task, from the human perspective. On average, an American driver in the United States drives 10,000 miles a year, a little more for
rural, a little less for urban. There are about 30,000 fatal crashes and 32,000-plus, sometimes as high as 38,000, fatalities a year. This includes car occupants, pedestrians, bicyclists, and motorcycle riders. This may be a surprising fact, but in a class on self-driving cars we should remember it: ignore the 59.9%, that's "other." The most popular cars in the United States are pickup trucks: the Ford F-Series, the Chevy Silverado, the Ram. It's an important point that we're still married to wanting to be in control. So one of the interesting cars that we look at, and the car that the data set we provide to the class is collected from, is a Tesla. It's the one that sits at the intersection of the Ford F-150 and the cute little Google self-driving car on the right: it's fast, it allows you to have a feeling of control, but it can also drive itself hundreds of miles on the highway if need be. It allows you to press a button and the car takes over. It's a fascinating trade-off, transferring control from the human to the car. It's a transfer of trust, and it's a chance for us to study the psychology of human beings as they relate to machines at 60-plus miles an hour.

In case you're not aware, a little summary of human beings: we're distracted. We like to text, use the smartphone, watch videos, groom, talk to passengers, eat, drink. Texting: 169 billion texts were sent in the US every single month in 2014. On average, five seconds: that's the time our eyes spend off the road while texting. Five seconds. That's the opportunity for automation to step in. More than that, there's what NHTSA refers to as the four Ds: drunk, drugged, distracted, and drowsy. Each one of those is an opportunity for automation to step in; drunk driving, perhaps, stands to benefit significantly from automation.

So let's look at the miles, the data there. Three trillion (3 million million) miles are driven every year. And Tesla autopilot, our case study for this class, has driven in full autopilot mode,
that is, the car driving by itself, 300 million miles as of December. For human-controlled vehicles there is one fatality per 90 million miles, so about 30-plus thousand fatalities a year, and currently, under Tesla autopilot, there's one fatality. There are a lot of ways you can tear that statistic apart, but it's one to think about: already, perhaps, automation results in safer driving. The thing is, we don't understand automation, because we don't have the data. We don't have the data on the forward roadway video, we don't have the data on the driver, and we just don't have that many cars on the road today that drive themselves. So we need a lot of data. We'll provide some of it to you in the class, and as part of our research at MIT we're collecting huge amounts of it, of cars driving themselves, and collecting that data is how we get to understanding.

So, talking about the data that we'll be training our algorithms on: here is a Tesla, Model S and Model X. We have instrumented 17 of them and collected over 5,000 hours and 70,000 miles, and I'll talk about the cameras that we put in them. We're collecting video of the forward roadway. This is a highlight of a trip from Boston to Florida by one of the people driving a Tesla; what's also shown, in blue, is the amount of time that autopilot was engaged, currently zero minutes, and then it grows and grows, for prolonged periods of time. So for hundreds of miles, people engage autopilot. Out of 1.3 billion miles driven in Teslas, 300 million are on autopilot; you do the math, that's roughly 25%. We are collecting data of the forward roadway and of the driver; we have two cameras on the driver, and what we're providing with the class are epochs of time of the forward roadway, for privacy considerations. The cameras used to record are your regular webcam, the workhorse of the computer vision community, the C920, with some special lenses on top. Now, what's special about these webcams? Nothing that costs 70 bucks can be that good, right? What's special
about them is that they do on-board compression, which allows you to collect huge amounts of data, use reasonably sized storage capacity to store that data, and train your algorithms on it.

So on the self-driving side, what do we have to work with? How do we build a self-driving car? There are the sensors, radar, lidar, vision, audio, all looking outside, helping you detect the objects in the external environment, localize yourself, and so on. And there are the sensors facing inside: visible-light camera, audio again, and infrared camera to help detect pupils. We can decompose the self-driving-car task into four steps. Localization: answering "where am I?" Scene understanding: using the texture and information of the scene around you to interpret the identity of the different objects in the scene, and the semantic meaning of their movement. Movement planning: once you've figured all that out, found all the pedestrians, found all the other cars, how do I navigate through this maze, this clutter of objects, in a safe and legal way? And driver state: how do I detect, using video or other information about the driver, their emotional state or their distraction level?

Yes, question? Yes, that's a real-time figure from lidar. Lidar is the sensor that provides you the 3D point cloud of the external scene. Lidar is a technology used by most folks working on self-driving cars to give you a strong ground truth of the objects; it's probably the best sensor we have for getting 3D information, the least noisy 3D information, about the external world. Question? So, autopilot is always changing. One of the most amazing things about this vehicle is that the updates to autopilot come in the form of software, so the amount of time it's available changes; it's become more conservative with time. But this is one of the earlier versions, and the second line, in yellow, shows how often the autopilot was available but not turned
on. The total driving time was 10 hours; autopilot was available for 7 hours and was engaged for an hour. This particular person is a responsible, or more cautious, driver, because what you see is that it's raining, autopilot is still available, but they don't engage it. The comment was that you shouldn't trust that one-fatality number as an indication of safety, because the drivers elect to only engage the system when it's safe to do so. That's totally open; there are a lot bigger arguments about that number than just that one. The question is whether that's a bad thing. Maybe we can trust human beings to engage responsibly. Despite the poorly filmed YouTube videos, despite the hype in the media, you're still a human being riding at 60 miles an hour in a metal box with your life on the line. You won't engage the system unless you know it's completely safe, unless you've built up a relationship with it. It's not all the stuff you see where a person gets in the back of a Tesla and starts sleeping, or is playing chess, or whatever; that's all for YouTube. The reality is, when it's just you in the car, it's still your life on the line, and so you're going to do the responsible thing, unless perhaps you're a teenager, and so on, but that never changes no matter what car you're in.

So, the question was: what do you need to see or sense about the external environment to be able to successfully drive? Do you need lane markings? What are the landmarks based on which you do the localization and the navigation? That depends on the sensors. The Google self-driving car, in sunny California, depends on lidar to map the environment in a high-resolution way in order to be able to localize itself based on lidar. Now, I don't know the details of exactly where lidar fails, but it's not good with rain, it's not good with snow, and it's not good when the environment is changing. What snow does is change the appearance, the
reflective texture of the surfaces around us. Human beings are still able to figure stuff out, but a car that's relying heavily on lidar won't be able to localize itself using the landmarks it previously detected, because they look different now with snow. Computer vision can help us with lanes, or with following a car; the two landmarks we use to stay in the lane are the car in front of you and the lane markings on either side. That's the nice thing about our roadways: they're designed for human eyes, so you can use computer vision for lanes, and for cars in front, to follow them. And there is radar, a crude but reliable source of distance information that allows you to not collide with metal objects. So all of that together, depending on what you want to rely on more, gives you a lot of information. The question is, when the messy complexity of real life occurs, how reliable will it be, in the urban environment and so on? So, localization: how can deep learning help? First, a quick summary of visual odometry. It's using monocular or stereo video images to determine your orientation, in this case the orientation of a vehicle, in the frame of the world. All you have to work with is a video of the forward roadway, and with stereo you get a little extra information about how far away different objects are. This is where one of our speakers on Friday will talk about his expertise: SLAM, simultaneous localization and mapping. This is a very well studied and understood problem of detecting unique features in the external scene and localizing yourself based on the trajectory of those unique features. When the number of features is high enough, it becomes an optimization problem: you know this particular lane marking moved a little bit from frame to frame, so you can track that information and fuse everything together in order to estimate your trajectory through three-dimensional space. You also have other
sensors to help you out. You have GPS, which is pretty accurate, not perfect but pretty accurate; it's another signal to help you localize yourself. You also have an IMU: the accelerometer tells you your acceleration, and together with the gyroscope you have six-degree-of-freedom information about how the moving object, the car, is navigating through space. So you could do this the old-school way, as optimization over a unique set of features, like SIFT features. With stereo input, that involves undistorting and rectifying the images; then, from the two images, computing a disparity map, from which you get, for every single pixel, your best estimate of the depth of that pixel, its three-dimensional position relative to the camera. Then you detect unique, interesting features in the scene, SIFT being a popular algorithm for detecting unique features, and then over time you track those features, and that tracking is what allows you, through vision alone, to get information about your trajectory through three-dimensional space. You estimate that trajectory. There are a lot of assumptions, assumptions that bodies are rigid, so if a large object passes right in front of you, you have to figure out what that was; you have to figure out which objects in the scene are mobile and which are stationary. Or you can cheat, which is what we'll talk about, and do it using neural networks, end to end. Now, what does end-to-end mean? This will come up a bunch of times throughout this class, and today. End-to-end means, and I refer to it as cheating because it takes away a lot of the hard work of hand-engineering features, that you take the raw input of whatever sensors; in this case it's taking stereo input from a stereo-vision camera, so two images, a sequence of two images coming
from a stereo-vision camera, and the output is an estimate of your trajectory through space. So, as opposed to doing the hard work of SLAM, of detecting unique features, localizing yourself, tracking those features, and figuring out your trajectory, you simply train the network with some ground truth that you have from a more accurate sensor, like lidar: you train it on a set of stereo-vision inputs, and the output is the trajectory through space. You have separate convolutional neural networks for the velocity and for the orientation, and this works pretty well, but unfortunately not quite well enough, and John Leonard will talk about that. SLAM is one of the places where deep learning has not been able to outperform the previous approaches. Where deep learning really helps is the scene-understanding part: interpreting the objects in the scene, detecting the various parts of the scene, segmenting them, and, with optical flow, determining their movement. Previous approaches for detecting objects, like the traffic-signal classification and detection that we have the TensorFlow tutorial for, used Haar-like features or other types of features that are hand-engineered from the images. Now we can use convolutional networks to replace the extraction of those features. There's a TensorFlow implementation of SegNet, which is the same kind of neural network I talked about; the beauty is that you just apply similar types of networks to different problems, and depending on the complexity of the problem you can get quite amazing performance. In this case the network is fully convolutional, meaning the input is an image, a single monocular image, and the output is a segmented image, where the colors indicate your best pixel-by-pixel estimate of what object is in that part of the scene. This is not using any stereo information, not using any temporal information, so it's processing every single
frame separately, and it's able to separate the road from the trees, from the pedestrians, from other cars, and so on. This is intended to lie on top of a radar/lidar type of technology that's giving you the three-dimensional, or stereo-vision three-dimensional, information about the scene; you're sort of painting that scene with the identity of the objects that are in it, your best estimate of it. This is something I'll talk about tomorrow: recurrent neural networks. We can use recurrent neural networks, which work with temporal data, to process video and also to process audio. In this case, what's shown on the bottom is a spectrogram of audio for a wet road and a dry road. You can look at that spectrogram as an image and process it in a temporal way using recurrent networks: just slide across it and keep feeding it to the network, and it does incredibly well, certainly on the simple task of dry road versus wet road. This is a subtle but very important task, and there are many like it: knowing the texture, the quality, the characteristics of the road, wetness being a critical one. When it's not raining but the road is still wet, that information is very important. Okay, so for movement planning, the same kind of approach. On the right is work from one of our other speakers, Sertac Karaman. The same approach we're using to solve traffic through friendly competition is the same one we can use for what Chris Gerdes does with his race cars: planning trajectories in high-speed movement along complex curves. So we can solve that control problem using optimization, or we can use reinforcement learning, running tens of millions, hundreds of millions of times through a simulation of taking that curve, and learning which trajectory optimizes both the speed at which you take the turn and the safety of the vehicle. Exactly the same thing that you're using for DeepTraffic, and for driver state.
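As an aside, the slide-it-across idea for the wet-road spectrogram can be sketched with plain arrays. This is a minimal sketch, not the course's actual code; the window width and hop size are made-up illustrative values.

```python
def sliding_windows(spectrogram, width, hop):
    """spectrogram: list of per-time-frame feature vectors.
    Returns overlapping windows of `width` frames, advancing `hop`
    frames per step -- the chunks you would feed one at a time
    to a recurrent network."""
    n_frames = len(spectrogram)
    return [spectrogram[t:t + width]
            for t in range(0, n_frames - width + 1, hop)]

# Toy "spectrogram": 10 time frames, each with 4 frequency bins.
spec = [[t * 10 + b for b in range(4)] for t in range(10)]
windows = sliding_windows(spec, width=4, hop=2)  # 4 windows of 4 frames each
```

Each window is then treated like a small image, so the same network machinery discussed above applies directly, one window at a time.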
This is what we'll talk about next week: all the fun face stuff, eyes, face, emotion. We have video of the driver's body and video of the driver's face. On the left is one of the TAs in his younger days; still looks the same, there he is. In that particular case you're doing one of the easier problems, which is detecting where the head and the eyes are positioned, the head and eye pose, in order to determine what's called the gaze of the driver, where the driver is looking, the glance. And we'll talk about these problems from left to right: on the left, in green, are the easier problems; in red are the harder ones, from the computer-vision aspect. So on the left is body pose and head pose; the larger the object, the easier it and its orientation are to detect. And then there is pupil diameter: detecting the pupil, its characteristics, position, and size. And there are microsaccades, things that happen at millisecond frequency, the tremors of the eye. All of this is important information for determining the state of the driver; some of it is possible with computer vision, some is not. This is something that we'll talk about, I think on Thursday: the detection of where the driver is looking. This is from a bunch of the cameras that we have in a Tesla; this is Dan driving a Tesla, and we're detecting exactly which of six regions the driver is looking at. We've converted it into a classification problem: left, right, rear-view mirror, instrument cluster, center stack, or forward roadway. So we have to determine, out of those six categories, which direction the driver is looking. This is what's important for driving: we don't care about the exact XYZ position of where the driver is looking; we care whether they're looking at the road or not. Are they looking at the cell phone in their lap, or are they looking at the forward roadway? And we'll be able to answer that pretty effectively using convolutional neural networks. You can also look at emotion
using CNNs to extract, again converting the complex world of emotion into a binary problem: frustrated versus satisfied. This is video of drivers interacting with a voice navigation system; if you've ever used one, you know it may be a source of frustration for folks. And this is self-reported. Driver emotion is one of the hard problems. If you're in what's called affective computing, the field of studying emotion from the computational side, you know that the annotation side of emotion is a really challenging one: getting the ground truth. Okay, so this guy is smiling, can I label that as happy? He's frowning, does that mean he's sad? Most affective-computing folks do just that. In this case, with self-report, we ask people how frustrated they were on a scale of 1 to 10. Dan, up top, reported a one, so not frustrated, satisfied with the interaction; and the other driver reported it as a nine, very frustrated with the interaction. Now, what you notice is that there's a very cold, stoic look on Dan's face, which here is an indication of happiness, and in the case of frustration the driver is smiling. So this is a good reminder that we can't trust our own human instincts in engineering features and engineering the ground truth; we have to trust the data, trust the ground truth that we believe is the closest reflection of the actual semantics of what's going on in the scene. Okay, so end-to-end driving, getting to the project and the tutorial. If driving is like a conversation, and thank you to someone for clarifying that this is the Arc de Triomphe in Paris in this video, if driving is like a natural-language conversation, then we can think of end-to-end driving as skipping the entire Turing-test component and treating it as end-to-end natural-language generation. What we do is take as input the external sensors, and output the control of the vehicle, and the magic happens in the
middle: we replace that entire step with a neural network. The TA told me not to include this image because it's the cheesiest he's ever seen; I apologize. Thank you, thank you. I regret nothing. So this is to show our path to self-driving cars, but it's to explain a point: that we have a large data set of ground truth. If we were to formulate the driving task as simply taking external images and producing steering commands, acceleration and braking commands, then we have a lot of ground truth. We have a large number of drivers on the road every day, driving, and therefore collecting our ground truth for us, because they're an interested party in producing the steering commands that keep them alive; and therefore, if we were to record that data, it becomes ground truth. So if it's possible to learn this, what we can do is collect data from manually controlled vehicles, and use that data to train an algorithm to control a self-driving vehicle. Okay, so one of the first groups that did this is NVIDIA, where they actually train on an external image, the image of the forward roadway, and a convolutional neural network, a simple vanilla convolutional neural network I'll briefly outline, takes an image in and produces a steering command out. They were able, to some degree, to successfully learn to navigate basic turns and curves, and even stop or make sharp turns at a T-intersection. This network is simple: input on the bottom, output up top. The input is a 66×200-pixel RGB image; shown on the left is the raw input, and then you crop it a little bit and resize it down to 66×200. That's what we have in the code as well, in the two versions of the code we provide for you, one that runs in the browser and one in TensorFlow. It has a few convolutional layers, a few fully connected layers, and an output. This is a regression network: it's producing not a classification of cat versus dog, it's producing a
steering command, how do I turn the steering wheel, that's it; the rest is magic, and we train it on human input. What we have here as a project is an implementation of this system in ConvNetJS that runs in your browser. This is the tutorial to follow and the project to take on. Unlike the DeepTraffic game, this is reality, real input from real vehicles. So you can go to this link; the demo went wonderfully yesterday, so let's see, maybe we'll be two for two. There's a tutorial, and then the actual simulation is on DeepTeslaJS. I apologize, everyone's going there now, aren't they? Does it work on a phone? It does, great. Again, a similar structure: up top is the visualization of the loss function as the network is learning, and it's always training. Next is the layout of the network: there's the specification of the input, 200×66, there's a convolutional layer, there's a pooling layer, and the output is a regression layer, a single neuron. This is a tiny version of the NVIDIA architecture. And then you can visualize the operation of this network on real video. The actual wheel value, produced by the driver or by the Autopilot system, is in blue, and the output of the network is in white. What's indicated in green is the cropping of the image that is then resized to produce the 66×200 input to the network. So once again, amazingly, this is running in your browser, training on real-world video; so you can get in your car today, collect video, input it, and maybe teach a neural network to drive like you. We have the code in ConvNetJS and TensorFlow to do that, and a tutorial. Let me briefly describe some of the work here. The input to the network, for DeepTeslaJS, is a single image; the output is a steering-wheel value between -20 and 20, in degrees. We record, like I said, thousands of hours, but we provide publicly 10 video clips of highway driving
from a Tesla: half are driven by Autopilot, half by a human. The wheel values are extracted from a perfectly synchronized CAN feed; we are collecting all of the messages from CAN, which contain the steering-wheel value, and that's synchronized with the video. We crop and extract the window, the green one I mentioned, and then provide that as the input to the network. So this is a slight difference from DeepTraffic, with the red car weaving through traffic, because there's the messy reality of real-world lighting conditions, and your task, for the most part in this simple steering task, is to stay inside the lane, to stay within the lane markings, and in an end-to-end way learn to do just that. So ConvNetJS is a JavaScript implementation of CNNs, of convolutional networks. It supports fairly arbitrary networks; I mean, all neural networks are simple, but because it runs in JavaScript it's not utilizing the GPU, so the larger the network, the more it's going to be weighed down computationally. Now, unlike DeepTraffic, this isn't a competition, but if you are a student registered for the course, you still have to submit the code, you still have to submit your own car, as part of the class. So the question was: for the amount of data that's needed, is there a general rule of thumb for a particular task, in driving for example? It's a good question. Like I said, neural networks are good memorizers, so you have to have every case you're interested in represented in the training set, as much as possible. In general, if you want to classify the difference between cats and dogs, you want to have at least 1,000 cats and 1,000 dogs, and then you do really well. The problem with driving is twofold. One is that most of the time driving looks the same, and the stuff you really care about is when driving looks different: it's all the edge cases. What we're not good at with neural networks is generalizing from
the common case to the edge cases, to the outliers. So, avoiding a crash: just because you can stay on the highway for thousands of hours successfully doesn't mean you can avoid a crash when somebody runs in front of you on the road. And the other part with driving is that the accuracy you have to achieve is really high. For cat versus dog, a life doesn't depend on your error; for your ability to steer a car inside a lane, you had better be very close to 100% accurate. There's a box for designing the network; there's a visualization of the metrics measuring the performance of the network as it trains; there's a layer-by-layer visualization of what features the network is extracting at every convolutional layer and every fully connected layer; there's the ability to restart the training, and to visualize the network performing on real video; there's the input layer, the convolutional layers, the video visualization. An interesting tidbit, on the bottom right, is a barcode that Will has ingeniously designed. How do I clearly explain why this is so cool? It's a way to synchronize multiple streams of data together through the video itself. Those who have worked with multimodal data, where there are several streams of data, know it's very easy for them to become unsynchronized, especially when a big component of training a neural network is shuffling the data: you have to shuffle the data in clever ways so you're not overfitting any one little aspect of the video, and yet keep the data perfectly synchronized. So what he did, instead of doing the hard work of connecting the steering wheel and the video, is actually putting the steering-wheel value on top of the video as a barcode. The final result is that you can watch the network operate, and over time it learns more and more to steer correctly. I'll fly through this a little bit in the interest of time, just to kind of summarize some of the things that you can play with in terms of tutorials, and
let you guys go. This is the same kind of process, end-to-end driving, with TensorFlow. We have code available on GitHub, just put up on my GitHub under deep Tesla, that takes in a single video, or an arbitrary number of videos, trains on them, and produces a visualization that compares the actual steering wheel and the predicted steering wheel. When the predicted steering wheel agrees with the human driver or with the Autopilot system, it lights up green, and when it disagrees it lights up red, hopefully not too often. Again, these are some of the details of how exactly that's done in TensorFlow. This is a vanilla convolutional neural network: you specify a bunch of layers, convolutional layers, a fully connected layer; you train the model by iterating over batches of images, run the model over a test set of images, and get this result. We have a tutorial, an IPython notebook, up as well. Perhaps the best way to get started with convolutional networks, in terms of our class, is looking at the simplest image-classification problem: traffic-light classification. We have these images of traffic lights; we did the hard work of detecting them for you, so now you have to build a convolutional network that figures out the concept of color and gets excited when it sees red, yellow, or green. If anyone has questions, and we welcome those, you can stay after class. If you have any concerns with Docker, with TensorFlow, with how to win DeepTraffic, just stay after class, or come by Friday 5 to 7. See you guys tomorrow.
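To close the loop on that traffic-light case study: here is a minimal sketch of how the three-class problem is set up. A real solution would be the small convolutional network the tutorial asks you to build; this stand-in just compares mean red and green channel intensities of a cropped patch, and the 1.5 threshold is a made-up illustrative value.

```python
def classify_light(patch):
    """patch: list of (r, g, b) pixel tuples cropped around the lit lamp.
    Returns one of the three classes: 'red', 'yellow', or 'green'.
    A heuristic stand-in for a trained traffic-light classifier."""
    n = len(patch)
    r = sum(p[0] for p in patch) / n  # mean red intensity
    g = sum(p[1] for p in patch) / n  # mean green intensity
    if r > 1.5 * g:
        return "red"
    if g > 1.5 * r:
        return "green"
    return "yellow"  # red and green channels comparably bright

# A mostly-red toy patch:
label = classify_light([(200, 10, 10), (180, 20, 15), (210, 5, 8)])
```

The point is the framing: a continuous image is reduced to a three-way decision, exactly the shape of output a small CNN with a three-class softmax would produce.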