Transcript
ABbDB6xri8o • Tesla AI Day Highlights | Lex Fridman
Kind: captions Language: en

Tesla AI Day presented the most amazing real-world AI and engineering effort I have ever seen in my life. I wrote this, and I meant it. Why was it amazing to me? No, not primarily because of the Tesla Bot. It was amazing because I believe the autonomous driving task, and the general real-world robotics perception or planning task, is a lot harder than people generally think. And I also believed the scale of effort in algorithm, data, annotation, simulation, inference compute, and training compute required to solve these problems is something no one would be able to do in the near term. Yesterday was the first time I saw in one place just the kind and the scale of effort that has a chance to solve this: the autonomous driving problem and the general real-world robotics perception and planning problem. This includes the neural network architecture and pipeline, the Autopilot compute hardware in the car, Dojo compute hardware for training, the data and the annotation, the simulation for rare edge cases, and yes, the generalized application of all of the above beyond the car robot to the humanoid form.

Let's go through the big innovations. The neural network: each of these is a difficult and, I would say, brilliant design idea that is either a step or a leap forward from the state of the art in machine learning. First is to predict in vector space, not in image space. This alone is a big leap beyond what is usually done in computer vision, which usually operates in the image space, in the two-dimensional image. The thing about reality is that it happens out there in the three-dimensional world, and it doesn't make sense to be doing all the machine learning on the 2D projections of it onto images. Like many good ideas, this is an obvious one but a very difficult one. Second is the fusion of camera sensor data before the detections, the detections performed by the different heads of the multitask neural network. For now, the fusion is at the multi-scale feature level. Again, in
retrospect, an obvious but a very difficult engineering step of doing the detection and the machine learning on all of the sensors combined, as opposed to doing them individually and combining only the decisions. Third is using video context to model not just vector space but time: at each frame, concatenating positional encodings, multi-cam features, and ego kinematics, using a pretty cool spatial recurrent neural network architecture that forms a 2D grid around the car, where each cell of the grid is an RNN, a recurrent neural network. The other cool aspect of this is that you can then build a map in the space of RNN features and then perhaps do planning in that space, which is a fascinating concept. Andrej Karpathy, I think, also mentioned some future improvements: performing the fusion earlier and earlier in the neural network. So currently the fusion of space and time is late in the network. Moving the fusion earlier on takes us further toward full end-to-end driving with multiple modalities, seamlessly fusing, integrating, the multiple sources of sensory data.

Finally, the place where there's currently, from my understanding, the least amount of utilization of neural networks is planning. So obviously, optimal planning in action space is intractable, so you have to come up with a bunch of heuristics. You can do those manually, or you could do those through learning. So the idea that was presented is to use neural networks as heuristics, in a similar way that neural networks were used as heuristics in the Monte Carlo tree search for MuZero and AlphaZero to play different games, to play Go, to play chess. This allows you to significantly prune the search through action space for a plan that doesn't get stuck in local optima and gets pretty close to the global optimum.

I really appreciated that the presentation didn't dumb anything down, but maybe in all the technical details it was easy to miss just how much brilliant innovation there was here. The move to predicting in vector space is truly
brilliant. Of course, you can only do that if you have the data and you have the annotation for it, but just to take that step is already taking a step outside the box of the way things are currently done in computer vision. Then fusing seamlessly across many camera sensors, incorporating time into the whole thing in a way that's differentiable with these spatial RNNs, and then of course using that beautiful mess of features, both on the individual image side and the RNN side, to make plans using a neural network architecture as a heuristic. I mean, all of that is just brilliant.

The other critical part of making all of this work is the data and the data annotation. First is the manual labeling. So to make the neural networks that predict in vector space work, you have to label in vector space. So you have to create in-house tools, and as it turns out, Tesla hired an in-house team of annotators to use those tools to then perform the labeling in vector space and then project it out into the image space. First of all, that saves a lot of work, and second of all, that means you're directly performing the annotation in the space in which you're doing the prediction. Obviously, as was always the case, as is the case with self-supervised learning, auto labeling is the key to this whole thing. One of the interesting things that was presented is the use of clips of data that include video, IMU, GPS, odometry, and so on, for multiple vehicles at the same location and time, to generate labels of both the static world and the moving objects and their kinematics. That's really cool. You have these little clips, these buckets of data from different vehicles, and they're kind of annotating each other. You're registering them together to then combine into a solid annotation of that particular part of road at that particular time. That's amazing, because the more the fleet grows, the stronger that kind of auto labeling becomes, and the more edge cases you're able to catch that way. Speaking of edge cases, that's what Tesla is using
simulation for: to simulate rare edge cases that are not going to appear often in the data, even when that data set grows incredibly large. And also they're using it for annotation of ultra complex scenes where accurate labeling of real-world data is basically impossible, like a scene with like a hundred pedestrians, which I think is the example they used. So I honestly think the innovations on the neural network architecture and the data annotation are really just a big leap.

Then there's the continued innovation on the Autopilot computer side: the neural network compiler that optimizes latency, and so on. There's, I think I remember, really nice testing and debugging tools for variants of candidate trained neural networks to be deployed in the future, where you can compare different neural networks together. That's almost like developer tools for to-be-deployed neural networks. And it was mentioned that almost 10,000 GPUs are currently being used to continually retrain the network. I forget what the number was, but I think every week or every two weeks the network is fully retrained end to end.

The other really big innovation, but unlike the neural network and the data annotation this is in the future, so to be deployed still, it's still under development, is the Dojo computer, which is used for training. So the Autopilot computer is the computer on the car that's doing the inference, and the Dojo computer is the thing that you would have in a data center that performs the training of the neural network. There's what they're calling a single training tile that is nine petaflops. It's made up of D1 chips that are built in house by Tesla, each chip with super fast I/O, each tile also with super fast I/O, so you can basically connect an arbitrary number of these together, each with the power supply and cooling. And then I think they connected like a million nodes to have a compute center. I forget what the name is, but it's 1.1 exaflops. So combined with the fact that this can arbitrarily scale, I
think this is basically contending to be the world's most powerful neural network training computer.

Again, the entire picture that was presented on AI Day is amazing because the, what would you call it, the Tesla AI machine can improve arbitrarily through the iterative data engine process of auto labeling plus manual labeling of edge cases. So that labeling stage, plus data collection, retraining, deploying, and again you go back to the data collection, the labeling, retraining, and deploying. And you can go through this loop as many times as you want to arbitrarily improve the performance of the network. I still think nobody knows how difficult the autonomous driving problem is, but I also think this loop does not have a ceiling. I still think there's a big place for driver sensing. I still think you have to solve the human-robot interaction problem to make the experience more pleasant. But damn it, this loop of manual and auto labeling that leads to retraining, that leads to deployment, that goes back to the data collection and the auto labeling and the manual labeling, is incredible.

Second reason this whole effort is amazing is that Dojo can essentially become AI training as a service, directly taking on AWS and Google Cloud. So there's no reason it needs to be utilized specifically for the Autopilot computer. Given the simplicity of the way they describe the deployment of PyTorch across these nodes, you can basically use it for any kind of machine learning problem, especially one that requires scale.

Finally, the third reason all this was amazing is that the neural network architecture and data engine pipeline are applicable to much more than just roads and driving. It can be used in the home, in the factory, and by robots of basically any form, as long as it has cameras and actuators, including, yes, the humanoid form. As someone who loves robotics, the presentation of a humanoid Tesla Bot was truly exciting. Of course, for me personally, the lifelong dream has been to build the mind, the robot, that becomes
a friend and a companion to humans, not just a servant that performs boring and dangerous tasks. But to me, these two problems should, and I think will, be solved in parallel. The Tesla Bot, if successful, just might solve the latter problem of perception, movement, and object manipulation, and I hope to play a small part in solving the former problem of human-robot interaction and, yes, friendship. I'm not going to mention love when talking about robots. Either way, all of this, to me, paints a picture of an exciting future. Thanks for watching. Hope to see you next time.
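
As a rough sanity check on the two Dojo figures quoted in the transcript (about nine petaflops per training tile and about 1.1 exaflops for the full compute center), the sketch below works out how many tiles that ratio implies. The function name `tiles_for` and the constants are illustrative assumptions built only from those two quoted numbers. They are not Tesla specifications, and both figures are rounded, so the result is only an order-of-magnitude estimate.

```python
import math

# Figures as quoted in the talk; both are rounded, so treat them as approximate.
TILE_PFLOPS = 9.0       # peak compute of one training tile, in petaflops
CLUSTER_EFLOPS = 1.1    # peak compute of the full compute center, in exaflops


def tiles_for(cluster_eflops: float, tile_pflops: float) -> int:
    """Smallest whole number of tiles whose combined peak meets the target.

    Hypothetical helper for illustration: 1 exaflop = 1000 petaflops.
    """
    return math.ceil(cluster_eflops * 1000.0 / tile_pflops)


tiles = tiles_for(CLUSTER_EFLOPS, TILE_PFLOPS)
print(f"about {tiles} tiles at {TILE_PFLOPS} PFLOPS each")
```

So the two numbers are consistent with a cluster on the order of a hundred-plus tiles. Peak figures like these say nothing about sustained training throughput.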