Transcript
8WqNFDJFXxk • Beyond LLMs: The Rise of World Models and Spatial Intelligence
Okay, so you've heard all the hype about AGI, right? Artificial general intelligence. And yeah, the headlines are all about these giant language models. But what if I told you that behind the scenes, something way more fundamental is happening? The biggest AI labs are all starting to realize something huge: you can't learn true intelligence just from text. Nope. It has to be grounded in the real world. So today, we're going deep on what might just be the final critical piece of the AGI puzzle: world models. This is all about teaching machines to get physics, space, and cause and effect, not by reading a textbook, but by actually experiencing it.

All right, so here's what we've got on tap for this deep dive. We're going to start with why AI has this major grounding problem. Then we'll nail down what exactly a world model is. After that, we'll get into the two competing ideas on how to actually build one. Then we'll check out the big players in the game and what they're up to. We'll also see how this whole thing is scaling up from a tiny sandbox to a full-on crystal ball for the whole planet. And finally, we'll talk about the massive data gold rush that's driving all of this.

Okay, let's kick things off. To really get this, you have to understand a massive limitation in the AI we have today. The big goal for a lot of people is AGI: an AI that can think, plan, and act in the world just like a person. But to pull that off, it needs a deep, almost gut-level understanding of the physical world. Here's the problem. An AI trained only on text, like a large language model, can read the word "gravity." It can even spit out Newton's laws of motion. But it has absolutely no idea what it feels like for an apple to fall from a tree. It's learned the word, but not the reality behind it. And this huge disconnect, this gap between knowing the symbol and understanding the substance, is what we call AI's grounding problem.

And this quote from the AI pioneer Dr. Fei-Fei Li just hits the nail on the head. She says, "LLMs are eloquent but inexperienced, knowledgeable but ungrounded." I mean, think about that. They've totally mastered our language, our art, all our abstract ideas, but they're just words in the dark. It's like this: an LLM can scan 10,000 recipes and write you the most perfect, mouthwatering description of a soufflé you've ever read. It knows all the ingredients, the temperatures, the chemistry, but it has zero clue how to actually crack an egg, or whisk it just right, or feel the heat coming off an oven. It has all the knowledge but none of the experience. That's the grounding problem in a nutshell.

This really gets to the heart of it, right? LLMs are masters of symbols, of words. But when you talk about spatial reasoning, that intuitive physics we do every single second without even thinking about it, they just fall flat. They're, to use a term from the internet, total "wordcels," but they're not "shape rotators." They can guess what a 3D object is, but they don't truly grok it. They don't get its three-dimensionality.

Now, think about how we learn from the second we're born. We're basically little scientists. As a toddler, you don't learn about gravity from some book. Nope. You learn by dropping your spoon off your high chair over and over and over again and watching it fall. You learn about momentum and friction by taking those first wobbly steps and falling down.
Our entire intelligence is built on this foundation of what we see, hear, and touch. It's that deep physical understanding of the world that today's AI is completely missing. So that leads us to the billion-dollar question that's driving pretty much the entire AI industry right now: how do we fix this? How do we finally give AI true spatial intelligence? And the answer everyone is chasing, from Google and OpenAI to Nvidia and Runway, is the world model.

All right, section two. So what exactly is a world model? And I want to be clear here: this is not just another piece of marketing jargon. It's a very specific and incredibly powerful idea for building the engine of spatial intelligence that AI so desperately needs. At its heart, a world model is basically an AI's own internal simulation of the world. Think of it like its imagination. It's like giving the AI its own private video game engine. But, and this is the critical part, in a game like Grand Theft Auto, human developers have spent years hand-coding all the physics, right? They write lines of code that say if a car hits a wall this fast, it should crumple like this. A world model does the complete opposite. It's not given any rules. It has to figure out the rules for itself just by watching our world through tons and tons of video data. It learns that glass shatters, water splashes, and balls bounce, not because someone programmed it to know that, but because it's seen it happen millions of times.

Now, building on Dr. Fei-Fei Li's foundational research, a true world model has to have three key things going for it. First, it has to be generative. This means it can't just recognize the world; it has to be able to create new, believable scenes from scratch. And those scenes have to obey the laws of physics. So if it generates a glass falling off a table, that glass better fall down, not up, and it needs to shatter in a realistic way. Second, it has to be natively multimodal. This is huge. It needs to seamlessly blend different types of data: video, audio, text, even 3D maps, just like we do. This is a total game-changer for robotics. You need to be able to give a robot a 2D map and a simple command like "go to the kitchen" and have it figure out how to turn that into physical actions. And third, it's got to be interactive. A static 3D picture of a city? That's not a world model. It needs to be a living simulation. The model has to understand how to simulate cars driving around, the weather changing, and people interacting with each other in that space.

And this, this is the simple yet brilliant idea at the core of how these models actually learn. The comparison to LLMs is perfect. A model like GPT-4 got so good at language by doing one simple thing over and over: predict the next word. Well, world models do the exact same thing, but for reality itself. By forcing the model to get insanely good at predicting what's going to happen in the next few frames of a video, it has no choice but to learn the underlying physics of that reality. It has to learn that if something goes behind a pillar, it doesn't just disappear. That's object permanence. It has to learn that if you throw a ball, it's going to follow a certain arc. That's an intuitive grasp of gravity. All these complex physical ideas just emerge, not because they were programmed in, but because the model had to learn them to get good at its one job: predict what happens next.
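To make that "predict what happens next" objective concrete, here's a minimal sketch in PyTorch. Everything in it is a stand-in: the tiny convolutional network, the shapes, and the random tensor playing the role of a batch of video clips are illustrative assumptions, not any lab's actual model.

```python
# Minimal sketch of next-frame prediction: see K context frames,
# get penalized for mispredicting frame K+1.
import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    """Toy model: given stacked context frames, predict the next frame."""
    def __init__(self, context_frames: int = 4, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(context_frames * channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, channels, kernel_size=3, padding=1),
        )

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, context_frames * channels, height, width)
        return self.net(context)

model = NextFramePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Stand-in for a batch of 5-frame clips: 4 context frames + 1 target.
clips = torch.rand(8, 5, 3, 64, 64)   # (batch, time, channels, H, W)
context = clips[:, :4].flatten(1, 2)  # stack context frames along channels
target = clips[:, 4]

prediction = model(context)
loss = nn.functional.mse_loss(prediction, target)  # how wrong was the guess?
loss.backward()
optimizer.step()
```

The point of the sketch is the loss, not the network: the only supervision is "match the next frame," and everything else (object permanence, gravity, materials) has to emerge because it helps minimize that one error.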
Okay, so this is where it gets really interesting. While pretty much everyone agrees that world models are the future, there's a huge and fascinating debate on the best way to actually build one. And this isn't just some nerdy technical squabble. It's a deep philosophical split about what a simulation even is. You can basically break the whole field into two camps.

On one side, you've got the explicit 3D folks. Their goal is to take video or images and create a mathematically precise, tangible 3D asset: a polygon mesh like you'd see in a video game, or something newer like a Gaussian splat, which is basically a cloud of smart, colored dots in space. The point is, the final product is a real 3D object you can drop into professional tools like Blender or Nvidia's Omniverse. Then you have the pixel generation camp, and their philosophy is completely different. They basically argue: why bother with all that hard work of creating a perfect 3D model? To the AI, it's all just pixels anyway. So their goal is to skip that 3D step completely and just focus on generating the next believable frame of a video. It's really the difference between being a digital sculptor and being a neural network that's acting like a real-time movie director.

So let's dig into where this explicit 3D approach really shines. Its biggest advantage is precision and control. For instance, if you're Nvidia trying to train a robot to pick up an apple, you need to simulate that scene with absolute accuracy. The robot has to know the exact 3D coordinates, the weight, the friction of that apple. A precise 3D model gives you that. Same thing in Hollywood for virtual production, where you've got a real actor standing in front of a giant LED screen. That digital background has to have correct, stable 3D geometry so the camera can move around it naturally. This whole approach is about plugging into existing professional pipelines that need that kind of mathematical precision.

Okay, now for the pixel generation camp. Their superpower is scalability, and that scalability comes from the data they can use. See, there's only so much clean, labeled 3D data out there, but there are trillions of hours of messy, unstructured video on the internet, and this approach can learn from all of it. Take training an AI for a retail simulation. With the 3D approach, you'd have to pay expensive 3D artists to painstakingly model and animate an angry-customer character. With the pixel approach, you just train a model on thousands of hours of real footage of customer freakouts and generate that scenario whenever you want. This makes the simulation infinitely more varied, way cheaper, and able to capture all those subtle human behaviors that are almost impossible to animate by hand. By the way, if you're finding this breakdown helpful, make sure you hit subscribe so you don't miss our future deep dives.

All right, keeping those two big ideas in mind, we can now start to map out who's doing what. Let's take a look at the key players in this race to simulate reality, because the path they've chosen really tells you everything about what they're trying to achieve. Luma AI is the perfect case study here. They first got famous in the explicit 3D world with their amazing work on something called neural radiance fields, or NeRFs. A NeRF is basically a smart neural network that can create a full 3D scene from just a handful of 2D pictures.
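Since NeRFs come up here, a deliberately stripped-down sketch of the core idea may help: a small network maps a 3D point to a color and a density, and a pixel is rendered by compositing samples along a camera ray. Real NeRFs add positional encoding, view direction, hierarchical sampling, and training against posed photos; none of that is shown, and the tiny MLP below is an illustrative assumption, not Luma's system.

```python
# Toy NeRF: query an MLP at points along a ray, alpha-composite the result.
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 4),  # outputs: RGB color (3) + density (1)
        )

    def forward(self, points: torch.Tensor):
        out = self.mlp(points)
        rgb = torch.sigmoid(out[..., :3])   # color in [0, 1]
        density = torch.relu(out[..., 3])   # non-negative opacity
        return rgb, density

def render_ray(model, origin, direction, n_samples=64, near=0.1, far=4.0):
    # Sample points along the ray and composite front to back.
    t = torch.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction        # (n_samples, 3)
    rgb, density = model(points)
    delta = (far - near) / n_samples                # spacing between samples
    alpha = 1.0 - torch.exp(-density * delta)       # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha[:-1]]), dim=0)
    weights = alpha * trans                         # contribution per sample
    return (weights[:, None] * rgb).sum(dim=0)      # final pixel color

model = TinyNeRF()
pixel = render_ray(model, torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]))
```

Training then compares rendered pixels against the handful of real 2D photos, and the scene geometry falls out of that comparison.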
But even though they were killing it, Luma recently made a huge pivot. They now believe that direct video generation, the pixel approach, is the far more scalable path to AGI. And their logic is simple but powerful: the amount of 3D data in the world is a tiny puddle, but the amount of video data is a massive ocean. For a leader in the 3D space to make a move like that, that's a huge signal to the rest of the industry about where things are probably headed.

Then you've got Runway ML, who are definitely one of the leaders in the pixel generation camp. Their grand vision is straight out of Star Trek: they want to give every creator a personal holodeck. They're building what's called an autoregressive model, which sounds complicated, but it just means the model generates one frame of video, then looks at the frame it just made and uses that to generate the next one, and the one after that. It creates this continuous, interactive experience (there's a minimal code sketch of this loop just below). The analogy is perfect: it's like streaming a video game, but there's no game engine like Unreal running on the server. It's just a giant neural network dreaming up your reality in real time, all based on your prompts. Their focus is all about getting this into the hands of creators and entertainers.

And then there is Tesla, which is a really fascinating example of a hybrid approach that's solving a mission-critical problem today. For their self-driving simulations, they start with a solid explicit 3D foundation. They use techniques like Gaussian splatting to build a precise 3D model of a road from their cars' camera footage. That gives them a stable, geometrically correct stage. But then they use generative, pixel-level AI to be the director of the scene. They can take a real video of a drive on a sunny day and tell the AI, "Okay, run it again, but make it a blizzard." Or they can add a virtual pedestrian who suddenly steps into the road, or even simulate dangerous crashes you could never stage in real life. It really gives them the best of both worlds: the precision of 3D and the infinite variety of generative AI.

So let's just take a beat and sum up this whole competitive landscape. You've got players like Runway ML and Google with its Genie model, and they are all in on pixel generation, focusing on content and interactive worlds. Then you have Luma, who made that famous pivot away from 3D to pixels, all with the grand goal of building AGI. In the hybrid corner, you have Tesla taking a super practical approach, using both methods to solve the incredibly hard problem of self-driving cars. And then you have some really interesting research coming out of places like ByteDance, where they're trying to teach a model to watch a 2D video of a person walking and figure out the 3D path of their arms and legs. I mean, think about how valuable that could be for robotics.
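Here's the autoregressive loop from the Runway discussion above, reduced to a few lines. The `WorldModel` class and its `next_frame` method are hypothetical stand-ins; real systems condition on text prompts and many past frames with far larger networks, but the feed-the-output-back-in structure is the same.

```python
# Illustrative autoregressive rollout: each generated frame becomes
# the input for generating the next one.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Toy stand-in: maps the previous frame to a next frame."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.step = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    @torch.no_grad()
    def next_frame(self, frame: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.step(frame))

model = WorldModel()
frame = torch.rand(1, 3, 64, 64)     # an initial frame (or encoded prompt)

video = [frame]
for _ in range(30):                  # "dream" 30 more frames, one at a time
    frame = model.next_frame(frame)  # the new frame becomes the next input
    video.append(frame)

clip = torch.cat(video, dim=0)       # (31, 3, 64, 64) rolled-out clip
```

The design consequence is exactly what the holodeck analogy implies: because each step only needs the previous output, a user's input can steer the rollout at any frame, with no game engine in the loop.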
Okay, now let's zoom out, way out. So far, we've been talking about simulations on a pretty small scale: a single kitchen, a city block. But the real ambition here doesn't stop at the city limits. The ultimate vision is to scale these world models up to simulate the entire planet. And this isn't science fiction. This escalation of scale is happening as we speak. Right now, a lot of the cutting-edge stuff is at the micro level, simulating one robot in one room. The near future, which companies like Tesla are already deep into, is scaling up to complex systems like an entire city for self-driving cars.

The next logical step, which we're seeing from giants like Google with projects like AlphaEarth, is to go beyond just man-made stuff and simulate global natural systems. And that leads to the ultimate goal, the holy grail of this whole field: to go from just simulation to actual prediction. To build a model so good it can not only copy our world but forecast its future. And Nvidia's Earth-2 project is maybe the most mind-blowing example of this ambition. They are literally building a digital twin of planet Earth. The idea is to create a planetary-scale world model and just pour in every bit of data we have: satellite images, weather station data, ocean sensor readings, everything, to create a true crystal ball for our climate. Just imagine being able to predict a hurricane's path with perfect accuracy weeks in advance, or modeling the exact impact of a new policy on deforestation in the Amazon, or helping farmers predict crop yields as the climate changes. This takes simulation way beyond robotics and into the realm of planetary management. The implications for science and the economy are just staggering.

So, okay, the vision is clear, the ambition is huge, and the computing power from companies like Nvidia is getting more powerful by the day. So what's the holdup? What's the final bottleneck? What is the one thing stopping us from building these incredible digital realities tomorrow? Well, the answer is simple, and it's the oldest problem in machine learning: you've got to get the right data. And this chart shows the problem perfectly. For those big planetary-scale models, we are practically drowning in data from thousands of satellites. Great. For the general pixel generation models, we have the huge but messy ocean of video on places like YouTube. Okay. But when you get to the data you need to train an AI that has a body, like a humanoid robot that needs to walk around our world, the well is almost bone dry.

We are desperately short on what's called egocentric data, or first-person point-of-view data. And this data is so valuable because it doesn't just show what the world looks like. It shows how an agent's own actions, the movement of their hands, their head, change what they see. It is the data of interaction, and without it, a robot can never really learn how to act. This shortage of data has basically started a modern-day gold rush. You have massive corporate projects like Meta's Project Aria, which plans to use AR glasses to capture first-person data from millions of people all over the world. Then you've got scrappy startups literally paying people to wear cameras on their heads and just do everyday stuff like cook dinner or clean the house, all just to get this priceless training data. The whole industry knows that a giant, diverse dataset of first-person interaction is the key to unlocking embodied AI, and the race is on to see who can build the biggest and the best.
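To pin down what "the data of interaction" might actually look like, here's a hypothetical sketch of a single record. Every field name and type here is made up for illustration; real egocentric datasets have their own schemas. The key structural point is that each step stores the agent's actions alongside what it saw.

```python
# Hypothetical schema for one episode of egocentric interaction data.
from dataclasses import dataclass, field

@dataclass
class EgocentricStep:
    timestamp_s: float            # seconds since recording started
    rgb_frame: bytes              # encoded first-person camera image
    head_pose: tuple[float, ...]  # 6-DoF head position + orientation
    hand_keypoints: list[tuple[float, float, float]]  # 3D hand joints
    action_label: str = ""        # e.g. "crack egg", "open cabinet"

@dataclass
class EgocentricEpisode:
    activity: str                                      # e.g. "cooking dinner"
    steps: list[EgocentricStep] = field(default_factory=list)

# The crucial property: consecutive steps pair the agent's own motion
# (head and hands) with the change in what it sees next, which is
# exactly the signal plain third-person video is missing.
```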
And that brings us to our final, and honestly pretty profound, question about where all this is headed. The team at Luma AI put it perfectly in a recent blog post: "Reality is the data set for AGI." I mean, just think about that. If that's true, if the only way to build true artificial intelligence is by building the most accurate simulation of our world, then the defining question of the next decade is this: who is going to own the best copy of reality? The game is no longer just about who has the smartest algorithm or the most computer chips. It's a race to capture, own, and train on the most complete data set of our shared physical world.

I really hope this deep dive gave you a much clearer picture of this incredible race to build world models. If you want to keep up with the technologies that are literally building our future, you know what to do: hit that subscribe button for more explainers that break down the cutting edge of tech. Thanks for watching.