The Cognitive Architecture of Future AI: From LLMs to Multimodal Embodied Systems
QyKSefEvEK8 • 2025-12-13
Hey everyone, and welcome. Today we're diving into something truly mind-bending: how AI is making the incredible leap from being an expert with words to becoming an agent that can actually see, understand, and act in our physical world. So let's kick things off with a fascinating question. We've all seen AI do incredible things, right? Write stories, generate code. But why can that same super-smart AI write a beautiful poem about a cup, yet not do something as simple as pick one up? The answer to that question is the key to understanding the next huge leap for AI.

To get to the bottom of that, we have to start with what we all know: the world of large language models, or LLMs. The real problem with these text-only AIs boils down to something called the symbol grounding problem. An LLM knows the word "cup" because it has seen it in billions of sentences online. It knows all the words that go with "cup," but it has absolutely no idea what a cup is in the real world. It doesn't know its shape, its weight, or that you can't just stick your fingers through it. It's shuffling symbols around without any real-world connection. It doesn't get it.

And that brings us to a really powerful way of thinking about it: LLMs are basically a brain in a vat. They have a gigantic universe of information stored inside, but it's completely cut off from physical reality. This is exactly why they can hallucinate and make up things that sound plausible: there's no reality check. The model can't look out the window and see whether what it's saying actually makes any sense.

So how do you get the brain out of the vat? The first step: you've got to give it senses. And that brings us to the next stage in this evolution, large multimodal models, or LMMs.

Take a look at this chart; it's going to be our roadmap for the whole journey. We'll use it to track how AI is evolving, moving from left to right, from those basic LLMs to the really advanced stuff that's coming. Let's zoom in on the first two columns. See that line for input modalities? LLMs are text only, but look at the LMMs: they can handle text, vision, and audio. Giving AI eyes and ears is a total game-changer for connecting it to the real world. Its understanding is suddenly grounded in what it can actually perceive.

And this is where things get really cool. Google's RT-2 model was a massive breakthrough, because for the very first time a robot could tap into that huge library of knowledge on the internet, all those images and all that text, and use it to figure out how to do something new in the real world. And the results were staggering: RT-2 was nearly three times better at performing tasks it had never been trained on before. That wasn't a tiny improvement; it was a massive leap in the ability to generalize and figure out new things on its own, all thanks to that new multimodal understanding.
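To make that jump from text-only input to grounded, multimodal input a little more concrete, here is a minimal sketch of the difference in interface terms. Everything in it is illustrative: the class and method names are hypothetical placeholders, not any particular model's actual API.

# Hypothetical sketch, not any vendor's real API: the same question asked of a
# text-only model versus a multimodal one.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    text: str                      # the question or instruction
    image: Optional[bytes] = None  # raw pixels, e.g. a photo of the table
    audio: Optional[bytes] = None  # raw waveform, e.g. a spoken command

class TextOnlyLLM:
    def answer(self, prompt: str) -> str:
        # Sees nothing but token statistics: it can describe "a cup" in general,
        # but it has no access to the particular cup in front of you.
        raise NotImplementedError

class MultimodalLMM:
    def answer(self, obs: Observation) -> str:
        # Conditions its answer on pixels (and sound) as well as text, so a
        # question like "is the cup within reach?" can be checked against the
        # actual scene rather than guessed from word co-occurrence.
        raise NotImplementedError

The only point of the sketch is the shape of the interface: once the model's input includes the scene itself, its answers can be grounded in, and checked against, what it actually perceives, rather than inferred from word statistics alone.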
So, okay, we've given the AI senses, but that's not the whole story. To act intelligently, it needs a better way to, well, think. And this brings us to a really fascinating idea: building an AI that thinks a little more like we do. The psychologist Daniel Kahneman proposed that we humans have two different ways of thinking. System 1 is our fast, intuitive gut reaction, the split-second decision when you slam on the brakes. System 2 is our slow, deliberate, logical thinking, the sit-down-and-really-think-it-through mode you use when you're working on a tough puzzle.

The thing is, today's LLMs are almost pure System 1. They are phenomenal pattern matchers, giving you a quick, almost instinctive answer. But, and this is a big but, that's also why they can get things wrong or hallucinate. They're great at quick connections, but they fall apart when a problem needs slow, careful, logical steps. So the real goal for the future is to build an AI that has both. See how this diagram lays it out? You've got the fast, reactive System 1 on one side and the slow, deliberate System 2 on the other, and the secret sauce is right there in the middle, at the integration point. That's what lets the AI be both quick and thoughtful: reacting instantly when it needs to, but also pausing to plan and reason when it hits a complex problem.

So we have an AI with senses, and one with a more sophisticated way of thinking. What's the final piece of the puzzle? Giving that brain a body, so it can finally get out and do things in the world. Let's go back to our roadmap and look at the last column, embodied AI. Check out the real-world grounding row: it says strong, grounded in physical interaction. This is the final stage, where it all comes together: the senses, the smarter thinking, and now physical action. And it's all made possible by a new type of technology called vision-language-action models, or VLAs. Instead of cobbling together separate systems for seeing, thinking, and moving, a VLA bundles it all into one seamless model. It can literally see a scene, understand a command like "Hey, pick up the red apple," and translate that directly into the right physical movements to get it done. (A rough code sketch of that perceive-decide-act loop appears at the end of this transcript.)

And this isn't science fiction; it's happening right now. You've got NVIDIA's GR00T project, which aims to build a general-purpose AI for all kinds of humanoid robots. You've got Tesla pushing forward with its Optimus robot. And then you have research like DexMimicGen, which lets robots learn really complex two-handed jobs just by watching a person do it one time.

So when you put it all together, perception, cognition, action, you realize we are stepping into a brand new frontier. But you also realize that with this kind of power come some seriously profound new responsibilities. The ultimate dream is for AIs to develop what are called emergent capabilities. It's like how a child learns to walk and then, from that, figures out how to run and jump on their own. These embodied AIs could start picking up new skills just by interacting with the world, learning and growing in ways we didn't explicitly program. It's truly unpredictable and, honestly, a little mind-blowing.

But to get to that future, we have to tackle some of the biggest questions humanity has ever faced. Who's in charge of this stuff? Who governs it? How do we guarantee that a robot acting in our world actually shares our values? And as these AIs get more and more complex, what kind of ethical duties might we have toward them? When you boil it all down, it comes back to one single, absolutely crucial question: as we teach our machines to move beyond words and step into our world, the great challenge of our time will be making sure they act not just intelligently, but wisely, and for the good of every single one of us.
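To close with something concrete: one way to picture the vision-language-action loop described above, see a scene, understand a command, translate it into movement, is a single policy queried at every control step. The sketch below is purely illustrative; the class names, the robot interface, and the action format are assumptions of this write-up, not the API of RT-2, GR00T, or any other real system.

# Hypothetical sketch of a vision-language-action (VLA) control loop.
# All names here (VLAPolicy, Action, the robot methods) are illustrative placeholders.
from dataclasses import dataclass
import numpy as np

@dataclass
class Action:
    delta_position: np.ndarray  # small end-effector translation for this step
    delta_rotation: np.ndarray  # small end-effector rotation for this step
    gripper: float              # 0.0 = fully open, 1.0 = fully closed

class VLAPolicy:
    # One model maps (camera image, language instruction) directly to motor
    # commands, instead of separate perception, planning, and control modules.
    def act(self, image: np.ndarray, instruction: str) -> Action:
        raise NotImplementedError

def run_episode(policy: VLAPolicy, robot, instruction: str, max_steps: int = 200) -> bool:
    # Closed loop: perceive, act, re-perceive, until the task succeeds or time runs out.
    for _ in range(max_steps):
        image = robot.get_camera_image()          # assumed robot interface
        action = policy.act(image, instruction)   # e.g. "pick up the red apple"
        robot.apply(action)
        if robot.task_succeeded(instruction):
            return True
    return False

The design point the sketch tries to capture is the "one seamless model" idea from the transcript: the same network that reads the instruction also looks at the pixels and emits the next physical movement, step after step, until the task is done.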