All right, let's dive right in. Today we're talking about something that is, well, it's all around us, but it's almost impossible to see. It's a force that basically shapes our entire physical world, but it has been completely out of reach for our most advanced AIs. You see, it's a kind of intelligence that's so baked into who we are that we don't even recognize it as intelligence. This is the story of the dark matter of robotics.

To really get what we're talking about, I want you to do a little thought experiment with me. Picture a bookshelf, maybe one in your own home, that is just absolutely jammed with books. Now, in your mind's eye, reach out and try to grab one specific book from right in the middle. It's really wedged in there. And pay close attention to what your hand actually does. It's not just a simple reach, grasp, and pull, is it? No. It's a little symphony of tiny adjustments. And right there, in that unconscious performance, lies one of the biggest challenges in the history of AI.

You probably don't even think about these movements. Maybe your fingers press against the book next door, wiggling it to create a tiny bit of space. Or maybe you hook the spine with one finger and slide the book out just enough to get a real grip on it. And what happens if the cover's a little slick and it starts to slip as you pull it out? Before you can even think, "Oh no, I'm dropping it," your hand has already tilted, your grip has changed, and you've pinned it against the shelf to save it. This is the invisible dance, that constant, super-fast conversation between your senses and your muscles.

So this constant, fluid interaction, well, researchers have a name for it: reactive, closed-loop physical common sense. And let's break that down, because every word here is super important. Reactive means it's not pre-planned; it happens in response to the world as it's changing, moment to moment. Closed loop is just a way of saying there's a constant feedback cycle: your eyes see the slip, your nerves feel the pressure change, your brain sends a correction, and your senses report back on how it went, all in milliseconds (there's a tiny code sketch of this loop at the end of this section). And it's a gut feeling for physics, not the equations, but a real intuition for forces, for friction, for weight, and for all the messiness of the real world. Over your lifetime, all this intuition gets compiled, kind of like software, right into your reflexes and muscle memory.

You know, this ability to just do things, to handle objects with this kind of grace and adaptability, feels like second nature to us, but it's really an invisible superpower. It's an intelligence so basic, so ancient, that we just take it for granted. But the second we try to build machines that can live and work in our world, we slam into the reality that this superpower is the one thing they almost always lack.

And that brings us right to the heart of the mystery we're unpacking today. How can something a toddler can do without even thinking be one of the toughest, most stubborn problems in robotics? I mean, think about it. We've built AI that can write poetry, compose music, and crush the world's best Go players at a game with more possible board positions than there are atoms in the universe. And yet we still struggle to build a robot that can reliably pick a piece of fruit without smooshing it, or clear a dinner table without dropping a glass. To really get this, we've got to look back at how this problem was first spotted. And look, this isn't some new frustration.
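Before we get into the history, here's that closed-loop idea as a minimal Python sketch. To be clear, this is just an illustration: the sensor and actuator functions (`read_slip_estimate`, `read_grip_force`, `set_grip_force`) are hypothetical stand-ins, not any real robot's API, and the constants are made-up numbers.

```python
import time

CONTROL_HZ = 200        # reflex-speed loop: one correction every 5 ms
SLIP_THRESHOLD = 0.02   # made-up slip rate that triggers a correction
GRIP_STEP = 0.5         # made-up grip-force increment per correction (N)

def control_loop(read_slip_estimate, read_grip_force, set_grip_force):
    """Sense -> compare -> act -> repeat. No plan, just feedback."""
    while True:
        slip = read_slip_estimate()   # sense: is the book sliding?
        force = read_grip_force()     # sense: how hard are we squeezing?
        if slip > SLIP_THRESHOLD:
            # act: tighten a little; the next iteration reports back
            # on whether that was enough (the "closed" in closed loop)
            set_grip_force(force + GRIP_STEP)
        time.sleep(1.0 / CONTROL_HZ)
```

The point is the shape of the thing: no plan, just sense, correct, and sense again, hundreds of times a second.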
This frustration is a famous paradox that's been bugging AI researchers for over 50 years. Sometimes it's called the paradox of the toddler, for the simple reason that the physical skills of a two-year-old are, in many ways, still well beyond our most advanced machines.

The history here really puts the whole challenge into perspective. It starts back in 1966 with a philosopher named Michael Polanyi. He came up with the idea of tacit knowledge. His famous line was, "We can know more than we can tell." Just think about riding a bike. You can't write a perfect instruction manual for it, can you? The knowledge isn't in words; it's in your body's sense of balance, the little shifts in weight. It's knowledge you can only get by doing.

Then fast forward to 1988, and the brilliant roboticist Hans Moravec puts a name to it with his famous paradox. He realized that the things we think of as hard, like high-level reasoning, actually take very little computation, while the "easy" stuff, basic sensorimotor skills, takes enormous computational power.

And here we are today, and that paradox is still alive and well. This slide just lays it out so clearly. On the left, you've got the world of pure logic. An AI can master chess, something that takes humans years of intense training. It can do calculus in a blink. An industrial robot can do the exact same perfect weld on a car door 10,000 times without ever messing up. But then you look at the right side. This is the stuff evolution has spent hundreds of millions of years getting right: walking on a patch of ice, recognizing a friend in a crowd, and, most importantly for us, adapting to a messy, unpredictable room and recovering when something, say, starts to slip from your grip. These things are completely effortless for us, but they're exactly where rigid, pre-programmed machines just fall apart.

So that raises the question: why? I mean, we're living in the age of large language models, right? AIs that have basically read the entire internet. Why can't they learn this physical intuition? Well, it turns out the answer isn't about how much data they have, but what kind of data they're learning from.

And this is it, right here. Polanyi's idea from the '60s is the absolute key. Physical knowledge just isn't made of words. You can't pack it into a sentence. It's not a list of rules like "if book slips, then readjust grip." No, it's something that only exists inside that continuous, high-speed feedback loop between what you sense and how you act.

Now, let's compare that to what language models actually learn from the internet. They learn what's called semantic common sense, and they are masters at it. It's all about understanding the statistical patterns between words and ideas. For example, an LLM knows that if a sentence starts with "the bird flew out of its," the next word is probably "nest" or "cage," because it's seen that pattern trillions of times. It has a common sense for language, but it's a common sense of text, not texture, of symbols, not slipping.

And this analogy is just perfect. Reading the driver's manual is exactly like an AI learning from the internet. It gives you all the background knowledge, the rules of the road, what the signs mean, the theory of it all. That's semantic knowledge. But reading that book a hundred times will never, ever prepare you for the actual physical feeling of your car hydroplaning on a wet road, or that intuitive flick of the wheel you do when you feel the car start to skid. That is physical knowledge, tacit knowledge.
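To make the "semantic common sense" point from a moment ago concrete, here's a toy illustration. This is nothing like how a real LLM works internally; it just shows the core idea that prediction can come purely from co-occurrence statistics in text, with the tiny corpus below invented for the example.

```python
from collections import Counter

# A made-up three-line "corpus" standing in for trillions of words.
corpus = [
    "the bird flew out of its nest",
    "the bird flew out of its cage",
    "the bird flew out of its nest",
]

prefix = "the bird flew out of its"
continuations = Counter(
    line[len(prefix):].strip()       # the word that followed the prefix
    for line in corpus
    if line.startswith(prefix)
)

total = sum(continuations.values())
for word, count in continuations.most_common():
    print(f"P({word!r} | prefix) ~= {count / total:.2f}")
# -> P('nest' | prefix) ~= 0.67, P('cage' | prefix) ~= 0.33
# Notice what's absent from this data: force, friction, feedback, consequences.
```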
And that kind of knowledge you can only learn by actually holding the wheel and feeling the consequences.

So here's the bottom line. The entire internet, all of it, is basically a passive, third-person recording of the world. It has no proprioception, no feeling of a body moving through space. It gives no chance for intervention; an AI can't read about a ball and then decide to go push it. And most importantly, it has no consequences. In all those trillions of words, there is no data that captures the feeling of an object slipping, and because of that, there's no data on the reflex you need to catch it. That key ingredient, the interactive, closed-loop experience, is just completely missing.

And when you realize that, it leads to a huge conclusion. If we want to teach machines physical common sense, we need a totally new kind of data. We can't just keep showing them pictures of the world. We need data that is born from physical experience. We have to figure out how to record the dance itself. We have to capture that sensorimotor loop.

Now, for decades, the go-to method for this was teleoperation: basically, a human controlling a robot from afar to collect data. But the old ways of doing this were just bad. The interfaces were clunky joysticks and laggy screens with almost no physical feedback. And this awkward setup forces the human operator to switch off their fast, intuitive, reflexive System 1 thinking. They have to start using their slow, deliberate System 2 brain, thinking through every single step: okay, now move the gripper left; now close the fingers. The movements you record are stiff, robotic, and totally missing those fluid little corrections that make us so good at this stuff. And when you train an AI on that bad data, well, you get a bad robot. It's jerky, slow, and inefficient.

So the breakthrough here isn't just about getting more data. It's about inventing a way to get the right data. The holy grail is a data-collection system that's so smooth, so intuitive, that the human operator's natural, reflexive behavior just flows right through to the robot. The goal is to make the interface basically disappear, so we can finally capture that physical intelligence that evolution has been working on for millions of years.

So how's this actually being done? Well, companies like Generalist are building lightweight, ergonomic controllers that let an operator move a robot's hands almost like they're their own. But the real game changer here is high-fidelity force feedback. This means the operator can actually feel what the robot is feeling: the resistance of pushing something, the texture of a surface, the weight of an object. And you know what? A few minutes into using a system like this, something amazing happens. The operator stops planning their moves. They stop thinking, and they just start reacting. And the data that comes out of that is a world away from the old stuff. It's rich with the very soul of physical common sense: all those little reflexes, those real-time recoveries, those tiny, intuitive corrections.

We're at a really cool moment in this whole story. We've defined this huge problem, we've looked at its long history, and we've seen why the old solutions just didn't cut it. And now we're right on the edge of a potential breakthrough. The results of this new approach are honestly mind-blowing. And if this is the kind of stuff that gets you excited, you should definitely subscribe to see where it all goes from here.
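For a sense of what that kind of data capture might look like in practice, here's a minimal sketch. The `controller` and `robot` interfaces below are hypothetical stand-ins, not any company's actual API; the point is the shape of the data: synchronized sense-and-act pairs, recorded fast enough that the operator's reflexes survive in the log.

```python
import time

RECORD_HZ = 100  # fast enough that reflexive micro-corrections survive

def record_episode(controller, robot, duration_s=30.0):
    """Mirror the human onto the robot, feed the robot's forces back to
    the human, and log every synchronized sense-and-act pair en route."""
    episode = []
    t_end = time.time() + duration_s
    while time.time() < t_end:
        action = controller.read_pose()   # what the human's hands just did
        obs = robot.read_sensors()        # camera frames, joint angles, etc.
        wrench = robot.read_wrench()      # contact forces and torques
        robot.command_pose(action)        # the robot mirrors the human...
        controller.render_force(wrench)   # ...and the human feels the robot
        episode.append({"obs": obs, "force": wrench, "action": action})
        time.sleep(1.0 / RECORD_HZ)
    return episode  # one training trajectory: the sensorimotor loop itself
```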
So, we have this new way of thinking and this new tech for capturing data that's packed with human intuition. The huge question is: what happens when you train a giant AI model on this amazing new data? The answer is, you start to see something that looks a lot less like programming and a whole lot more like improvisation. You see flashes of real physical intelligence.

Okay, check out this first example. The robot's task is to put a small metal washer into a really tight foam slot. As it's pushing down, its sensors feel a tiny rotation: the washer starting to slip out of its grip. Now, a normal robot would probably just drop it or jam it in crooked. But this model does something totally different. It pauses the main action, the pushing. It does a super quick, tiny regrasp to stop the slip. And then, this is the part that just feels so human, it gives the washer a little double nudge, a quick push and release, just to make sure it's seated properly. That final nudge wasn't programmed. It's a learned trick for making sure the job is done right.

Here's another one. The robot needs to place a small box into its lid. It picks up the box, but oops, it fumbles it and flips it upside down. That's a total failure for most robots. But instead of just freezing or dropping it, this model immediately does a fluid in-hand regrasp. It basically juggles the box in its fingers to flip it right side up, all in one smooth motion. And then, just like a person would, it gently pats the box down into the lid a couple of times. That final pat isn't really necessary, but it shows this intuitive, learned understanding of making sure something is secure. It's incredible.

This next one shows an even higher level of problem solving. The robot is trying to put something into a cardboard box, but one of the inner flaps is bent in, blocking the way. A simple robot would just keep pushing and fail. This model, though, sees the problem. It uses its other finger, the one that isn't holding the object, as a tool. It reaches out, hooks that annoying flap, and deliberately folds it out of the way. And only after it's cleared the path does it go on to finish the job. It's solving a problem by using its body in a totally new way.

And one last example of this improvisation. The robot needs to grab a Tic Tac container out of a bin, but the container is right up against the wall, so there's no room for the fingers to get around it. The robot's solution is so simple and so smart. It realizes it can't just grab the container directly, so first it does a prep move: it uses one finger to nudge the container away from the wall, out into the middle of the bin. That one little move creates just enough space for it to get a perfect, stable grip. It actually changed the world a little bit to make its goal possible.

And this is the most important thing to get here: none of these clever moves, the catch, the flip, the fold, the nudge, none of them were programmed. There's no if-then statement for a cardboard flap (the little sketch after this section shows that contrast in code). These are emergent behaviors. They're learned physical intuitions that just show up when you train a huge AI model on a massive, diverse dataset that truly captures the physics of the real world. This is the big shift: from fragile, pre-programmed robots to ones with robust, learned intuition.

You know, what we're seeing here is way bigger than just making robots less clumsy. This could be the start of a bridge between what we think of as low-level physical reflexes and high-level intelligent thinking.
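Here's that contrast as a sketch, under the assumption that the learned model is a policy mapping raw observations to actions. The `policy` and `robot` objects are hypothetical, not any lab's real interface.

```python
# The old way: a human tries to enumerate every situation in advance.
def scripted_insert(state):
    if state["slipping"]:
        return "regrasp"
    if state["blocked"]:
        return "push_harder"   # no branch exists for a bent cardboard flap,
    return "push_down"         # so novel situations mean failure

# The new way: one learned function, queried in a closed loop.
def run_learned_policy(policy, robot):
    obs = robot.read_sensors()
    done = False
    while not done:
        action = policy.act(obs)         # nudges, regrasps, flap-folds can
        obs, done = robot.step(action)   # emerge from the training data,
                                         # not from hand-written branches
```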
The suggestion is that maybe those two things, reflexes and reasoning, aren't so separate after all.

Take a look at this demo, where a robot is shown a finished Lego model for just a second and then has to build copies of it from a jumbled pile of bricks. To pull this off, a single AI model has to work on multiple levels at once. At the very lowest level, it's doing all that physical common sense stuff we just saw: nudging a brick into place, regripping a piece that's a little off. But at the exact same time, it's doing high-level reasoning. It has to remember the goal, find the next piece it needs in that messy pile, and plan out the building sequence.

And this is where it gets really, really deep. As these physically grounded models get better, that hard line AI designers have always drawn between low-level motor control and high-level strategy just starts to melt away. Thinking and acting start to become two sides of the same coin, woven together, just like they are for us. Your plan to make coffee is made up of a thousand tiny physical intuitions about handling the filter and pouring the water. As the researchers put it, your high-level plans have to happen in real time, because gravity doesn't wait for anyone.

And this brings us to the biggest takeaway from all of this. For decades, AI research has mostly been a top-down game: starting with abstract logic and symbols and hoping to somehow connect it all to the real world later. But maybe that was all backwards. This new evidence suggests that real general intelligence, the kind that can actually work in our world, doesn't start with symbols. It has to be built up from a foundation of physical experience. It has to start with a body. Real robot intelligence starts with physical common sense.

Unlocking this dark matter of robotics could be the tipping point, the moment when robots finally go from being specialized tools stuck in cages to being general-purpose helpers in our everyday lives. And it leaves us with one final, huge question to think about. We are right at the beginning of this. But what happens to our world, to how we make things, to how we get things, to healthcare, to our own homes, when the machines we build finally develop a real gut feeling for the physical reality we all live in? The possibilities are just staggering, and we'll be here to explain them as they happen. Thanks for tuning in.