Beyond VLAs: mimic-video and the Future of Generalist Robot Control
3VpXFP5D6zY • 2025-12-24
Transcript preview
Kind: captions · Language: en

All right. Today we're looking at something that could literally change the game in robotics. I mean, imagine a robot learning how to cook, not from millions and millions of pictures, but just by watching a YouTube tutorial. Seriously, let's get into it.

So, let's start with the big question: why is this so hard? A robot can look at a picture of flour and eggs, and then a photo of the finished cake, no problem. But it's missing the most important part, the how. It has to guess at the physics, the timing, the whole process. And that right there is one of the single biggest hurdles in robotics today.

Okay, so the current state of the art, the champs in this field, are called vision-language-action models, or VLAs for short. And look, they're super powerful. They're trained on massive internet datasets of images and text. That's what lets them connect a command like "pick up the apple" with the actual visual of an apple and then, poof, do the action. But, and this is a huge but, there's a fundamental flaw here: their knowledge comes from static, disconnected images. Think about it. They've seen a million photos of a ball, but they've never seen a video of a ball bouncing. They have no real intuitive grasp of physics or of how things change over time.

And this baking analogy nails the difference perfectly. Learning from static images is like seeing a photo of the ingredients and a photo of the final cake; the poor robot has to guess everything that happened in the middle. Learning from video is like watching the whole cooking show step by step. It sees the mixing, the folding, it sees cause and effect. It learns the process itself.

So, what happens? All the heavy lifting of learning actual physics gets pushed onto the robot during its training. And that training requires a ton of super scarce, incredibly expensive data. We're talking about humans literally guiding the robot by hand for hours and hours. That creates a massive data-efficiency bottleneck, and it's seriously holding back how fast robots can learn new skills.

So the big question is: how do we get past this? How do we break the bottleneck? Well, the research we're looking at today proposes a totally new way of thinking, a complete paradigm shift: teaching robots to learn from motion. And this brings us to a whole new class of models called video action models, or VAMs. The star of our show today is a groundbreaking VAM called mimic-video. Now here's the key. Instead of learning from static pictures, it learns directly from the deep internal understanding, the latent space if you want to get technical, of a powerful pre-trained video model.

Okay, so how does this actually work? How does mimic-video turn just watching a video into a physical action? The approach is actually pretty brilliant in its simplicity. You can think of it as a two-part system: the dreamer and the doer. First, the dreamer. That's the big, powerful video model. It doesn't create a perfect video. Instead, it generates a rough, kind of fuzzy video plan, almost like a dream of what success looks like. Then the doer, a much smaller action decoder, watches that dream, and its job is to translate that high-level visual plan into the nitty-gritty, precise motor commands the robot needs.
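As a rough mental model of that dreamer/doer split, here is a minimal sketch in PyTorch-style Python. The class and method names (`VideoDreamer`, `ActionDecoder`, `generate_latents`, `plan_latents`) are hypothetical, not the paper's actual API; the sketch only illustrates the data flow described above: a frozen video model turns the current observation and a task prompt into a latent video plan, and a small trainable decoder maps that plan to a short chunk of motor commands.

```python
import torch
import torch.nn as nn

class VideoDreamer(nn.Module):
    """Stand-in for a large pre-trained video model; kept frozen."""
    def __init__(self, video_model):
        super().__init__()
        self.video_model = video_model  # assumed to expose generate_latents()

    @torch.no_grad()
    def plan_latents(self, observation, task_prompt):
        # Produce a latent "video plan" of the task being completed.
        # It is never decoded to pixels; the latents themselves are the plan.
        return self.video_model.generate_latents(observation, task_prompt)

class ActionDecoder(nn.Module):
    """Small trainable head: latent video plan -> short chunk of robot actions."""
    def __init__(self, latent_dim, action_dim, horizon):
        super().__init__()
        self.action_dim = action_dim
        self.horizon = horizon
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, action_dim * horizon),
        )

    def forward(self, plan_latents):
        # plan_latents: (batch, num_plan_frames, latent_dim)
        pooled = plan_latents.mean(dim=1)      # coarse temporal pooling
        actions = self.net(pooled)             # (batch, action_dim * horizon)
        return actions.view(-1, self.horizon, self.action_dim)
```

The intuition from the episode is that the heavy physics knowledge already lives inside the frozen video model, which is why the trainable doer can stay small and cheap to fine-tune on scarce robot data.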
Okay, but here's the crazy part. This is what really got me. It turns out a perfect, crystal-clear video plan actually makes the robot perform worse. A noisy, blurry, dreamlike plan works way, way better. So why? The answer is just so cool. By intentionally keeping the plan a bit noisy and blurry, it forces the action decoder, the doer, to ignore all the little unimportant details. It can't get distracted by, say, a weird shadow or the exact texture of a tablecloth. It has to focus only on the core physics of the action. That makes the whole system way more robust to real-world randomness. And as a huge bonus, it's way faster to compute, because it doesn't have to waste time generating a perfect video.
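The episode doesn't spell out how the plan is kept noisy, but if the dreamer is a diffusion-style video model, one plausible way to get this effect is to stop the denoising loop early and, optionally, add a little Gaussian noise to the plan latents before handing them to the decoder. The `init_latents` and `denoise_step` methods below are assumed interface names, purely for illustration.

```python
import torch

def noisy_plan(video_model, observation, task_prompt,
               total_steps=50, early_stop_steps=10, extra_noise_std=0.1):
    """Generate a deliberately rough latent plan.

    Two knobs keep the plan "dreamlike": stopping the denoising loop early
    (cheaper and blurrier) and sprinkling in extra Gaussian noise. Both push
    the action decoder toward coarse motion rather than fine texture.
    """
    latents = video_model.init_latents(observation, task_prompt)
    for step in range(early_stop_steps):        # far fewer than total_steps
        latents = video_model.denoise_step(latents, step, total_steps)
    if extra_noise_std > 0:
        latents = latents + extra_noise_std * torch.randn_like(latents)
    return latents
```

Running fewer denoising steps is also where the speed win comes from: the dreamer never spends compute rendering a sharp video that the doer doesn't need.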
Okay, so the theory sounds amazing, right? But does it actually work in practice? Let's look at the numbers and see how mimic-video stacks up against the old-school VLA models. First number, get ready for this: 10x. Mimic-video is 10 times more data efficient. Just let that sink in for a second. That's a full order of magnitude better. And that's not all. It also learns way faster, hitting its peak performance twice as fast as the standard models. This chart really drives that 10x point home. The baseline VLA model on the left needed 100% of that expensive robot training data to hit its max performance. Now look at mimic-video on the right: it got to the exact same peak performance using only 10% of that data. That is a massive difference.

But okay, benchmarks are one thing. The real test, the ultimate test, came in the real world. They put this on a seriously complex task: controlling a two-armed robot with incredibly dexterous, multi-fingered hands. These are the kinds of tasks where it's super easy for the robot's own arms to get in the way and block the camera's view. A total nightmare scenario. Now, check this out, because the setup here is what's really fascinating. The baseline model, using just the main workspace camera, only succeeded 30% of the time. Not great. So they gave it a hand, literally, by adding extra cameras on its wrists. That helped, boosting it to about 74% success. But now look at mimic-video. With only the single main camera, less information, it hit a 93% success rate. Just incredible. What this tells us is that its internal, video-based understanding of the physics is so strong, it can basically predict what's happening even when its own arms are blocking the view. It's like it can see through itself to get the job done.

So, what's the big takeaway here? What does this all mean? This isn't just a slightly better model or a small improvement. It represents a fundamental shift, a whole new paradigm for how we should be thinking about training robots. The core idea is simple: we're shifting the burden of learning. Instead of forcing robots to learn physics from a tiny, expensive pool of robot data, we can let them learn from the biggest dataset of physical interaction that has ever existed: literally all the videos on the internet. And the magic word here, the whole point, is scalability. This approach could finally unlock the ability to teach robots incredibly complex skills, everything from fixing a car engine to maybe one day even assisting in surgery, just by letting them binge-watch the massive library of how-to videos that we as humans have already created.

And that leaves us with one final, kind of mind-bending thought to wrap this up. For all of human history, we've been the ones making the instructional videos, right? But if robots can truly learn from our entire collective history of physical knowledge, what new skills, what new insights into the physical world that we've never even thought of, might they one day be able to teach us?