Transcript
O0_R1Cs-ke8 • How Robots Learn from Video: Inside the 1X World Model (1XWM)
Kind: captions • Language: en

Hey, welcome to the explainer. Today we're diving into something that could, and I mean really could, change our relationship with machines forever. We're looking at how the company 1X is teaching its humanoid robots to understand and interact with the real world, not with lines of code, but by having them watch millions of videos on the internet. Now, this isn't just some new trick. It's a completely different way of thinking about robot intelligence. Let's get right into it.

You know, this is the one question that's really had roboticists scratching their heads for decades. It's one thing to program a robot for a super repetitive factory task. I mean, it can weld the same spot on a car door a million times and never miss, but how do you teach it something that seems so simple to us, like wiping a kitchen counter? How does it learn that subtle difference in pressure you need for a stubborn stain versus, you know, just a light spill? Or how to pick up a delicate wine glass it's never seen before and just know that it needs a gentle touch? This intuitive grasp of the world, of cause and effect, that's what we call common sense. And it's really been the final frontier separating clunky machines from truly helpful assistants.

So to really get why this is such a huge deal, we've got to look at the old way versus the new way. The traditional method is something called a vision-language-action model, or a VLA. Basically, you show a robot a picture and give it a text command. The problem? It needs a staggering amount of robot-specific data for every single task. A person has to literally hand-hold the robot through an action thousands of times. It's slow, it's expensive, and it just doesn't scale. The new way, the one 1X is pioneering, is the video world model. Instead of static photos, it learns from motion. It watches millions of human videos to understand not just what something is, but how it moves, how it behaves. It's learning the physics of the world, which lets it apply that knowledge to totally new situations. It's a complete paradigm shift.

All right, so here's the game plan for today's deep dive. First, we're going to really define the robot learning problem and why it has been such a tough nut to crack. Then we'll explore this breakthrough idea of learning from internet videos. After that, we'll get into the secret sauce, the 1XWM training recipe. Then comes the really fun part: we'll see the robot Neo put to the test. Next, we'll investigate the crucial link between what the robot imagines and what it can actually do. And finally, we'll wrap up by looking at the incredible future this kind of technology is unlocking.

Okay, let's really dig into the heart of the problem 1X is trying to solve here. What is it that makes it so unbelievably hard to give a robot a physical understanding of our world? So, that old way we mentioned, it's all based on these things called VLAs. You can think of them as an add-on to the large language models we're all getting used to. You start with a powerful vision-language model, a VLM, which is awesome at looking at a picture and telling you what's in it. It can say, "That's a cup. That's a table." Then you basically bolt on an action module to it. The idea is to take its knowledge of what a cup is and somehow translate that into the actual physical movements needed to pick it up. Yeah, but here's where that whole approach just hits a wall. First off, a VLM can identify a cup, but it has zero understanding of physics.
It doesn't know the cup will fall if it lets go, or that it might be full of hot coffee. So to teach it that stuff, you need a ton of robot-specific data, which is just crazy expensive and time-consuming to get. And that right there is a massive bottleneck. You can't possibly show it every object in every scenario. So what happens? Engineers end up having to write extra code to basically hardcode physics rules, which is clunky and breaks the second the robot sees something new. It's pretty clear this approach has run its course.

So 1X looked at all these limitations and basically said, "Okay, let's flip this whole thing on its head. What if, instead of trying to brute force a robot's understanding with endless demonstrations, you could just let it learn by watching?" And that brings us to the new paradigm: learning from the single biggest data set of physical interactions ever created, the internet. And this quote from their research, this is the light bulb moment. Just think about it for a second. Every single time you watch a cooking tutorial on YouTube or some DIY video on TikTok, you're basically watching a free master class in physics. You see the exact angle you need to pour liquid without spilling. The way a hand just instinctively shapes itself to grab different things. That subtle little twist of the wrist you need to open a stubborn jar. You see that a ball bounces, a feather floats, a glass shatters. The internet is this massive untapped library of physical knowledge just waiting for the right student to come along.

But, and this is the really critical piece that makes this whole thing click, why can 1X's robot Neo learn from these videos when, say, a factory arm or one of those dog-like robots can't? It all comes down to the humanoid advantage. See, because Neo is built to look and move like a person, with similar proportions and similar joints, it can directly map what it sees in human videos onto its own body. To put it simply, its body moves so much like ours that it can actually copy our movements. A robot on wheels can't learn to open a cupboard from watching a person because, well, it doesn't have arms. But Neo does. What a human does in a video, Neo can try to do in the real world.

Okay, so the big idea is genius, right? Learn from videos, use a human-like body. But how does that actually work? I mean, how do you turn pixels from a random internet video into a precise command for a robot's motor? Well, this is where we get into the absolutely brilliant training recipe for the 1X world model. So, they've broken the whole system down into two really elegant parts. The first is the world model, and you can think of this as the robot's imagination. You give it a picture of what it's seeing right now and a command like "open the drawer." The world model gets to work and actually generates a short video, a plausible future of it successfully opening that drawer. It literally imagines what success looks like. But imagination alone doesn't move a robot. That's where the second part comes in: the inverse dynamics model, or IDM. The IDM is like the robot's muscle memory. It watches that imagined video from the world model and translates it, frame by frame, into the exact electrical signals needed to move the robot's joints and make that imagined future a reality. So the world model figures out what to do, and the IDM figures out how to do it.
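To make that division of labor a bit more concrete, here is a minimal Python sketch of the loop just described: imagination first, muscle memory second. Every class name, function name, shape, and number in it is an illustrative assumption, not 1X's actual code or API.

```python
# Hypothetical sketch of the two-part pipeline described above: a world model that
# "imagines" a short video of success, and an inverse dynamics model (IDM) that turns
# consecutive imagined frames into joint commands. Names, shapes, and the 20-joint
# command are illustrative assumptions.
import numpy as np

class WorldModel:
    """Imagination: current camera image + text command -> short video of the imagined future."""
    def generate_plan(self, image: np.ndarray, command: str, num_frames: int = 16) -> np.ndarray:
        # A real world model would run a large video-generation network here;
        # this stub just repeats the input frame so the sketch stays runnable.
        return np.stack([image] * num_frames)

class InverseDynamicsModel:
    """Muscle memory: a pair of consecutive frames -> the action that connects them."""
    def infer_action(self, frame_t: np.ndarray, frame_t1: np.ndarray) -> np.ndarray:
        # A real IDM would regress joint targets from the two frames;
        # here we return a zero command for a hypothetical 20-joint body.
        return np.zeros(20)

def act(image: np.ndarray, command: str, send_to_robot) -> None:
    world_model, idm = WorldModel(), InverseDynamicsModel()
    plan = world_model.generate_plan(image, command)       # 1) imagine what success looks like
    for t in range(len(plan) - 1):
        action = idm.infer_action(plan[t], plan[t + 1])    # 2) recover the motion between frames
        send_to_robot(action)                              # 3) drive the joints toward that motion

if __name__ == "__main__":
    dummy_image = np.zeros((224, 224, 3), dtype=np.uint8)
    act(dummy_image, "open the drawer", send_to_robot=lambda a: None)
```

The design point worth noticing is that the action never comes from the text command directly; it is always recovered from a pair of imagined frames.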
Okay, now this is a really clever little trick they use to make the robot's imagination even better. See, the giant video models it learns from were trained on the internet with really rich descriptive text. So just giving the robot a simple command like "pick up cup" is like trying to paint a masterpiece with only three colors. To fix this, they use another AI for something they call caption upsampling. It takes that simple command and just fleshes it out. "Pick up cup" becomes something way more detailed, like "a robot hand approaches the white ceramic mug from above and to the right, its fingers shaping to gently close around the handle." This richer prompt gives the world model a much clearer target to aim for, which leads to a way more accurate imagined video.

The training itself is like a funnel, right? It starts super broad and gets more and more specific. Stage one is web-scale pre-training. This is where the model just binges millions of internet videos to learn the basic laws of physics: how things fall, how they roll, how they bounce. It's basically learning the grammar of reality. Stage two is egocentric mid-training. This part's key. The model is shown 900 hours of video filmed from a human's point of view. This is where it learns the nitty-gritty of manipulation, how hands grip and twist and interact with stuff up close. And finally, stage three is embodiment fine-tuning. They use a much smaller 70-hour data set of the actual Neo robot to adapt all that general knowledge to its specific body, its cameras, its weight. It's like teaching a world-class guitarist how to play a brand new custom-built instrument.

And this slide right here, this is what really shows you how powerful that training funnel is. Just look at the data they use for that final robot-specific tuning stage. It's almost all just simple pick-and-place tasks. There are no demonstrations of opening doors or tidying up a room or using a watering can. And yet, Neo can do all of those things. That's the crazy part. The robot's ability to do these complex new tasks wasn't learned from its own limited data. It was transferred over from that huge library of knowledge it got from all the internet and human videos. The common sense came from the first two stages, not the last one.
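One piece of that recipe is simple enough to picture in code: the caption upsampling step. In the sketch below, the function name and the template are made up, and the call to a real language model is stubbed out so the example runs on its own; a production system would presumably prompt an LLM instead.

```python
# Hypothetical illustration of caption upsampling: a terse command is expanded into a
# richer, physically grounded description before it is handed to the world model.
# In a real system the expansion would come from a language model; here it is a
# hard-coded template so the example is self-contained.
def upsample_caption(command: str) -> str:
    examples = {
        "pick up cup": (
            "a robot hand approaches the white ceramic mug from above and to the right, "
            "its fingers shaping to gently close around the handle"
        ),
    }
    # Fallback: a generic expansion for commands we have no template for.
    return examples.get(command, f"a robot carefully and deliberately performs the task: {command}")

if __name__ == "__main__":
    print(upsample_caption("pick up cup"))      # the richer prompt the world model actually sees
    print(upsample_caption("open the drawer"))  # generic fallback expansion
```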
All right, so the theory is solid. The training process is super clever, but does it actually work in the real world? Let's see what happens when we move from the training data to reality and really put Neo to the test.

Okay, first up, let's see how it handles something new. Here, Neo is looking at a toy dinosaur, an object it has definitely never seen before. So, on the left, you're seeing its imagination. The world model generates a little video plan of how to approach and grab this weird shape. And on the right, boom, the real robot executes that plan perfectly. It's not just repeating some motion it memorized. It's looking at a new object and creating a successful strategy from scratch. This shows it's not just memorizing. It's actually understanding.

Okay, so that was a new object. What about a totally new behavior? Remember that donut chart? Neo was never, ever trained to water a plant. There's no watering can data. But its world model has seen people do it a thousand times online, so it can generate a plan, an imagined video of the right way to do it. And because of that humanoid advantage, the real Neo can turn that video into action, successfully watering the plant. And that is the absolute magic of this whole system. Knowledge transferred straight from YouTube to a robot's hand.

Now, this next one might be the most impressive demo of them all. That specific robot training data had zero two-handed tasks. I mean, in normal robotics, getting two arms to work together is a huge headache. It usually needs a ton of custom code. Yet here we see the world model imagine a plan for a two-handed task like opening a container, and the real robot just does it. This ability comes entirely from the physical understanding it learned from watching humans use both of their hands together in web videos. It learned this incredibly complex skill just by watching.

But okay, let's get down to brass tacks. How did it do in the numbers game? Well, across 30 trials for each task, the results are really strong. For tasks that are kind of like its training, like grabbing a bag of chips, it's successful 90% of the time. But even for things it was never trained on, like sliding a door or using a watering can, the success rates are surprisingly high. But hey, it's also super important to be real about where it's at right now. For tasks that need really fine motor control, like pouring cereal without spilling or drawing a smiley face, the success rate for now is zero. This probably means they need to improve the physics in the world model or get better sensors in the robot's hand. I mean, that's pretty incredible stuff, right? A clear look at both the power and the current limitations. If you're finding this deep dive into the future of robotics as fascinating as I am, now would be a great time to subscribe to the explainer so you don't miss our future analyses on the tech that's literally shaping our world.

So, we've seen what Neo can do, and we know it imagines a plan first. This leads to a really, really important question. Is there a real, measurable connection between how good that imagined video is and how well the robot actually does the task? Let's just put the question out there, plain and simple. If the world model creates a video that looks more physically correct, more accurate, or just better, does that directly lead to a higher success rate when the actual robot tries to do it? It's a fundamental question about this whole approach. And luckily, 1X ran some really clever experiments to find out.

And here's our first big clue. For the task of pulling a tissue out of a box, if they just let the model come up with one plan and go for it, the success rate was 30%. But then they tried something. What if they let it generate eight different video plans at the same time? Then the robot could just pick the one that looked the most realistic. And by doing just that, the success rate jumped from 30% all the way to 45%. Think about that. A 50% jump in performance just by letting the robot think through a few options and pick the best one. This is a huge sign that a better imagined plan leads to a better real-world result.
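That generate-several-plans-and-pick-one idea is easy to sketch. In the toy version below, the candidate plans are random arrays and the scoring function is a stand-in; in reality the candidates would come from the world model and be ranked by some measure of physical plausibility, which the video doesn't spell out.

```python
# Hypothetical best-of-N plan selection: sample several imagined video plans for the
# same command, score each one, and keep only the most plausible-looking plan.
# The dummy world model and the mean-value "score" are placeholders.
import numpy as np

def sample_and_select(world_model, score_plan, image, command, num_candidates=8):
    candidates = [world_model.generate_plan(image, command) for _ in range(num_candidates)]
    scores = [score_plan(plan) for plan in candidates]   # higher = judged more realistic
    return candidates[int(np.argmax(scores))]

class DummyWorldModel:
    def generate_plan(self, image, command, num_frames=16):
        return np.random.rand(num_frames, 8, 8)          # toy stand-in for a generated video

best_plan = sample_and_select(
    DummyWorldModel(),
    score_plan=lambda plan: float(plan.mean()),          # placeholder realism score
    image=None,
    command="pull a tissue out of the box",
)
print(best_plan.shape)
```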
So to really figure out what the secret sauce is, the researchers did what's called an ablation study. Basically, they started taking ingredients out to see what would happen. Let's start with the most stripped-down version of the model: no first-person human video and no descriptive captions. They had people rate how realistic the generated videos were. For tasks it was familiar with, it was okay, about 35% approval. But for new tasks or totally weird scenarios, the quality dropped off a cliff, down to 20% and 15%. So, this is our baseline.

Okay, now let's add just one of those ingredients back in: the descriptive upsampled captions. The effect was immediate. The human acceptance rate for the videos jumped up across the board. Look at that. A 10-point jump in every single category. The model's imagination got way better just by giving it clearer instructions. Better input leads to better imagined outputs.

And finally, let's look at the full, complete model with everything included. The results here are just dramatic, especially for new tasks and OOD, which means out of distribution. The acceptance rate for new tasks just skyrockets from 30% to 60%. This right here is the smoking gun. It proves that those 900 hours of first-person human video are the absolute key ingredient that teaches the model how to generalize. It's what allows it to accurately imagine how to do things it's never been explicitly shown before.

But this, this is the slide that brings it all home. Does a better video actually lead to a better robot? They tested the different models on a tough task: scrubbing a dish. The weaker models, the ones that made those lower-quality videos, both had a 0% success rate. Total failure. Only the full 1XWM, the one that made the highest-quality videos, was able to succeed at all, hitting a 20% success rate. So, yeah, this pretty much confirms it. A better imagination makes for a better robot. Period.

Okay, so we have seen some stuff that is just absolutely mind-bending. But you know, this tech is still brand new. It's very much in development. And to really give you the full picture, we have to talk about the current limitations and the challenges that are still ahead. And to their credit, the 1X team is super upfront about the challenges they still need to solve. First, speed. Right now, it takes 11 seconds of compute time to generate a 5-second plan. That's fine for some tasks, but for a robot to work smoothly alongside people, that delay needs to shrink a lot. Second, 3D grounding. Since it learns from 2D internet videos, the model can sometimes mess up depth perception, causing the robot's hand to stop just short of an object or go too far. Integrating depth sensors will be a huge next step. And finally, long-term planning. The system plans everything in these little 5-second chunks. To do a bigger task like washing a whole sink of dishes, it's going to need a memory of what it's already done and the ability to replan if something unexpected happens, like dropping the sponge.
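Here is one way that last limitation could be handled in code: a loop that keeps imagining short plans, executing them, and re-observing the world, with a rudimentary memory of what it has already tried. This is purely an assumption about how chunked planning with replanning might look, not anything 1X has described implementing.

```python
# Hypothetical receding-horizon loop: chain short (~5-second) imagined plans into a
# longer task, re-observing after each chunk and replanning when something goes wrong.
# All of the callables are placeholders supplied by the caller.
def run_long_task(goal, imagine, execute, observe, task_done, max_chunks=20):
    history = []                                   # crude memory of what has been attempted
    observation = observe()
    for _ in range(max_chunks):
        plan = imagine(observation, goal, history)     # one short imagined video plan
        succeeded = execute(plan)                      # IDM turns the plan into joint commands
        observation = observe()                        # look at the world again after acting
        history.append((plan, succeeded))
        if task_done(observation, goal):
            return True
        # If the chunk failed (say, the sponge was dropped), the next iteration
        # simply replans from the new observation instead of giving up.
    return False

# Example wiring with trivial stand-ins so the sketch runs end to end.
progress = {"steps": 0}
def fake_done(obs, goal):
    progress["steps"] += 1
    return progress["steps"] >= 3

print(run_long_task("wash the dishes",
                    imagine=lambda obs, goal, hist: "plan",
                    execute=lambda plan: True,
                    observe=lambda: "obs",
                    task_done=fake_done))
```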
But even with those hurdles, the vision for where this is all going, that's what's so incredible. This quote captures the end goal perfectly: a flywheel of self-improvement. Because the robot can try new things based on what it's seen in videos, and because it can watch the results of its own actions, it can start to learn from its own successes and failures. It can explore its world, try stuff out, and get better all on its own, without a human needing to show it everything. This is the leap from a robot that is just taught to a robot that can truly learn.

And that brings us right back to where we started, with that tricky problem of common sense. This video-first approach seems like the most promising path anyone has found yet to actually solving it. It changes the whole game from "how do we collect enough perfect data?" to "how do we build better, more accurate imaginations?" So, I'll leave you with this to think about. What happens when a robot can teach itself to master any task in any home, simply by watching the endless library of human experience on the internet and then just practicing? Yeah, the implications are just staggering to think about. And you can bet we'll be following this very closely.

To make sure you don't miss our next explainer, be sure to subscribe and hit that notification bell. Thanks for joining us on the explainer.