Transcript
O0_R1Cs-ke8 • How Robots Learn from Video: Inside the 1X World Model (1XWM)
Kind: captions • Language: en

Hey, welcome to the explainer. Today we're diving into something that could, and I mean really could, change our relationship with machines forever. We're looking at how the company 1X is teaching its humanoid robots to understand and interact with the real world, not with lines of code, but by having them watch millions of videos on the internet. Now, this isn't just some new trick. It's a completely different way of thinking about robot intelligence. Let's get right into it.

You know, this is the one question that's really had roboticists scratching their heads for decades. It's one thing to program a robot for a super repetitive factory task. I mean, it can weld the same spot on a car door a million times and never miss, but how do you teach it something that seems so simple to us, like wiping a kitchen counter? How does it learn that subtle difference in pressure you need for a stubborn stain versus, you know, just a light spill? Or how to pick up a delicate wine glass it's never seen before and just know that it needs a gentle touch? This intuitive grasp of the world, of cause and effect, that's what we call common sense. And it's really been the final frontier separating clunky machines from truly helpful assistants.

So to really get why this is such a huge deal, we've got to look at the old way versus the new way. The traditional method is something called a vision-language-action model, or a VLA. Basically, you show a robot a picture and give it a text command. The problem? It needs a staggering amount of robot-specific data for every single task. A person has to literally hand-hold the robot through an action thousands of times. It's slow, it's expensive, and it just doesn't scale. The new way, the one 1X is pioneering, is the video world model. Instead of static photos, it learns from motion. It watches millions of human videos to understand not just what something is, but how it moves, how it behaves. It's learning the physics of the world, which lets it apply that knowledge to totally new situations. It's a complete paradigm shift.

All right, so here's the game plan for today's deep dive. First, we're going to really define the robot learning problem and why it has been such a tough nut to crack. Then we'll explore this breakthrough idea of learning from internet videos. After that, we'll get into the secret sauce, the 1XWM training recipe. Then comes the really fun part: we'll see the robot Neo put to the test. Next, we'll investigate the crucial link between what the robot imagines and what it can actually do. And finally, we'll wrap up by looking at the incredible future this kind of technology is unlocking.

Okay, let's really dig into the heart of the problem 1X is trying to solve here. What is it that makes it so unbelievably hard to give a robot a physical understanding of our world? So, that old way we mentioned, it's all based on these things called VLAs. You can think of them as an add-on to the large language models we're all getting used to. You start with a powerful vision-language model, a VLM, which is awesome at looking at a picture and telling you what's in it. It can say, "That's a cup. That's a table." Then you basically bolt on an action module to it. The idea is to take its knowledge of what a cup is and somehow translate that into the actual physical movements needed to pick it up. Yeah, but here's where that whole approach just hits a wall. First off, a VLM can identify a cup, but it has zero understanding of physics.
It doesn't know the cup will fall if it lets go, or that it might be full of hot coffee. So to teach it that stuff, you need a ton of robot-specific data, which is just crazy expensive and time-consuming to get. And that right there is a massive bottleneck. You can't possibly show it every object in every scenario. So what happens? Engineers end up having to write extra code to basically hardcode physics rules, which is clunky and breaks the second the robot sees something new. It's pretty clear this approach has run its course.

So 1X looked at all these limitations and basically said, "Okay, let's flip this whole thing on its head. What if, instead of trying to brute force a robot's understanding with endless demonstrations, you could just let it learn by watching?" And that brings us to the new paradigm: learning from the single biggest data set of physical interactions ever created, the internet. And this quote from their research, this is the light bulb moment. Just think about it for a second. Every single time you watch a cooking tutorial on YouTube or some DIY video on TikTok, you're basically watching a free master class in physics. You see the exact angle you need to pour liquid without spilling. The way a hand just instinctively shapes itself to grab different things. That subtle little twist of the wrist you need to open a stubborn jar. You see that a ball bounces, a feather floats, a glass shatters. The internet is this massive untapped library of physical knowledge just waiting for the right student to come along.

But, and this is the really critical piece that makes this whole thing click, why can 1X's robot Neo learn from these videos when, say, a factory arm or one of those dog-like robots can't? It all comes down to the humanoid advantage. See, because Neo is built to look and move like a person, with similar proportions and similar joints, it can directly map what it sees in human videos onto its own body. To put it simply, its body moves so much like ours that it can actually copy our movements. A robot on wheels can't learn to open a cupboard from watching a person because, well, it doesn't have arms. But Neo does. What a human does in a video, Neo can try to do in the real world.

Okay, so the big idea is genius, right? Learn from videos, use a human-like body. But how does that actually work? I mean, how do you turn pixels from a random internet video into a precise command for a robot's motor? Well, this is where we get into the absolutely brilliant training recipe for the 1X world model. So, they've broken the whole system down into two really elegant parts. The first is the world model, and you can think of this as the robot's imagination. You give it a picture of what it's seeing right now and a command like "open the drawer." The world model gets to work and actually generates a short video, a plausible future of it successfully opening that drawer. It literally imagines what success looks like. But imagination alone doesn't move a robot. That's where the second part comes in: the inverse dynamics model, or IDM. The IDM is like the robot's muscle memory. It watches that imagined video from the world model and translates it, frame by frame, into the exact electrical signals needed to move the robot's joints and make that imagined future a reality. So the world model figures out what to do, and the IDM figures out how to do it.
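To make that division of labor a bit more concrete, here is a minimal Python sketch of the loop just described: imagination first, muscle memory second. Every class name, function name, shape, and number in it is an illustrative assumption, not 1X's actual code or API.

```python
# Hypothetical sketch of the two-part pipeline described above: a world model that
# "imagines" a short video of success, and an inverse dynamics model (IDM) that turns
# consecutive imagined frames into joint commands. Names, shapes, and the 20-joint
# command are illustrative assumptions.
import numpy as np

class WorldModel:
    """Imagination: current camera image + text command -> short video of the imagined future."""
    def generate_plan(self, image: np.ndarray, command: str, num_frames: int = 16) -> np.ndarray:
        # A real world model would run a large video-generation network here;
        # this stub just repeats the input frame so the sketch stays runnable.
        return np.stack([image] * num_frames)

class InverseDynamicsModel:
    """Muscle memory: a pair of consecutive frames -> the action that connects them."""
    def infer_action(self, frame_t: np.ndarray, frame_t1: np.ndarray) -> np.ndarray:
        # A real IDM would regress joint targets from the two frames;
        # here we return a zero command for a hypothetical 20-joint body.
        return np.zeros(20)

def act(image: np.ndarray, command: str, send_to_robot) -> None:
    world_model, idm = WorldModel(), InverseDynamicsModel()
    plan = world_model.generate_plan(image, command)       # 1) imagine what success looks like
    for t in range(len(plan) - 1):
        action = idm.infer_action(plan[t], plan[t + 1])    # 2) recover the motion between frames
        send_to_robot(action)                              # 3) drive the joints toward that motion

if __name__ == "__main__":
    dummy_image = np.zeros((224, 224, 3), dtype=np.uint8)
    act(dummy_image, "open the drawer", send_to_robot=lambda a: None)
```

The design point worth noticing is that the action never comes from the text command directly; it is always recovered from a pair of imagined frames.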
Okay, now this is a really clever little trick they use to make the robot's imagination even better. See, the giant video models it learns from were trained on the internet with really rich descriptive text. So just giving the robot a simple command like "pick up cup" is like trying to paint a masterpiece with only three colors. To fix this, they use another AI for something they call caption upsampling. It takes that simple command and just fleshes it out. "Pick up cup" becomes something way more detailed, like "a robot hand approaches the white ceramic mug from above and to the right, its fingers shaping to gently close around the handle." This richer prompt gives the world model a much clearer target to aim for, which leads to a way more accurate imagined video.

The training itself is like a funnel, right? It starts super broad and gets more and more specific. Stage one is web-scale pre-training. This is where the model just binges millions of internet videos to learn the basic laws of physics: how things fall, how they roll, how they bounce. It's basically learning the grammar of reality. Stage two is egocentric mid-training. This part's key. The model is shown 900 hours of video filmed from a human's point of view. This is where it learns the nitty-gritty of manipulation, how hands grip and twist and interact with stuff up close. And finally, stage three is embodiment fine-tuning. They use a much smaller 70-hour data set of the actual Neo robot to adapt all that general knowledge to its specific body, its cameras, its weight. It's like teaching a world-class guitarist how to play a brand new custom-built instrument.

And this slide right here, this is what really shows you how powerful that training funnel is. Just look at the data they use for that final robot-specific tuning stage. It's almost all just simple pick-and-place tasks. There are no demonstrations of opening doors or tidying up a room or using a watering can. And yet, Neo can do all of those things. That's the crazy part. The robot's ability to do these complex new tasks wasn't learned from its own limited data. It was transferred over from that huge library of knowledge it got from all the internet and human videos. The common sense came from the first two stages, not the last one.
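One piece of that recipe is simple enough to picture in code: the caption upsampling step. In the sketch below, the function name and the template are made up, and the call to a real language model is stubbed out so the example runs on its own; a production system would presumably prompt an LLM instead.

```python
# Hypothetical illustration of caption upsampling: a terse command is expanded into a
# richer, physically grounded description before it is handed to the world model.
# In a real system the expansion would come from a language model; here it is a
# hard-coded template so the example is self-contained.
def upsample_caption(command: str) -> str:
    examples = {
        "pick up cup": (
            "a robot hand approaches the white ceramic mug from above and to the right, "
            "its fingers shaping to gently close around the handle"
        ),
    }
    # Fallback: a generic expansion for commands we have no template for.
    return examples.get(command, f"a robot carefully and deliberately performs the task: {command}")

if __name__ == "__main__":
    print(upsample_caption("pick up cup"))      # the richer prompt the world model actually sees
    print(upsample_caption("open the drawer"))  # generic fallback expansion
```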
All right, so the theory is solid. The training process is super clever, but does it actually work in the real world? Let's see what happens when we move from the training data to reality and really put Neo to the test.

Okay, first up, let's see how it handles something new. Here, Neo is looking at a toy dinosaur, an object it has definitely never seen before. So, on the left, you're seeing its imagination. The world model generates a little video plan of how to approach and grab this weird shape. And on the right, boom, the real robot executes that plan perfectly. It's not just repeating some motion it memorized. It's looking at a new object and creating a successful strategy from scratch. This shows it's not just memorizing. It's actually understanding.

Okay, so that was a new object. What about a totally new behavior? Remember that donut chart? Neo was never, ever trained to water a plant. There's no watering can data. But its world model has seen people do it a thousand times online, so it can generate a plan, an imagined video of the right way to do it. And because of that humanoid advantage, the real Neo can turn that video into action, successfully watering the plant. And that is the absolute magic of this whole system. Knowledge transferred straight from YouTube to a robot's hand.

Now, this next one might be the most impressive demo of them all. That specific robot training data had zero two-handed tasks. I mean, in normal robotics, getting two arms to work together is a huge headache. It usually needs a ton of custom code. Yet here we see the world model imagine a plan for a two-handed task like opening a container, and the real robot just does it. This ability comes entirely from the physical understanding it learned from watching humans use both of their hands together in web videos. It learned this incredibly complex skill just by watching.

But okay, let's get down to brass tacks. How did it do in the numbers game? Well, across 30 trials for each task, the results are really strong. For tasks that are kind of like its training, like grabbing a bag of chips, it's successful 90% of the time. But even for things it was never trained on, like sliding a door or using a watering can, the success rates are surprisingly high. But hey, it's also super important to be real about where it's at right now. For tasks that need really fine motor control, like pouring cereal without spilling or drawing a smiley face, the success rate for now is zero. This probably means they need to improve the physics in the world model or get better sensors in the robot's hand. I mean, that's pretty incredible stuff, right? A clear look at both the power and the current limitations. If you're finding this deep dive into the future of robotics as fascinating as I am, now would be a great time to subscribe to the explainer so you don't miss our future analyses on the tech that's literally shaping our world.

So, we've seen what Neo can do, and we know it imagines a plan first. This leads to a really, really important question. Is there a real, measurable connection between how good that imagined video is and how well the robot actually does the task? Let's just put the question out there, plain and simple. If the world model creates a video that looks more physically correct, more accurate, or just better, does that directly lead to a higher success rate when the actual robot tries to do it? It's a fundamental question about this whole approach. And luckily, 1X ran some really clever experiments to find out.

And here's our first big clue. For the task of pulling a tissue out of a box, if they just let the model come up with one plan and go for it, the success rate was 30%. But then they tried something. What if they let it generate eight different video plans at the same time? Then the robot could just pick the one that looked the most realistic. And by doing just that, the success rate jumped from 30% all the way to 45%. Think about that. A 50% jump in performance just by letting the robot think through a few options and pick the best one. This is a huge sign that a better imagined plan leads to a better real-world result.
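That generate-several-plans-and-pick-one idea is easy to sketch. In the toy version below, the candidate plans are random arrays and the scoring function is a stand-in; in reality the candidates would come from the world model and be ranked by some measure of physical plausibility, which the video doesn't spell out.

```python
# Hypothetical best-of-N plan selection: sample several imagined video plans for the
# same command, score each one, and keep only the most plausible-looking plan.
# The dummy world model and the mean-value "score" are placeholders.
import numpy as np

def sample_and_select(world_model, score_plan, image, command, num_candidates=8):
    candidates = [world_model.generate_plan(image, command) for _ in range(num_candidates)]
    scores = [score_plan(plan) for plan in candidates]   # higher = judged more realistic
    return candidates[int(np.argmax(scores))]

class DummyWorldModel:
    def generate_plan(self, image, command, num_frames=16):
        return np.random.rand(num_frames, 8, 8)          # toy stand-in for a generated video

best_plan = sample_and_select(
    DummyWorldModel(),
    score_plan=lambda plan: float(plan.mean()),          # placeholder realism score
    image=None,
    command="pull a tissue out of the box",
)
print(best_plan.shape)
```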
So to really figure out what the secret sauce is, the researchers did what's called an ablation study. Basically, they started taking ingredients out to see what would happen. Let's start with the most stripped-down version of the model: no first-person human video and no descriptive captions. They had people rate how realistic the generated videos were. For tasks it was familiar with, it was okay, about 35% approval. But for new tasks or totally weird scenarios, the quality dropped off a cliff, down to 20% and 15%. So, this is our baseline.

Okay, now let's add just one of those ingredients back in: the descriptive upsampled captions. The effect was immediate. The human acceptance rate for the videos jumped up across the board. Look at that. A 10-point jump in every single category. The model's imagination got way better just by giving it clearer instructions. Better input leads to better imagined outputs.

And finally, let's look at the full, complete model with everything included. The results here are just dramatic, especially for new tasks and OOD, which means out of distribution. The acceptance rate for new tasks just skyrockets from 30% to 60%. This right here is the smoking gun. It proves that those 900 hours of first-person human video are the absolute key ingredient that teaches the model how to generalize. It's what allows it to accurately imagine how to do things it's never been explicitly shown before.

But this, this is the slide that brings it all home. Does a better video actually lead to a better robot? They tested the different models on a tough task: scrubbing a dish. The weaker models, the ones that made those lower-quality videos, both had a 0% success rate. Total failure. Only the full 1XWM, the one that made the highest-quality videos, was able to succeed at all, hitting a 20% success rate. So, yeah, this pretty much confirms it. A better imagination makes for a better robot. Period.

Okay, so we have seen some stuff that is just absolutely mind-bending. But you know, this tech is still brand new. It's very much in development. And to really give you the full picture, we have to talk about the current limitations and the challenges that are still ahead. And to their credit, the 1X team is super upfront about the challenges they still need to solve. First, speed. Right now, it takes 11 seconds of compute time to generate a 5-second plan. That's fine for some tasks, but for a robot to work smoothly alongside people, that delay needs to shrink a lot. Second, 3D grounding. Since it learns from 2D internet videos, the model can sometimes mess up depth perception, causing the robot's hand to stop just short of an object or go too far. Integrating depth sensors will be a huge next step. And finally, long-term planning. The system plans everything in these little 5-second chunks. To do a bigger task like washing a whole sink of dishes, it's going to need a memory of what it's already done and the ability to replan if something unexpected happens, like dropping the sponge.
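Here is one way that last limitation could be handled in code: a loop that keeps imagining short plans, executing them, and re-observing the world, with a rudimentary memory of what it has already tried. This is purely an assumption about how chunked planning with replanning might look, not anything 1X has described implementing.

```python
# Hypothetical receding-horizon loop: chain short (~5-second) imagined plans into a
# longer task, re-observing after each chunk and replanning when something goes wrong.
# All of the callables are placeholders supplied by the caller.
def run_long_task(goal, imagine, execute, observe, task_done, max_chunks=20):
    history = []                                   # crude memory of what has been attempted
    observation = observe()
    for _ in range(max_chunks):
        plan = imagine(observation, goal, history)     # one short imagined video plan
        succeeded = execute(plan)                      # IDM turns the plan into joint commands
        observation = observe()                        # look at the world again after acting
        history.append((plan, succeeded))
        if task_done(observation, goal):
            return True
        # If the chunk failed (say, the sponge was dropped), the next iteration
        # simply replans from the new observation instead of giving up.
    return False

# Example wiring with trivial stand-ins so the sketch runs end to end.
progress = {"steps": 0}
def fake_done(obs, goal):
    progress["steps"] += 1
    return progress["steps"] >= 3

print(run_long_task("wash the dishes",
                    imagine=lambda obs, goal, hist: "plan",
                    execute=lambda plan: True,
                    observe=lambda: "obs",
                    task_done=fake_done))
```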
But even with those hurdles, the vision for where this is all going, that's what's so incredible. This quote captures the end goal perfectly: a flywheel of self-improvement. Because the robot can try new things based on what it's seen in videos, and because it can watch the results of its own actions, it can start to learn from its own successes and failures. It can explore its world, try stuff out, and get better all on its own, without a human needing to show it everything. This is the leap from a robot that is just taught to a robot that can truly learn.

And that brings us right back to where we started, with that tricky problem of common sense. This video-first approach seems like the most promising path anyone has found yet to actually solving it. It changes the whole game from "how do we collect enough perfect data?" to "how do we build better, more accurate imaginations?" So, I'll leave you with this to think about. What happens when a robot can teach itself to master any task in any home, simply by watching the endless library of human experience on the internet and then just practicing? Yeah, the implications are just staggering to think about. And you can bet we'll be following this very closely.

To make sure you don't miss our next explainer, be sure to subscribe and hit that notification bell. Thanks for joining us on the explainer.