Transcript
GOgsuXFjvzg • The Robot Revolution: Why AI is Giving Robots the Power to Dream
/home/itcorpmy/itcorp.my.id/harry/yt_channel/out/FoundationModelsForRobotics/.shards/text-0001.zst#text/0003_GOgsuXFjvzg.txt
Kind: captions
Language: en
Welcome to the explainer. So, today
we're going to unpack this paradox
that's sitting right at the heart of
modern AI. And trust me, it's a really
fascinating one. Okay, so let's just
dive right in. You've got these digital
AIs that can literally dream up entire
movies from just a text prompt, right?
But then you look at our physical robots
and they can barely pick up a coffee cup
without messing it up. So, what is going
on here? What's causing this bizarre
disconnect between these, you know,
superhuman AI minds and these super
clumsy bodies of our robots? Well, the
answer, it's changing everything we
thought we knew about building
intelligent machines. So, the story
really starts with the main approach
we've been using for years now. And it
basically treats robots like, well, like
reflex agents. They're machines that
react, but they don't really understand.
And this is called the vision language
action model or VLA for short. It's a
really powerful system and it learns by
imitation. I mean, it watches millions
and millions of examples of a task being
done just right, and then it just
connects what it sees to what it's
supposed to do next. It's basically
pattern matching just on a massive,
massive scale. You know, a great way to
think about this is like the System 1
part of our own brains. It's that fast,
intuitive, almost gut feeling kind of
thinking. Like when you catch a ball, you're not sitting there calculating the physics, right? You just react. That is what a VLA is. It's a high-tech reflex engine.
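To make that reflex-engine idea concrete, here's a minimal sketch of what a VLA boils down to: one learned mapping from a camera image plus an instruction straight to the next motor command, trained by imitating demonstrations. The model, layer sizes, and helper names here are invented for illustration, not any particular lab's system.

```python
# Illustrative sketch only: the classes and functions are made up to show the
# shape of a vision-language-action (VLA) policy, not a real robotics API.
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Maps (camera image, text instruction) directly to a motor command."""
    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.vision = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU())
        self.text = nn.Sequential(nn.Linear(512, 256), nn.ReLU())  # assumes a 512-d text embedding
        self.head = nn.Linear(512, action_dim)                     # reflex: one forward pass, one action

    def forward(self, image, instruction_embedding):
        features = torch.cat([self.vision(image), self.text(instruction_embedding)], dim=-1)
        return self.head(features)

def behavior_cloning_step(model, optimizer, image, instruction, expert_action):
    """Imitation learning: nudge the policy toward whatever the demonstration did."""
    predicted = model(image, instruction)
    loss = nn.functional.mse_loss(predicted, expert_action)  # pure pattern matching, no physics
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = TinyVLA()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss = behavior_cloning_step(model, opt,
                             torch.randn(4, 3, 64, 64),   # dummy camera frames
                             torch.randn(4, 512),         # dummy instruction embeddings
                             torch.randn(4, 7))           # dummy expert actions
print("imitation loss:", loss)
```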
Okay, now this slide is really interesting because it shows you the VLA's double-edged sword so clearly. On
the one hand, its big strength is
something called semantic
generalization. Because it's trained on
all that internet data, it can pick up a
SpongeBob toy it's never actually seen in real life just because it gets the idea of SpongeBob. But, and this is a big but, its weakness is physical
novelty. So, let's say you train it to
push a block on a wooden table. Then,
you put that same block on a sheet of
ice. It completely fails. Why? Because
it has no internal concept of friction.
All it knows is the statistical pattern
it saw in its training data. You change
the physics even slightly and the robot
is totally lost. And that critical flaw,
well, it's sparked a full-blown
revolution in robotics. It's forcing
researchers to build a whole new kind of
AI. One that doesn't just act, but
actually stops and imagines first. So,
say hello to the world model. This thing
is a completely different beast. Instead
of just reacting, it builds a little
mini physics engine inside its own
neural network. It's basically an
internal simulator that's constantly
learning the cause and effect rules of
the physical world. And this table just
lays it all out perfectly. The VLA, that's our fast, reflexive System 1 brain. But the generative world model, or GWM, that's our slow, deliberate System 2. It doesn't just react, it predicts.
It doesn't just generalize to new
objects, it generalizes to new physics.
I mean, it's the fundamental difference
between raw instinct and actual reason.
So, how does this thing actually work?
Well, a robot with a world model can
basically play out these little mini
movies of the future right inside its
own head. First, it observes the scene
as it is right now. Then, it imagines a
bunch of different outcomes. What if I push it this way? What if I try to lift it from that angle? Only after it's imagined all those consequences does it pick the best path and, finally, execute the action. Right? And this is
the crucial point. A VLA just acting on
instinct might blindly knock over a
priceless vase because that's what the
pattern told it to do. A world model on
the other hand would simulate that
future, see the vase falling, recognize
that's a bad outcome, and stop itself.
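That observe, imagine, score, act loop can be written down in a few lines. The sketch below is a toy random-shooting planner with made-up state and function names, just to show how a world model gets used: predict the consequence of each candidate action, score the imagined futures, and only then commit to the first step of the best one.

```python
import random

def imagine_next_state(state, action):
    # Stand-in for a learned world model: predicts the consequence of an action.
    # A toy placeholder so the loop runs end to end.
    return {"vase_intact": state["vase_intact"] and action != "sweep_arm_left"}

def score(state):
    # Higher is better; knocking the vase over is a very bad imagined future.
    return 1.0 if state["vase_intact"] else -100.0

def plan(current_state, candidate_actions, horizon=3, samples=20):
    """Pick the first action of the best imagined future (random-shooting planner)."""
    best_plan, best_value = None, float("-inf")
    for _ in range(samples):
        actions = [random.choice(candidate_actions) for _ in range(horizon)]
        state, value = current_state, 0.0
        for a in actions:                      # play the "mini movie" forward
            state = imagine_next_state(state, a)
            value += score(state)
        if value > best_value:
            best_plan, best_value = actions, value
    return best_plan[0]                        # execute only the first step, then re-plan

action = plan({"vase_intact": True}, ["sweep_arm_left", "reach_around", "lift_straight_up"])
print("chosen action:", action)
```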
This ability to predict the future, that
is the absolute foundation for creating
robots that are safe and truly
physically intelligent. Okay, so how are
people actually building this stuff? I
mean, this isn't just theory anymore.
Researchers are actively creating these
AI imaginations and they're taking two
really different, really fascinating
approaches. The first group, they're
teaching robots to dream in pixels. The
whole idea here is pretty intuitive,
right? If an AI can generate a totally
realistic video of what's about to
happen next, kind of like those crazy
generative video models like Sora, then
it must on some level really understand
the underlying physics of the world. And
this leads us to models like UniSim.
Now, instead of engineers trying to
handcode a perfect simulation of
reality, which is practically
impossible, UniSim just learns the
simulation directly from watching
videos. This is a game-changer because it
lets a robot practice new skills inside
its own generated dream world. It turns
this static library of old videos into
an infinite interactive training ground.
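Here's a rough sketch of what that dream-world practice loop looks like, with toy stand-ins for the learned video model and the policy; the class and method names are placeholders, not UniSim's actual interface.

```python
import random

class ToyVideoModel:
    """Stand-in for a learned video world model (illustrative names, not UniSim's API)."""
    def predict_next(self, frame, action):
        # A real model would generate the next video frame; here we just track a scalar "scene".
        return frame + (1 if action == "move_closer" else -1)
    def estimate_progress(self, frame):
        return 1.0 if frame >= 5 else 0.0  # crude "does this look like success?" signal

class ToyPolicy:
    def act(self, frame):
        return random.choice(["move_closer", "back_off"])

def dream_rollout(video_model, policy, first_frame, steps=20):
    """Let the policy practice entirely inside observations the video model imagines."""
    frame, trajectory = first_frame, []
    for _ in range(steps):
        action = policy.act(frame)
        frame = video_model.predict_next(frame, action)   # learned simulator, not hand-coded physics
        trajectory.append((frame, action, video_model.estimate_progress(frame)))
    return trajectory

print(dream_rollout(ToyVideoModel(), ToyPolicy(), first_frame=0)[:3])
```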
But, you know, generating photorealistic
video is incredibly slow and it eats up
a ton of computing power. So, this has
led to a completely different school of
thought, one that basically says dreaming in pixels is a total waste of time. This camp, heavily influenced by thinkers like Yann LeCun over at Meta, argues that a robot doesn't need to predict every single tiny pixel.
I mean, does it really matter what the
exact texture of the wallpaper is? Of
course not. The robot only needs to
understand the abstract concepts, the
relationships that are actually relevant
to the task at hand. And that's how we
get models like V-JEPA. Instead of trying to generate the missing pixels in a video, it just tries to predict an abstract description of what should be there: a compact set of data called an embedding. By focusing only on this core information, it learns the essential physics of a scene way more efficiently. This means it can plan much, much faster because it's not wasting energy rendering all those irrelevant details.
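Here's a hedged sketch of that joint-embedding idea. This is not Meta's actual V-JEPA code; the encoder, predictor, and sizes are made up. The point is that both the current and future frames get compressed into embeddings, and the prediction loss lives entirely in that compact space, never in pixels.

```python
# Rough sketch of latent (embedding) prediction, not Meta's V-JEPA implementation.
import torch
import torch.nn as nn

embed_dim = 128
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, embed_dim))   # frame -> compact embedding
predictor = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                          nn.Linear(embed_dim, embed_dim))                  # embedding now -> embedding later

frame_now = torch.randn(8, 3, 64, 64)     # dummy batch of current frames
frame_later = torch.randn(8, 3, 64, 64)   # dummy batch of future frames

z_now = encoder(frame_now)
with torch.no_grad():                      # target embedding: no gradients, no pixel rendering
    z_later_target = encoder(frame_later)

# The loss is computed in embedding space: the model never has to paint wallpaper
# texture, only to predict the compact description of what the future scene should be.
loss = nn.functional.mse_loss(predictor(z_now), z_later_target)
print("latent prediction loss:", loss.item())
```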
Now, despite all this incredible progress, I mean, in both
pixel dreamers and concept dreamers,
there is one massive hurdle that's
holding these imagining robots back
from, you know, really taking over our
factories and our homes. Speed. It all
comes down to speed. Right now, these
advanced world models run at less than
five hertz. That means they can generate
fewer than five thoughts or frames per
second. Now, to put that in perspective,
a smoothly controlled robot needs to
make decisions at 20, 50, even 100
hertz. Thinking five times a second,
that's just way, way too slow for the
real world. So, the future, it seems,
isn't about choosing one or the other,
the fast reflex robot or the slow
imagining one. It's about building a
single complete mind that actually
combines the best of both worlds. The
ultimate goal here is a two-part brain.
You've got the fast cortex. That's the VLA model. And it's handling all the
high-speed intuitive movements. It just
knows how to grasp a cup. But when it
runs into something new, something
weird, maybe the cup is slippery or it's
a really odd shape, it calls on the slow
frontal lobe, which is the world model.
This part kicks in, asks what if,
simulates a few different options, and
then sends guidance back down to the
fast system. It's the perfect
combination. Reflex guided by reason.
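As a toy illustration of that two-part brain, here's a sketch of a hierarchical control loop using the rough numbers from earlier: a reflex policy ticking at 50 Hz (one decision every 20 ms) that consults a slow, world-model-style planner at 5 Hz (every 200 ms). All names and rates here are illustrative assumptions.

```python
import time

FAST_HZ = 50   # reflex VLA-style policy: one decision every 20 ms
SLOW_HZ = 5    # world-model planner: one "what if" cycle every 200 ms

def fast_reflex(observation, guidance):
    # High-rate, System-1 style control; just follows the current guidance.
    return {"motor_command": guidance}

def slow_imagination(observation):
    # Low-rate, System-2 style planning; in a real system this would roll out a world model.
    return "grasp_more_gently" if observation["cup_is_slippery"] else "grasp_normally"

def control_loop(seconds=0.2):
    observation = {"cup_is_slippery": True}
    guidance = "grasp_normally"
    for step in range(int(seconds * FAST_HZ)):
        if step % (FAST_HZ // SLOW_HZ) == 0:    # every 10 fast ticks, consult the slow brain
            guidance = slow_imagination(observation)
        command = fast_reflex(observation, guidance)
        time.sleep(1 / FAST_HZ)                 # the reflex keeps the 50 Hz rhythm regardless
    return command

print(control_loop())
```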
So, at the end of the day, here's what
we're left with. This shift from simple
imitation to genuine imagination, it
isn't just some technical upgrade. It's
the beginning of a truly new kind of
intelligence. For the last decade, we've
been focused on building the eyes and
the hands of our machines. Now, we're
finally building the imagination. And
that leaves us with a pretty profound
question to think about. What happens
when a robot's imagination becomes even
better than our own?