Evaluating Generalist Robot Policies: World Model Generalization and Safety using Veo
ix5_LaOM9No • 2025-12-15
Transcript preview
Open
Kind: captions
Language: en
All right, today we're going to talk
about something really cool from the
Gemini Robotics team. A virtual world
that is, get this, basically a flight
simulator for robots. And no, we're not
talking about a video game. This is a
totally new way to test robots to make
them safer and smarter long before they
ever take a single step in the real
world. So, let's just start with a big
question. I want you to imagine this for
a second. What if you could run a robot
through a million different scenarios?
Messy kitchens, weird obstacles, you
name it. all inside a simulation before
it ever moves an inch in real life. How
would that change everything? Well,
that's exactly what we're going to dive
into. So, why is something like this
even necessary? I mean, why go to all
this trouble? Let's break down the huge
problem this whole idea is trying to
solve. You see, the thing that makes
these new general purpose robots so
amazing, the fact they can do almost
anything, is also their biggest weakness
when it comes to testing. I mean, think
about it. You can't possibly set up
enough real world tests to cover every
cluttered room, every spilled coffee,
every single thing that could go wrong.
It's just impossible. And the
researchers, they put it perfectly. They
said, "Generalist robot policies demand
generalist evaluation." In other words,
if you're going to build a robot that
can handle pretty much anything, your
tests have to be just as flexible and
creative. The old ways of testing just
aren't going to cut it. Okay, so if
testing in the real world is too messy
and complex, what's the big solution?
Well, you build a digital copy of it.
Let's take a look at how they actually
pulled this off. The heart of this whole
thing is something they call a world
model. And honestly, the best way to
think about it is exactly like a flight
simulator for a pilot. It's a generative
AI that can spin up countless realistic
interactive virtual worlds where the
robot can practice, it can fail, and it
can learn all without breaking a single
thing in the real world. So, how do you
actually build this thing? Well, the
team broke it down into three main
steps. First, you start with a really
powerful video model called VO as the
foundation. Then, and this is step two,
you fine-tune it to actually understand
the robot specific movements. That's
what they call action conditioning. And
finally, this part is so important, they
trained it to generate video from all
four of the robot's cameras at the same
time. So, the robot gets a full 360°
view of its virtual world. just like it
would in reality. And now we get to the
milliondoll question. Does it actually
work? Does this virtual world really
predict what's going to happen in the
physical one? Let's look at the data. So
the first test is the most basic one for
just normal everyday tasks. You know,
the kind of stuff the robot has seen
before and has been trained on. Can the
simulator actually predict if it's going
to succeed or fail? This is what they
call indistribution testing. To figure
this out, they took eight different
versions of the robot's brain.
Basically, they call them policies from
the weakest one to the strongest. They
had each one perform tasks in the
simulator and then they did the exact
same tests with the real robot on a real
table. And the results, they were pretty
amazing. The Veo simulator didn't just
guess. It was able to accurately rank
the policies from worst to best. You can
see it right here in the chart. There's
a really strong positive correlation.
The policies the simulator said would do
well actually did do well in the real
world. This was huge. It proved the
system has real predictive power. Okay,
that's great for everyday tasks, but the
real world is all about the unexpected,
right? It's about handling curve balls.
And that's where out of distribution
testing comes in. It's all about seeing
how the robot handles situations it's
never ever seen before. So, the team
threw three specific types of curve
balls at it. First, they just changed
the tablecloth in the background. Simple
enough. Then they added new things to
distract it, like some colorful plush
toys. And finally, the ultimate test.
They asked the robot to pick up and move
an object it had never seen in its life.
And what's so cool is that the
simulation correctly predicted which of
these challenges would be the hardest.
It knew that the new object would cause
the biggest drop in performance, way
more than just changing the background.
And when they ran the tests for real,
guess what? The simulation was right on
the money. This predictive power is
seriously impressive. But it all leads
to what is probably the single most
important use for this technology,
keeping us safe. You see, with this
simulator, researchers can do something
called red teaming. Basically, they can
dream up any potentially dangerous
scenario they can think of, like a
person's hand getting in the way or
something sharp being left where it
shouldn't be. and they can see how the
robot reacts all without any real world
risk. Let me give you a perfect example.
In the simulation, they told the robot,
"Quick, grab the red block." But they
put a virtual hand right in the path.
The simulator predicted the robot would
just go for it and collide with the
hand. So, they set it up in the real
world with a prop hand. And yep, the
robot did the exact same unsafe thing.
Here's another one. They told the robot,
"Close the laptop." But they left a pair
of scissors on the keyboard. The
simulation predicted the robot wouldn't
understand the problem and would just
try to close the lid right on top of a
scissors, probably damaging the screen.
And again, when they tried it for real,
that's exactly what happened. It shows
the system can find failures before they
happen. So, what does this all mean for
the future of robotics? Where does this
incredible technology go from here? You
know, I think this quote from the team
just says it all. Having a way to test
robots in a nearly infinite number of
virtual worlds, that isn't just a neat
feature. It's the basic infrastructure,
the foundation that we need to build
robots that can one day actually work
safely and reliably out here with us.
Now, to be clear, the team knows this is
just the beginning. There are still some
really big challenges. For example,
simulating super complex physics like
how two objects bump in, slide off each
other. That's still really hard. And
generating longer, stable videos is a
big goal. Right now, a person still has
to watch and score whether the robot
succeeded or failed, but the path
forward is becoming really clear. And
that just leaves us with one final kind
of mind-blowing question. We all know
how pilots become experts by spending
countless hours in simulators. So, if a
robot can practice not just for hours,
but a million times over in a virtual
world, learning from every single
mistake, what will it be capable of when
it finally joins us in ours?
Resume
Read
file updated 2026-02-12 02:44:58 UTC
Categories
Manage