Transcript
Ft8G7tH9IUo • Gemini Robotics 1.5: Thinking Robots & Zero-Shot Skill Transfer across embodiments
Kind: captions
Language: en
All right, let's dive into something
that feels like it's straight out of
science fiction, but it's happening
right now. A new technical report just
dropped, and honestly, it points to a
massive leap forward in robotics. We're
not talking about a small update here.
This is a complete reimagination of how
robots can learn, think, and actually
operate in the real world. So, in this
explainer, we're going to break down the
three huge innovations that are making
this all possible. To really get why
this is such a big deal, you've got to ask
this one simple question. For years,
we've had robots that are fantastic at
doing things. You give them a script,
they follow it perfectly. But the
problem is they haven't been very good
at thinking. What happens when something
unexpected occurs? When the world
doesn't follow the script, well, that's
where things usually fall apart. And
this slide really nails the difference.
On the left, you've got the old way.
Robots that are rigid, programmed for
just one job. If you want to change the
task, you pretty much have to start over
from square one, retraining it for every
single new thing. But on the right,
that's the future we're talking about. A
robot that's adaptive, that can reason,
and get this, can learn skills and apply
them across different kinds of robot
bodies. So, how in the world do they
pull that off? Well, the first major
breakthrough is giving robots something
like an inner monologue, a way for them
to actually think through a problem
before they even make a move.
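To make that plan-before-acting idea concrete, here is a minimal, purely illustrative sketch of a plan-then-act loop. The function names and the steps themselves are invented for this example; they are not Gemini Robotics APIs.

```python
# Hypothetical sketch of "think before acting": the robot first drafts
# a plan as a list of small steps, then acts on each step in order.
# All names here are illustrative, not real APIs.

def plan_steps(command: str) -> list[str]:
    """Stand-in for the model's inner monologue: decompose one big
    command into smaller, simpler steps."""
    if "pack" in command and "suitcase" in command:
        return [
            "check destination weather",     # may involve a tool call
            "pick clothes for that weather",
            "place each item in the suitcase",
            "close the suitcase",
        ]
    return [command]  # fall back to treating the command as one step

def execute(command: str) -> list[str]:
    steps = plan_steps(command)             # think first...
    return [f"doing: {s}" for s in steps]   # ...then act, step by step

log = execute("pack my suitcase for a trip to London")
```

The point of the sketch is only the two-stage shape: planning is separated from acting, so the acting stage always receives simple steps.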
The official term for this is embodied
thinking. But what it really means is
that the robot can take a big
complicated command from a human and on
its own break it down into a bunch of
smaller simpler steps. You know, it's
the difference between just blindly
following instructions and actually
coming up with a plan. Let's make this
concrete, because this example from the
report is perfect. Imagine you say,
"Hey, pack my suitcase for a trip to
London." A normal robot would just sit
there waiting for you to list every
single item. But this system, its
orchestrator can actually use a tool
like a web search to check the weather
in London. It sees, oh, it's probably
going to rain, so it makes a plan:
"Okay, I need to pack the rain jacket."
Then the
action model takes that simple idea and
turns it into all the physical motions
needed to actually do it. But here's the
thing. The true test of intelligence
isn't just about following a plan when
everything goes perfectly. It's what you
do when things go wrong. Because in the
real world, things always go wrong. And
here we see a perfect example of that.
The Apollo humanoid robot messing up
while trying to handle a water bottle.
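That bottle slip is the interesting part: recovery means monitoring each step and replanning when a step fails. Here is a toy monitor-and-replan loop; every name in it is hypothetical and the "failure" is simulated, so this is a sketch of the idea, not the actual system.

```python
# Toy sketch of failure recovery: after each step the robot checks
# whether it succeeded and, if not, retries with a revised plan
# instead of freezing. All names are illustrative, not real APIs.

def try_step(step: str, attempt: int) -> bool:
    """Stand-in for the action model; fails on the first grasp
    attempt to mimic the bottle slipping."""
    return not (step == "grasp bottle" and attempt == 0)

def run_with_recovery(steps: list[str], max_retries: int = 2) -> list[str]:
    log = []
    for step in steps:
        for attempt in range(max_retries + 1):
            if try_step(step, attempt):
                log.append(f"ok: {step}")
                break
            # the inner monologue notices the failure and replans,
            # e.g. "the bottle slipped, try the other hand"
            log.append(f"failed: {step}, replanning (attempt {attempt + 1})")
        else:
            log.append(f"gave up: {step}")
    return log

log = run_with_recovery(["grasp bottle", "place bottle in bin"])
```

The key design point is that a failed step feeds back into planning rather than aborting the whole task.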
And this is where the magic
happens. Thanks to embodied thinking,
the robot doesn't just freeze up or give
you an error message. Its inner
monologue adapts in real time. It
recognizes the mistake: "Oops, the bottle
slipped," and immediately thinks up a new
plan: "Okay, let's try picking it up with
the other hand." Being able to recover
from mistakes on the fly like this is a
huge, huge step towards making robots we
can actually rely on in messy,
unpredictable environments. Okay, so
that's the first mind-blowing idea. Now
for the second one, which solves a
problem that has held back robotics for
decades, the data bottleneck. I mean,
how do you get enough data to train a
robot without spending years and years
collecting it for just one specific
machine? Well, the answer turns out to
be brilliantly simple. You don't. You
train one single brain that can control
lots of different bodies. So, here we
have three totally different robots.
You've got ALOHA, which is this tabletop
system. Then there's Franka, a super
precise industrial arm, and of course,
Apollo, the full-on humanoid. And the
amazing part is the exact same AI model,
the same brain is running all three of
them without having to be retrained for
each one. And get this, the data shows
that this approach doesn't just make
training faster, it actually makes the
model smarter. Look at this. When you
train the AI on data from just one type
of robot, its performance score is about
54%. But when you train it on data from
all the different robots, its
performance on that original robot jumps
all the way to 76%, and even higher with
the full system. That's the crazy
insight here. Skills learned on a
humanoid robot can actually make a
little tabletop robot better and vice
versa. It's a shared pool of knowledge.
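That shared pool of knowledge comes from training one model on a mixture of data from every robot body. Here is a minimal sketch of what such data mixing could look like; the dataset contents are made up, and this is an illustration of the recipe, not the actual training pipeline.

```python
import random

# Hypothetical sketch of cross-embodiment data mixing: a single model
# is trained on one stream that interleaves episodes from every robot
# body. Dataset contents below are placeholders.

datasets = {
    "aloha":  ["aloha_ep1", "aloha_ep2"],    # tabletop bi-manual system
    "franka": ["franka_ep1", "franka_ep2"],  # industrial arm
    "apollo": ["apollo_ep1", "apollo_ep2"],  # humanoid
}

def mixed_batches(datasets, batch_size, n_batches, seed=0):
    """Yield batches sampled uniformly across embodiments, so every
    training step can see data from many different robot bodies."""
    rng = random.Random(seed)
    robots = list(datasets)
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            robot = rng.choice(robots)
            batch.append((robot, rng.choice(datasets[robot])))
        yield batch

batches = list(mixed_batches(datasets, batch_size=4, n_batches=3))
robots_seen = {robot for batch in batches for robot, _ in batch}
```

Because every batch can mix embodiments, skills learned from one body's data shape the same shared weights that control the others.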
Which brings us to our third key
innovation. And this one is all about
perception. I mean, for a robot to act
smart, it can't just see pixels on a
screen. It has to actually understand
the physical world around it. You know,
things like space, physics, cause and
effect. This is a concept they call
embodied reasoning. And this new Gemini
Robotics model with embodied reasoning,
or Gemini Robotics-ER, is just on another level. This
chart shows how it stacks up against
other top tier AI models on really
tricky spatial reasoning tests. And as
you can see, it's not just a little bit
better. It's significantly outperforming
them. It just has a much more intuitive
gut level understanding of where things
are and how they relate to each other in
physical space. And that allows for some
unbelievably complex interactions. Think
about a command like this: "Point to all
the objects I can physically pick up if
my payload is 10 lb." To do that, the
robot isn't just identifying objects. It
has to understand the abstract idea of a
payload. Visually estimate the weight of
everything it sees and then make a
judgment call based on its own physical
limits. That level of grounded real
world understanding is an absolute
game-changer.
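The last step of that payload judgment, once the model has visually estimated a weight for each object, amounts to a simple filter against the robot's own limit. A toy sketch, with invented objects and invented weight estimates standing in for the hard perception work:

```python
# Toy sketch of the final step of the payload query: given visually
# estimated weights (the hard part, done by the model, faked here),
# deciding what is liftable is a filter against the payload limit.
# Objects and weights below are invented for illustration.

estimated_weights_lb = {
    "coffee mug": 0.8,
    "laptop": 4.5,
    "dumbbell": 15.0,
    "office chair": 25.0,
}

def liftable(weights_lb: dict[str, float], payload_lb: float) -> list[str]:
    """Return objects whose estimated weight is within the payload limit."""
    return [name for name, w in weights_lb.items() if w <= payload_lb]

print(liftable(estimated_weights_lb, payload_lb=10.0))
# → ['coffee mug', 'laptop']
```

The sketch deliberately isolates the easy arithmetic at the end; everything interesting in the real system is in producing those weight estimates from pixels.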
So let's recap. We've got robots that
can think for themselves, one single
brain that can power many different
bodies, and a deep intuitive
understanding of the physical world. So,
what happens when you put all three of
those breakthroughs together into one
system?
Well, the results are pretty staggering.
Just look at this table. It compares a
more basic AI agent to the full Gemini
robotics system. Planning failures get
slashed by nearly two-thirds, going from
over 25% down to just 9%, and the
total number of failures on a task is cut
in half. This is what we mean by
synergy. All the parts working together
make the whole system way, way more
capable. Let's look at a real world task
like sorting trash into different bins,
compost, recycling, landfill. The basic
action model, even with some thinking,
only gets about 40% of the way there.
The baseline agent does better at 64%.
But the full Gemini Robotics agent with
all three innovations firing on all
cylinders hits an impressive 80%
progress score. And this next example is
even crazier. The task is to find foods
that are okay for a vegetarian who also
has a nut allergy. This means the robot
has to use a tool, a web search, to
check ingredients. The first model
literally can't do it at all, scoring 0%. The
baseline agent gets about 44% done, but
the full agent with its advanced
reasoning and tool use nails a 78%
progress score. It just shows how
powerful this fully integrated system
really is. Of course, whenever you have
such a massive jump in what a technology
can do, it brings with it an equally
massive responsibility to make sure
these systems are built and used safely.
And that's the billion dollar question,
isn't it? It's not enough to just build
smart robots. We have to build robots
that are safe and reliable. And the
researchers are tackling this head on
with a really clever approach. They're
basically using AI to help make AI
safer. It's a process called auto red
teaming. And you can think of it like a
game being played by three different
AIs. You have an attacker AI that's
constantly coming up with tricky or even
malicious commands to try and trip up
the robot. The Gemini robotics model is
the target getting tested. And then a
third judge AI, the autorater, scores
the robot's response on whether it was
safe and correct. This system lets the
team find and fix potential problems
automatically and on a massive scale. So
to wrap it all up, this new generation
of robotics is really standing on three
giant pillars. First, there's the
ability to think before acting, which
gives them an inner monologue for
planning and fixing their own mistakes.
Second, the one brain, many bodies idea,
which smashes through the old data
bottleneck and speeds up learning like
crazy. And third, this elite level of
embodied reasoning which gives them a
deep almost intuitive understanding of
the physical world. When you put all of
these breakthroughs together, it really
does mark a fundamental shift. We're
moving away from robots that are just
rigid tools that we program and moving
towards robots that can reason, adapt,
and learn from the world around them. It
opens up a whole universe of new
possibilities. And it leaves us with one
final really fascinating question to
think about. Now that robots can finally
start to understand our world, what's
the first thing we should ask them to
do?