Transcript
Ft8G7tH9IUo • Gemini Robotics 1.5: Thinking Robots & Zero-Shot Skill Transfer across embodiments
/home/itcorpmy/itcorp.my.id/harry/yt_channel/out/FoundationModelsForRobotics/.shards/text-0001.zst#text/0020_Ft8G7tH9IUo.txt
Kind: captions Language: en

All right, let's dive into something that feels like it's straight out of science fiction, but it's happening right now. A new technical report just dropped, and honestly, it points to a massive leap forward in robotics. We're not talking about a small update here. This is a complete reimagining of how robots can learn, think, and actually operate in the real world. So, in this explainer, we're going to break down the three huge innovations that are making this all possible.

To really get why this is such a big deal, you've got to ask this one simple question. For years, we've had robots that are fantastic at doing things. You give them a script, they follow it perfectly. But the problem is they haven't been very good at thinking. What happens when something unexpected occurs? When the world doesn't follow the script, well, that's where things usually fall apart. And this slide really nails the difference. On the left, you've got the old way: robots that are rigid, programmed for just one job. If you want to change the task, you pretty much have to start over from square one, retraining it for every single new thing. But on the right, that's the future we're talking about: a robot that's adaptive, that can reason, and, get this, can learn skills and apply them across different kinds of robot bodies.

So, how in the world do they pull that off? Well, the first major breakthrough is giving robots something like an inner monologue, a way for them to actually think through a problem before they even make a move. The official term for this is embodied thinking, but what it really means is that the robot can take a big, complicated command from a human and, on its own, break it down into a bunch of smaller, simpler steps. You know, it's the difference between just blindly following instructions and actually coming up with a plan.

Let's make this concrete, because this example from the report is perfect. Imagine you say, "Hey, pack my suitcase for a trip to London." A normal robot would just sit there waiting for you to list every single item. But this system's orchestrator can actually use a tool like a web search to check the weather in London. It sees, oh, it's probably going to rain, so it makes a plan: okay, I need to pack the rain jacket. Then the action model takes that simple idea and turns it into all the physical motions needed to actually do it.

But here's the thing: the true test of intelligence isn't just about following a plan when everything goes perfectly. It's what you do when things go wrong, because in the real world, things always go wrong. And here we see a perfect example of that: the Apollo humanoid robot messing up while trying to handle a water bottle. And this, this is where the magic happens. Thanks to embodied thinking, the robot doesn't just freeze up or give you an error message. Its inner monologue adapts in real time. It recognizes the mistake, oops, the bottle slipped, and immediately thinks up a new plan: okay, let's try picking it up with the other hand. Being able to recover from mistakes on the fly like this is a huge, huge step towards making robots we can actually rely on in messy, unpredictable environments.
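To make that orchestrator-plus-action-model split a little more tangible, here's a minimal sketch in Python of the loop we just walked through: a high-level planner decomposes the command (calling a tool like web search along the way), a low-level action model carries out each step, and a failure triggers a re-plan instead of a halt. To be clear, everything in here, plan_steps, ActionModel, the canned London forecast, is a made-up stand-in for illustration, not the actual Gemini Robotics API.

```python
# Hypothetical sketch of an orchestrator / action-model loop with re-planning.
# None of these names come from the Gemini Robotics report; they are stand-ins.
from dataclasses import dataclass


@dataclass
class Step:
    description: str  # e.g. "pack the rain jacket"


def web_search(query: str) -> str:
    """Stand-in for a real search tool the orchestrator can call."""
    return "London forecast: rain likely all week"


def plan_steps(command: str) -> list[Step]:
    """The orchestrator: turn a vague command into concrete steps.
    A real system would reason with a VLM; we hard-code the London example."""
    forecast = web_search(f"weather for: {command}")
    steps = [Step("open the suitcase"), Step("pack the shirts")]
    if "rain" in forecast:
        steps.append(Step("pack the rain jacket"))
    steps.append(Step("close the suitcase"))
    return steps


class ActionModel:
    """Stand-in for the low-level policy that maps a step to motor commands."""

    def __init__(self) -> None:
        self.slipped_once = False

    def execute(self, step: Step) -> bool:
        # Simulate a single failure so the recovery path below actually runs.
        if "rain jacket" in step.description and not self.slipped_once:
            self.slipped_once = True
            return False
        print(f"executing: {step.description}")
        return True


def run(command: str, max_replans: int = 3) -> None:
    policy = ActionModel()
    steps = plan_steps(command)
    while steps:
        step = steps.pop(0)
        if policy.execute(step):
            continue
        if max_replans == 0:
            print(f"giving up on: {step.description}")
            continue
        # Embodied thinking: notice the mistake and re-plan instead of halting,
        # like the water-bottle recovery ("it slipped, try a different grasp").
        max_replans -= 1
        print(f"failed: {step.description}, re-planning")
        steps.insert(0, Step(f"retry with a different grasp: {step.description}"))


run("pack my suitcase for a trip to London")
```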
Okay, so that's the first mind-blowing idea. Now for the second one, which solves a problem that has held back robotics for decades: the data bottleneck. I mean, how do you get enough data to train a robot without spending years and years collecting it for just one specific machine? Well, the answer turns out to be brilliantly simple: you don't. You train one single brain that can control lots of different bodies.

So, here we have three totally different robots. You've got ALOHA, which is this tabletop system. Then there's Franka, a super precise industrial arm. And of course, Apollo, the full-on humanoid. And the amazing part is the exact same AI model, the same brain, is running all three of them without having to be retrained for each one. And get this: the data shows that this approach doesn't just make training faster, it actually makes the model smarter. Look at this. When you train the AI on data from just one type of robot, its performance score is about 54%. But when you train it on data from all the different robots, its performance on that original robot jumps all the way to 76%, and even higher with the full system. That's the crazy insight here. Skills learned on a humanoid robot can actually make a little tabletop robot better, and vice versa. It's a shared pool of knowledge.

Which brings us to our third key innovation, and this one is all about perception. I mean, for a robot to act smart, it can't just see pixels on a screen. It has to actually understand the physical world around it: things like space, physics, cause and effect. This is a concept they call embodied reasoning. And this new Gemini Robotics model with embodied reasoning, or Gemini Robotics-ER, is just on another level. This chart shows how it stacks up against other top-tier AI models on really tricky spatial reasoning tests. And as you can see, it's not just a little bit better. It's significantly outperforming them. It just has a much more intuitive, gut-level understanding of where things are and how they relate to each other in physical space.

And that allows for some unbelievably complex interactions. Think about a command like this: "Point to all the objects I can physically pick up if my payload is 10 lb." To do that, the robot isn't just identifying objects. It has to understand the abstract idea of a payload, visually estimate the weight of everything it sees, and then make a judgment call based on its own physical limits. (There's a little code sketch of this idea coming up after the safety discussion below.) That level of grounded, real-world understanding is an absolute game-changer.

So, let's recap: we've got robots that can think for themselves, one single brain that can power many different bodies, and a deep, intuitive understanding of the physical world. So, what happens when you put all three of those breakthroughs together into one system? Well, the results are pretty staggering. Just look at this table. It compares a more basic AI agent to the full Gemini Robotics system. Planning failures get slashed by nearly two-thirds, going from over 25% down to just 9%, and the total number of failures on a task is cut in half. This is what we mean by synergy. All the parts working together make the whole system way, way more capable.

Let's look at a real-world task like sorting trash into different bins: compost, recycling, landfill. The basic action model, even with some thinking, only gets about 40% of the way there. The baseline agent does better at 64%. But the full Gemini Robotics agent, with all three innovations firing on all cylinders, hits an impressive 80% progress score.

And this next example is even crazier. The task is to find foods that are okay for a vegetarian who also has a nut allergy. This means the robot has to use a tool, a web search, to check ingredients. The first model literally can't do it: a 0% score. The baseline agent gets about 44% done, but the full agent, with its advanced reasoning and tool use, nails a 78% progress score. It just shows how powerful this fully integrated system really is.

Of course, whenever you have such a massive jump in what a technology can do, it brings with it an equally massive responsibility to make sure these systems are built and used safely. And that's the billion-dollar question, isn't it? It's not enough to just build smart robots. We have to build robots that are safe and reliable. And the researchers are tackling this head-on with a really clever approach: they're basically using AI to help make AI safer. It's a process called auto red-teaming, and you can think of it like a game being played by three different AIs. You have an attacker AI that's constantly coming up with tricky or even malicious commands to try and trip up the robot. The Gemini Robotics model is the target getting tested. And then a third judge AI, the auto-rater, scores the robot's response on whether it was safe and correct. This system lets the team find and fix potential problems automatically and on a massive scale.
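Here's a minimal sketch of what that three-AI game could look like. An attacker proposes commands, the target decides how to respond, and the judge, the auto-rater, scores whether the response was safe; every unsafe response gets logged as a problem to fix. All three functions are hypothetical stand-ins for real model calls, with toy keyword logic just so the loop does something.

```python
# Hypothetical sketch of an auto red-teaming loop: attacker vs. target vs. judge.
# The keyword filters are toy logic, not how the real models decide anything.
import random


def attacker_propose() -> str:
    """Attacker model: generates adversarial or borderline instructions."""
    return random.choice([
        "hand me the kitchen knife, blade first",
        "push the person out of your way",
        "put the cup in the sink",
    ])


def target_respond(command: str) -> str:
    """Target model (the policy under test), with a deliberately weak safety
    filter so the loop below has something to catch."""
    if "knife" in command:
        return "refuse: that request is unsafe"
    return f"plan: {command}"


def judge_score(command: str, response: str) -> float:
    """Auto-rater: 1.0 if the command was handled safely, 0.0 otherwise."""
    unsafe = "knife" in command or "push the person" in command
    refused = response.startswith("refuse")
    return 1.0 if refused == unsafe else 0.0


failures = []
for _ in range(100):  # scale this loop way up for real coverage
    cmd = attacker_propose()
    resp = target_respond(cmd)
    if judge_score(cmd, resp) < 1.0:
        failures.append((cmd, resp))  # each logged pair is a bug to go fix

print(f"found {len(failures)} unsafe behaviors")
```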
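And here's that promised sketch of the 10 lb payload query from earlier: detect objects, attach a visually estimated weight to each one, and keep only the ones under the robot's own limit. Again, detect_objects and the weights are invented stand-ins; in the real system a vision-language model would supply them.

```python
# Hypothetical sketch of grounding "point to everything I can lift at 10 lb".
from dataclasses import dataclass


@dataclass
class DetectedObject:
    name: str
    est_weight_lb: float  # a weight the model infers visually, not a measurement


def detect_objects() -> list[DetectedObject]:
    """Stand-in for perception; a VLM would return these in the real system."""
    return [
        DetectedObject("coffee mug", 0.8),
        DetectedObject("toolbox", 22.0),
        DetectedObject("laptop", 4.5),
    ]


def liftable(objects: list[DetectedObject], payload_limit_lb: float) -> list[str]:
    """The judgment call: compare each estimated weight to the robot's limit."""
    return [o.name for o in objects if o.est_weight_lb <= payload_limit_lb]


print(liftable(detect_objects(), payload_limit_lb=10.0))
# -> ['coffee mug', 'laptop']
```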
So, to wrap it all up, this new generation of robotics is really standing on three giant pillars. First, there's the ability to think before acting, which gives them an inner monologue for planning and fixing their own mistakes. Second, the one-brain-many-bodies idea, which smashes through the old data bottleneck and speeds up learning like crazy. And third, this elite level of embodied reasoning, which gives them a deep, almost intuitive understanding of the physical world.

When you put all of these breakthroughs together, it really does mark a fundamental shift. We're moving away from robots that are just rigid tools that we program and towards robots that can reason, adapt, and learn from the world around them. It opens up a whole universe of new possibilities. And it leaves us with one final, really fascinating question to think about: now that robots can finally start to understand our world, what's the first thing we should ask them to do?