FiS-VLA: Unifying Fast Robotic Manipulation with Slow VLM Reasoning (117Hz Control!)
PaukLTEnm5k • 2025-12-11
All right, let's get right into it. We've all seen those AIs that seem unbelievably smart, right? But then you try to put that brain into a physical robot and things just get, well, clumsy. Today, we're going to break down a new model called Fast-in-Slow, and it's all about building a single unified robot brain that's both brilliant and quick on its feet.
So, why is this even a problem? Why are the smartest robots so often the slowest? It really boils down to a fundamental trade-off. The giant AI models that can understand complex commands, the brains of the operation, need a lot of processing power. They literally have to stop and think. And that creates a lag, an awkward pause between thought and action that just doesn't work in the real world.
And this is the robot's dilemma, perfectly laid out. For years, engineers were stuck with a choice. Do you want a robot that's a deep thinker, one that can plan complex tasks but takes forever to do anything? Or do you want one that's got lightning-fast reflexes but, you know, isn't the sharpest tool in the shed? Getting both at the same time has been the holy grail in robotics.
So the first crack at solving this problem was basically to divide the brain, and the inspiration for this came from a pretty famous idea in human psychology: Daniel Kahneman's dual-system theory. You might know it from his book Thinking, Fast and Slow. The idea is that our own minds have two modes. System 1 is the fast, automatic, gut-reaction part of you. System 2 is the slow, logical, effortful part that reasons things out. It's the difference between instinctively ducking when a ball flies at your head and sitting down to solve a math problem.
So, the old way of building robots tried to copy this literally. They'd take two totally separate AI models, a big powerful one for the slow thinking and a small lightweight one for the fast actions.
And they basically just bolted them together and hoped for the best. But this created a huge bottleneck. Here's how it worked. The big System 2 brain, usually a massive vision-language model, or VLM, would analyze the situation. Then it would pass a summary, a set of instructions, over to the little System 1 brain, which would then generate the action. See the problem? The fast system was completely cut off from all the rich knowledge and context in the main brain. It was acting on secondhand information, which really held it back.
And that is what brings us to the breakthrough. Instead of two separate brains clumsily bolted together, the Fast-in-Slow model introduces a single unified brain. And this completely changes the game. The secret sauce is right here in this quote from the research paper. The model is called FiS-VLA, by the way. Instead of adding a whole separate model, the researchers did something really clever. They took the final few layers of the existing big brain and just repurposed them. Those last few layers become the fast, reflexive System 1, while the entire model is still there to act as the slow, reasoning System 2.
And this is just such an elegant idea. It's a fast system that lives inside the slow system. They aren't two different things anymore. They're two parts of a whole, sharing the same knowledge and the same structure. This allows for seamless coordination between deep thought and quick reflexes.
And here's how they work together. The big slow system looks at the big picture, 2D images and language commands, but it does this at a lower speed. It's the strategist. Meanwhile, the little fast system takes that strategic guidance, but it also processes a ton of real-time, high-frequency data like the robot's joint positions and 3D sensor info. And the key is that they run asynchronously, at different speeds, which makes the whole thing incredibly efficient.
Okay, so the theory sounds amazing, right?
It's elegant. It makes sense. But the real question is, does it actually work? Let's check out the data and see how this new model stacks up against what came before.
Well, in simulations, the answer is a definite yes. Just look at this chart. FiS-VLA hits a 69% average success rate. That's a full 8% better than the previous state of the art, CogACT. And compared to another leading model, it's a 14% jump. That's not a small improvement. That's a really significant leap.
But here's where it gets really impressive. In the messy, chaotic, unpredictable real world, FiS-VLA showed an average success rate improvement of 11% across a bunch of tough tasks. Look, making something work in a clean simulation is one thing. Getting this kind of boost in reality? That is a huge deal.
And remember, it's not just more accurate, it's way faster. This model runs at a control frequency of nearly 22 hertz. That means it's making almost 22 decisions every single second. That's more than double the speed of some of the older methods. It really did break that old trade-off between being smart and being fast.
And when you look at specific, really tricky tasks, you can see the difference it makes. Take folding a towel. That's incredibly hard for a robot because a towel is a deformable object. It's floppy and unpredictable. The old model succeeded 40% of the time. FiS-VLA: 60%. That's the kind of complex, delicate work this unified brain makes possible.
So, we've seen the design, we've seen the impressive results, but let's zoom out. What does this all really mean for the future of robotics? I think there are three really big takeaways here. First, this unified mind is just a smarter, more elegant way to design a robot. Second, it proves you don't have to choose between speed and smarts anymore. You can actually have both. And third, because the whole system shares the same brain, it gets much better at generalizing.
You know, handling new objects it's never seen, or dealing with a cluttered room or bad lighting, just like you and I do every day.
When you get right down to it, this isn't just another small improvement. It's a really foundational step towards creating robots that can finally leave the sterile lab or the predictable factory floor. It's about building machines that aren't just intelligent thinkers, but are also coordinated, responsive doers out in our world.
And that kind of leaves us with this one big, fascinating question to think about. If a robot's mind and body can finally, truly work together in perfect harmony, moving from slow, careful thought to instant reflexive action without a hitch, what are the next big challenges they're going to solve?
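The fast-inside-slow idea described in the transcript (one shared stack of layers, with the last few reused as the fast path, and the two paths running at different rates) can be illustrated with a toy sketch. This is not the FiS-VLA implementation: the class `FastInSlowToy`, the helper `make_layer`, the `slow_every` ratio, and the choice to fuse the cached latent with the robot state by simple addition are all hypothetical stand-ins, and the two rates are simulated in a single loop rather than with real concurrency.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float) -> Layer:
    """Toy stand-in for one network block: multiplies each feature by `scale`."""
    def layer(x: Vector) -> Vector:
        return [scale * v for v in x]
    return layer


@dataclass
class FastInSlowToy:
    layers: List[Layer]    # one shared stack of layers
    num_fast_layers: int   # the last few layers double as the fast path
    slow_every: int        # hypothetical: fast steps per slow refresh

    def slow_pass(self, observation: Vector) -> Vector:
        """Slow path: run the early layers on low-rate image/language features
        and return the hidden state that feeds the final (fast) layers."""
        split = len(self.layers) - self.num_fast_layers
        hidden = observation
        for layer in self.layers[:split]:
            hidden = layer(hidden)
        return hidden

    def fast_pass(self, latent: Vector, robot_state: Vector) -> Vector:
        """Fast path: combine the cached latent with fresh robot state
        (by addition, an assumption) and run only the final layers."""
        split = len(self.layers) - self.num_fast_layers
        hidden = [l + s for l, s in zip(latent, robot_state)]
        for layer in self.layers[split:]:
            hidden = layer(hidden)
        return hidden


def run_control_loop(
    model: FastInSlowToy,
    observations: List[Vector],
    robot_states: List[Vector],
) -> Tuple[List[Vector], int]:
    """Emit one action per step; refresh the slow latent only every
    `slow_every` steps and reuse the cached latent in between."""
    actions: List[Vector] = []
    latent: Vector = []
    slow_calls = 0
    for step, (obs, state) in enumerate(zip(observations, robot_states)):
        if step % model.slow_every == 0:
            latent = model.slow_pass(obs)
            slow_calls += 1
        actions.append(model.fast_pass(latent, state))
    return actions, slow_calls


if __name__ == "__main__":
    toy = FastInSlowToy(
        layers=[make_layer(2.0), make_layer(3.0), make_layer(0.5)],
        num_fast_layers=1,
        slow_every=4,
    )
    acts, calls = run_control_loop(toy, [[1.0, 1.0]] * 8, [[2.0, 4.0]] * 8)
    print(f"{len(acts)} actions from {calls} slow passes")
```

The point of the sketch is the scheduling: every step produces an action through the short fast path, while the expensive early layers run only occasionally and their result is reused in between.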