FiS-VLA: Unifying Fast Robotic Manipulation with Slow VLM Reasoning (117Hz Control!)

PaukLTEnm5k • 2025-12-11

FoundationModelsForRobotics YouTube Transcript

Transcript preview

All right, let's get right into it.
We've all seen those AIs that seem
unbelievably smart, right? But then you
try to put that brain into a physical
robot and things just get, well, clumsy.
Today, we're going to break down a new
model called Fast-in-Slow, and it's all
about building a single unified robot
brain that's both brilliant and quick on
its feet. So, why is this even a
problem? Why are the smartest robots so
often the slowest? It really boils down
to this fundamental trade-off. The giant
AI models that can understand complex
commands, the brains of the operation,
they need a lot of processing power.
They literally have to stop and think.
And that creates this lag, this awkward
pause between thought and action that
just doesn't work in the real world. And
this is the robot's dilemma perfectly
laid out. For years, engineers were
stuck with a choice. Do you want a robot
that's a deep thinker, one that can plan
complex tasks but takes forever to do
anything? Or do you want one that's got
lightning-fast reflexes but, you know,
isn't the sharpest tool in the shed?
Getting both at the same time, that's
been the holy grail in robotics. So the
first crack at solving this problem was
basically to divide the brain. And the
inspiration for this actually came from
a pretty famous idea in human psychology.
That inspiration came from Daniel
Kahneman's dual-system theory. You might
know it from his book Thinking, Fast and
Slow. The idea is that our own minds have
two modes. System one is that fast,
automatic, gut reaction part of you.
System two is the slow, logical,
effortful part that reasons things out.
It's the difference between
instinctively ducking when a ball flies
at your head versus sitting down to
solve a math problem. So, the old way of
building robots tried to copy this
literally. They take two totally
separate AI models, a big powerful one
for the slow thinking and a small
lightweight one for the fast actions.
And they basically just bolted them
together and hoped for the best. But
this created a huge bottleneck. Here's
how it worked. The big system 2 brain,
usually a massive vision language model
or VLM, would analyze the situation.
Then it would basically pass a summary,
a set of instructions over to the little
system one brain, which would then
generate the action. See the problem?
The fast system was completely cut off
from all the rich knowledge and context
in the main brain. It was acting on
secondhand information, which really
held it back. And that is what brings us
to the breakthrough. Instead of two
separate brains kind of clumsily bolted
together, the Fast-in-Slow model
introduces a single unified brain. And
this completely changes the game. The
secret sauce is right here in this quote
from the research paper. The model is
called FiS-VLA, by the way. Instead of
adding a whole separate model, the
researchers did something really clever.
They took the final few layers of the
existing big brain and just repurposed
them. Those last few layers become the
fast reflexive system one while the
entire model is still there to act as
the slow reasoning system two. And this
is just such an elegant idea. It's a
fast system that lives inside the slow
system. They aren't two different things
anymore. They're two parts of a whole
sharing the exact same knowledge, the
same structure. This allows for this
seamless, beautiful coordination between
deep thought and quick reflexes. And
here's how they work together. The big
slow system looks at the big picture. 2D
images, language commands, but it does
this at a lower speed. It's the
strategist. Meanwhile, the little fast
system takes that strategic guidance,
but it also processes a ton of real-time
high-frequency data like the robot's
joint positions and 3D sensor info. And
the key is they run asynchronously at
different speeds, which makes the whole
thing incredibly efficient. Okay, so the
theory sounds amazing, right? It's
elegant. It makes sense. But the real
question is, does it actually work?
Let's check out the data and see how
this new model stacks up against what
came before. Well, in simulations, the
answer is a definite yes. Just look at
this chart. FiS-VLA hits a 69% average
success rate. That's a full 8% better
than the previous state-of-the-art,
CogACT. And compared to another leading
model, it's a 14% jump. That's not a
small improvement. That's a really
significant leap. But here's where it
gets really impressive. In the messy,
chaotic, unpredictable real world,
FiS-VLA showed an average success rate
improvement of 11% across a bunch of
tough tasks. Look, making something work
in a clean simulation is one thing.
Getting this kind of boost in reality?
That is a huge deal. And remember, it's
not just more accurate, it's way faster.
This model runs at a control frequency
of nearly 22 hertz. That means it's
making almost 22 decisions every single
second. That's more than double the
speed of some of the older methods. It
really did break that old trade-off
between being smart and being fast. And
when you look at specific, really tricky
tasks, you can see the difference it
makes. Take folding a towel. That's
incredibly hard for a robot because a
towel is a deformable object. It's
floppy and unpredictable. The old model
succeeded 40% of the time. FiS-VLA, 60%.
That's the kind of complex, delicate
work this unified brain makes possible.
So, we've seen the design, we've seen
the impressive results, but let's zoom
out. What does this all really mean for
the future of robotics? I think there
are three really big takeaways here.
First, this unified mind is just a
smarter, more elegant way to design a
robot. Second, it proves you don't have
to choose between speed and smarts
anymore. You can actually have both. And
third, because the whole system shares
the same brain, it gets much better at
generalizing. You know, handling new
objects it's never seen or dealing with
a cluttered room or bad lighting, just
like you and I do every day. When you
get right down to it, this isn't just
another small improvement. It's a really
foundational step towards creating
robots that can finally leave the
sterile lab or the predictable factory
floor. It's about building machines that
aren't just intelligent thinkers, but
are also coordinated, responsive doers
out in our world. And that kind of
leaves us with this one big fascinating
question to think about. If a robot's
mind and body can finally truly work
together in perfect harmony, moving from
slow, careful thought to instant
reflexive action without a hitch, what
are the next big challenges they're
going to solve?

Summary


### Title: Robotics Innovation: The "Fast-in-Slow" Model for Overcoming Clumsy AI Motion

**Overview**
This video discusses a central challenge in deploying artificial intelligence (AI) on physical robots: the *trade-off* between speed and intelligence. To address it, the video introduces a new model, FiS-VLA ("Fast-in-Slow"), which combines deep reasoning and fast reflexes in a single unified brain.

**Key Points**
*   **Core Problem:** Capable AI models often become slow and clumsy when placed in a robot body because of their heavy processing load.
*   **Previous Solution:** Two separate models (a slow system for thinking and a fast system for acting), limited by the communication bottleneck between them.
*   **New Breakthrough:** The Fast-in-Slow model (FiS-VLA) uses a single brain architecture whose two parts run asynchronously at different speeds.
*   **Performance:** The model reaches a 69% average success rate in simulation (8% above CogACT) and improves average success rate by 11% on real-world tasks, at a control frequency of nearly 22 Hz.

**Detailed Breakdown**

**1. The Challenge of AI in Physical Robots**
Although modern AI models are highly capable, deploying them on physical robots often produces clumsy motion. The core issue is a *trade-off* between being a "deep thinker" (which is slow) and having "fast reflexes" (which is less intelligent). Giant AI models need a lot of processing power, which causes a lag between thought and action.

**2. The Conventional Approach: Dual-System Theory**
Earlier designs followed an approach inspired by Daniel Kahneman's book *Thinking, Fast and Slow*. They combined two separate AI models:
*   **System 2 (Slow):** A large model (usually a VLM) that acts as the main brain, analyzing the situation and planning.
*   **System 1 (Fast):** A small, lightweight model that acts as the reflexes and generates actions quickly.
*   **Limitation:** The fast system is cut off from the rich knowledge held by the slow system. The slow system has to summarize its analysis and pass it along, which creates an information *bottleneck*.

**3. The New Model: FiS-VLA (Fast-in-Slow)**
A new model called FiS-VLA was developed to address this problem.
*   **One Unified Brain:** Rather than splitting the brain into two fully separate units, the model uses a single neural network.
*   **Mechanism:** The final few layers of the existing large model are repurposed as the fast, reflexive system (System 1), while the entire model still serves as the slow reasoning system (System 2).
*   **Shared Knowledge:** Because both are parts of the same structure, the fast system has direct access to the knowledge and world representations held by the slow system.
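
The layer-sharing idea described above can be illustrated with a toy sketch. This is not the authors' implementation: `SharedBackbone`, `make_layer`, `slow_step`, and `fast_step` are hypothetical names, each "layer" is a simple scaling function standing in for a transformer block, and fusing the cached latent with fresh robot state by addition is an assumption made only to keep the example small. What the sketch shows is that a fast step touches only the final layers, while a slow step runs the whole stack.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float) -> Layer:
    """Toy stand-in for one transformer block: scales every feature."""
    return lambda h: [scale * x for x in h]


class SharedBackbone:
    """One layer stack; the last `n_fast` layers double as System 1."""

    def __init__(self, layers: List[Layer], n_fast: int) -> None:
        if not 0 < n_fast < len(layers):
            raise ValueError("n_fast must be between 1 and len(layers) - 1")
        self.layers = layers
        self.split = len(layers) - n_fast  # index where the fast path starts
        self.layer_calls = 0               # crude cost counter

    def _run(self, h: Vector, start: int, stop: int) -> Vector:
        for layer in self.layers[start:stop]:
            h = layer(h)
            self.layer_calls += 1
        return h

    def slow_step(self, observation: Vector) -> Tuple[Vector, Vector]:
        """System 2: run the whole stack, keeping the latent at the split."""
        latent = self._run(observation, 0, self.split)
        output = self._run(latent, self.split, len(self.layers))
        return latent, output

    def fast_step(self, latent: Vector, robot_state: Vector) -> Vector:
        """System 1: fuse the cached latent with fresh robot state
        (assumed here: element-wise addition), then run only the final layers."""
        fused = [a + b for a, b in zip(latent, robot_state)]
        return self._run(fused, self.split, len(self.layers))


# Four toy layers, the last one shared with the fast path.
backbone = SharedBackbone([make_layer(2.0) for _ in range(4)], n_fast=1)
latent, reasoning_out = backbone.slow_step([1.0, 1.0])
action = backbone.fast_step(latent, robot_state=[1.0, 0.0])
```

In this toy setup the slow step costs four layer calls and each fast step costs one, which is the intuition behind running the fast path far more often than the slow one.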

**4. Asynchronous Operation and Evaluation**
*   **How It Works:** The slow system acts as the strategist, processing 2D images and language commands at a lower rate. The fast system takes that guidance and also processes high-frequency sensor data in real time (such as joint positions and 3D sensor information). The two run asynchronously at different speeds.
*   **Simulation Results:** In simulation, FiS-VLA reaches a 69% average success rate, 8% above the previous state of the art (CogACT) and 14% above another leading model.
*   **Real-World Testing:** Across a set of difficult real-world tasks, the model improves average success rate by 11%. On towel folding, a deformable-object task, it succeeds 60% of the time versus 40% for the older model.
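
The two-rate operation can be sketched as a simple schedule in which the fast path acts on every control tick and the slow path refreshes its guidance only every few ticks. This is a single-threaded simulation of the timing, not real asynchronous execution, and the 4:1 ratio is a hypothetical value chosen for illustration (the video does not state the ratio). `run_schedule` and `period_ms` are illustrative helper names. The video cites a control frequency of nearly 22 Hz, so `period_ms` converts that into the time budget per action.

```python
from typing import List


def period_ms(freq_hz: float) -> float:
    """Time available per control step, in milliseconds."""
    if freq_hz <= 0:
        raise ValueError("freq_hz must be positive")
    return 1000.0 / freq_hz


def run_schedule(n_ticks: int, slow_every: int) -> List[int]:
    """For each fast tick, return the index of the slow update it relies on."""
    if slow_every < 1:
        raise ValueError("slow_every must be at least 1")
    guidance_version = -1
    used: List[int] = []
    for tick in range(n_ticks):
        if tick % slow_every == 0:
            guidance_version += 1      # slow system refreshes its guidance
        used.append(guidance_version)  # fast system acts on the latest guidance
    return used


# Hypothetical ratio: one slow update per four fast ticks.
guidance_used = run_schedule(n_ticks=8, slow_every=4)
fast_period = period_ms(22.0)  # roughly 45 ms per action at 22 Hz
```

With these assumed numbers, eight consecutive actions reuse just two slow updates, so the expensive reasoning pass is paid for once per four actions rather than on every step.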

**Conclusion & Closing Message**
The Fast-in-Slow model, FiS-VLA, represents a significant step forward in robotics because it removes the communication bottleneck between reasoning and reflexes. With a single unified brain that runs deep reasoning and fast reactions together, robots become both quicker and more capable, and they generalize better to new objects, cluttered scenes, and poor lighting.


file updated 2026-02-12 02:45:05 UTC