Transcript
Ps6q3-kdVlo • Solving the Robotic Horizon Trade-off: Mixture of Horizons (MoH) Explained
Kind: captions
Language: en
You know that feeling when you're
working on something and you have to
zoom in on the tiny little details, but
at the same time, you can't lose sight
of the big picture. It's a balancing act
we do every day. Well, it turns out that
exact same problem is a massive hurdle
for robots. Today, we are going deep,
and I mean deep, into a really
groundbreaking paper that tackles this
head-on. It proposes a solution that's
so smart, so elegant, it's almost like
giving a robot two minds that work
together in perfect harmony. We're
talking about mixture of horizons in
action chunking. All right, so here's
the plan for our journey. First, we're
going to unpack the robot's dilemma to
see what the problem really is. Then,
we'll get to that what if moment, the
spark that led to the solution. From
there, we'll pop the hood and see how
MoH actually works. We'll see how it
performs in the virtual proving ground,
and then discover how it makes robots
smarter, faster, and more capable in the real world. And
finally, we'll wrap it all up with a big
takeaway. Let's jump right in. Okay, so
before we can really appreciate just how
clever this solution is, we have to get
a solid grip on the problem itself. And
this isn't some minor issue. It's a
fundamental challenge that sits right at
the core of how modern robots think and
act. So at the heart of all this are
what we call vision language action
models or VLAs for short. The easiest
way to think about these is as the
brains of the operation. A VLA model
takes in everything the robot sees with
its cameras. It understands a command
you give it in plain English, like,
"Hey, grab the red block." And then it
has to figure out the exact sequence of
movements to make that happen. Now, to
be efficient, these robots don't just
plan one tiny movement at a time. That
would be way too slow. Instead, they use
a strategy called action chunking, where
they predict a whole sequence of future
actions all in one go. And this right
here brings us to the absolute crux of
the problem. How far into the future
should that plan go? Should the robot
map out its next 10 moves or maybe the
next 30? This length, how far out it
plans, is called the horizon. And what
this research makes clear is that
picking the right horizon is
unbelievably important and also
unbelievably difficult. Because if you
just pick one fixed number, you are
almost always making a compromise. Let's
make this super clear with an analogy.
Think about driving a car. A long
horizon. That's you looking way, way
down the highway. You're thinking about
which lane you need to be in for your
exit that's a mile away. You're planning
your overall route. This is amazing for
long-term strategy, right? It's perfect
for a robot that needs to say open a
cabinet, find a cup, and then pour water
into it. But then there's the short
horizon. That's you watching the road
literally right in front of your tires.
You're focused on a super delicate move
like parallel parking in a tight spot.
You need insane precision, fine-grained
control. A good driver does both of
these things without thinking. But for a
long time, VLA models have been forced to
pick one. They were either a highway
strategist or a precision parker, but
they couldn't be both. And look, this
isn't just a thought experiment. The
paper lays out the data to prove it.
What this chart is showing us is that
trade-off in black and white using a
standard robotics benchmark called LIBERO.
They ran the numbers. When they set the
robot to use a long horizon, planning 30
steps ahead, it was great at complex
long tasks. But its performance on
spatial tasks that need that fine-tuned
precision, it went down. And you guessed
it, the reverse was true. When they used
a short horizon of just 10 steps, it
nailed the spatial tasks, but then
fumbled on the long-term ones. This is
the smoking gun. It proves with data
that no single choice is the best for
every situation. You're always giving
something up. So, it's this really clear
data-backed problem that sets the stage
for the breakthrough. It forces the
researchers to ask a really simple but
incredibly powerful question. You know
that kind of what if moment that really
pushes science forward. And here it is.
This is the question. What if the robot
didn't have to choose? What if we could
somehow build a single model that could
think with multiple horizons at the same
time? A model that could have the
foresight of that highway driver and the
precision of the city parker all at
once. Getting the best of both worlds.
The answer of course is mixture of
horizons, or MoH for short. And this is where the
solution is just so elegant. It's not
some brand new crazy complicated robot
brain that you have to build from
scratch. The paper calls it a
plug-and-play strategy. You can take an
existing powerful model and just add MoH
to it. And this is so important. It does
this with almost no extra computational
cost. It's not going to slow down
training. It's not going to slow down
the robot when it's running. It's
designed to be a lightweight, powerful
upgrade, not a total rewrite. Okay, so
that sounds great in theory, but how
does it actually work? I mean, how do
you get one model to think on multiple
different time scales without just
turning into a confused mess? All right,
let's get into the nitty-gritty of how
MoH actually functions. You can
really break the whole thing down into
four main steps. First, the system takes
a long-term plan and basically
rearranges it into several shorter plans
of different lengths. These are our
different horizons. Second, and this is
the key to making it so efficient, it
processes all of these horizons at the
exact same time in parallel. Third, it
uses a super simple but really effective
little gating tool to intelligently mix
or fuse the predictions from each
horizon. And fourth, it uses a clever
trick called a balance loss to make sure
that no single horizon gets to be the
boss and make sure all of them have a
voice. So, let's zoom in on those first
two steps. Imagine the robot's main plan
is to think 30 steps ahead. Well, the
MoH system doesn't just look at that
one plan. It simultaneously creates and
looks at a 10-step version of that plan,
a 20-step version, and the full 30-step
version. The real magic trick here is
that the part of the robot's brain that
thinks about actions, the action
transformer, is shared. It looks at all
three of these plans in one single
efficient pass. And because that part of
the AI is pretty small compared to the
giant part that handles vision, this
whole process adds practically zero
extra time. It's a ridiculously
efficient way to get multiple points of
view. So now you have these three
different plans. How do you combine
them? This is step three. And I love
this part. The researchers basically
said, let's follow Occam's razor, which is
the principle that the simplest
explanation is usually the right one. So
instead of building some huge complex
network to combine the plans, they use a
tiny little gating mechanism, we're
talking about just 2,000 parameters, in
the world of AI models with billions of
parameters that's like a grain of sand.
This tiny gate just learns to assign
weights. It decides for the specific
moment, should I listen more to the
short-term plan or the long-term one?
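To make those four steps concrete, here's a tiny illustrative sketch in Python. To be clear, this is a toy and not the paper's code: every variable name, shape, and the exact form of the balance penalty in step four are assumptions. The real gate is described as having roughly 2,000 parameters; the one below is even smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

ACTION_DIM = 7            # e.g. a 7-degree-of-freedom arm
HORIZONS = [10, 20, 30]   # short, medium, and long plans

# Step 1: one long 30-step action chunk, viewed as nested shorter plans.
full_plan = rng.normal(size=(30, ACTION_DIM))
plans = {h: full_plan[:h] for h in HORIZONS}

# Step 2: in the real model, a shared action transformer scores all
# horizons in one parallel pass; here we fake a per-horizon prediction
# of the next action by adding a little noise to each plan's first step.
predictions = np.stack(
    [plans[h][0] + 0.01 * rng.normal(size=ACTION_DIM) for h in HORIZONS]
)  # shape: (3, ACTION_DIM)

# Step 3: a tiny gating head maps a context feature to one weight per
# horizon (softmax), then fuses the three predictions into one action.
context = rng.normal(size=16)                  # stand-in for a learned feature
gate_w = rng.normal(size=(16, len(HORIZONS)))  # just 48 parameters here
logits = context @ gate_w
weights = np.exp(logits - logits.max())
weights /= weights.sum()
fused_action = weights @ predictions           # shape: (ACTION_DIM,)

# Step 4: a balance penalty discourages the gate from collapsing onto a
# single horizon, e.g. by pulling the weights toward the uniform mix.
uniform = np.full(len(HORIZONS), 1.0 / len(HORIZONS))
balance_loss = np.sum((weights - uniform) ** 2)
```

The design point the sketch captures: the fusion itself is almost free. All the heavy lifting happens once in the shared backbone, and the mixing is a single softmax-weighted sum.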
It's like a DJ at a mixing board,
constantly tweaking the levels of each
voice to create the perfect final
action. And that brings us to the final
crucial piece of the puzzle. Step four,
keeping things balanced. See, if you
just let the system learn on its own,
that little gating network might get
lazy. It might just figure out, hey, the
30-step planner is pretty good most of
the time. I'll just listen to it. And
that would completely defeat the purpose
of having multiple horizons. So, the
researchers added what they call a
balance loss. Just think of it like a
penalty during training. If the model
starts ignoring one of the planners,
this loss function kicks in and says,
"Nope, you have to pay attention to
everyone." It's like a coach making sure
every player on the team gets a chance
to handle the ball. This forces the
model to actually learn how to use the
unique strengths of every single time
scale. So, the theory is sound, the
mechanics are clever and efficient, but
the real question is, does it actually
work? To find out, the researchers threw
MoH into the ring against some of the
toughest simulated robot challenges out
there. So, let's see how it did. Okay,
this table pretty much says it all. They
took an existing state-of-the-art model
called π0.5, and this thing was already
a beast, scoring a 97.7% average success
rate. Then they just plugged in their
MoH strategy. The results, they're
just wow. On tasks involving objects, it
went from 99% to a flawless 100%. On
goal oriented tasks, it jumped from 97.6
to 98.8. But look at the biggest leap in
that long category. The really complex
multi-step stuff we talked about. It
shot up from 95.4% all the way to 98.4%.
MoH didn't just give it a little boost.
It took an already S tier model and
kicked it into a whole new league. And
all those individual gains added up,
pushing the total average success rate
to a staggering 99%.
In a benchmark with this much variety
and difficulty, hitting a 99% average is
well, it's the new state-of-the-art.
It's just a massive validation of the
whole idea that solving this precision
versus foresight trade-off unlocks a new
level of performance. We're just getting
to some of the really mind-blowing
applications of this tech, but if you're
finding this level of detail useful and
you want to keep up with these kinds of
huge leaps in AI and robotics, this is a
great time to hit that subscribe button.
We go this deep on new papers every
single week. Now, 99% is a big
impressive number, but it's still a bit
abstract, right? So, let's talk about
what the robot was actually doing to get
that score. We're not talking about just
pick up the block. No, these are complex
multi-part commands that demand both
long-range planning and pinpoint
accuracy. Things like put both mocha
pots on the stove or put both the soup
and the cream cheese in the basket or
even open the top drawer and then put
the bowl inside. Successfully doing that
stuff over and over again is what a 99%
success rate actually looks like. It's
the robot understanding a complex goal
and just nailing every single step. And
just to prove this wasn't some one-off
success on a single benchmark, they also
tested MoH on a completely different
benchmark called RoboTwin 2.0. This one is all
about tasks that require two robot arms
working together. And the story was
exactly the same. The base model did
okay, but as soon as they plugged in
MoH, the success rate jumped up
significantly. This proves that the core
idea is general. It doesn't just work
for one robot or one set of tasks. It's
a fundamental improvement that makes
robots more capable and more robust.
Okay, so setting a new record on a
benchmark is awesome, for sure. But what
gets me really excited about Mixture of
Horizons is that it opens the door to
totally new abilities. It doesn't just
make the robot a little bit better, it
makes it fundamentally smarter, way more
efficient, and a lot more prepared for
the chaos of the real world. And that
brings us to maybe the coolest thing to
come out of this paper, something they
call dynamic inference. So, because the
robot is constantly getting predictions
from its short, medium, and long-term
planners, it can basically check to see
if they all agree. Think of it like a
committee meeting. The 10-step plan, the
20-step plan, and the 30-step plan all
cast a vote on the best next move. If
they all vote the same way, that's a
strong cross horizon consensus. The
robot is confident. But if they
disagree, that's a sign of uncertainty.
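That agree-or-disagree check can be sketched in a few lines. Again, this is a hypothetical illustration: the function name, the distance-based agreement rule, and the tolerance threshold below are all assumptions, not the paper's actual consensus mechanism.

```python
import numpy as np

def consensus_steps(plans, tol=0.05):
    """Count how many leading actions all horizon plans agree on.

    plans: list of (T_i, action_dim) arrays, one per horizon.
    Agreement at a step = every plan's action stays within `tol`
    of the plans' mean action for that step.
    """
    shortest = min(p.shape[0] for p in plans)
    steps = 0
    for t in range(shortest):
        actions = np.stack([p[t] for p in plans])
        spread = np.max(
            np.linalg.norm(actions - actions.mean(axis=0), axis=1)
        )
        if spread > tol:
            break  # the committee disagrees: stop committing here
        steps += 1
    return steps

rng = np.random.default_rng(1)
base = rng.normal(size=(10, 7))

# Three horizon plans that agree closely for the first five steps,
# then diverge (think: the robot nears a tricky grasp).
offset = np.zeros((10, 7))
offset[5:] = 1.0
plans = [
    base + 0.001 * rng.normal(size=(10, 7)),
    base + 0.001 * rng.normal(size=(10, 7)),
    base + offset,
]

n = consensus_steps(plans)
# The robot would execute only the first n actions, then replan.
```

In free space, all horizons agree for many steps and the robot commits to long chunks; near a delicate manipulation, consensus shrinks and it replans sooner. That's the adaptive caution (and the throughput gain) the dynamic inference idea is after.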
So for the first time, the robot can
decide for itself in real time how many
steps to take. It only commits to the
actions that have broad agreement across
its minds. The result of this is
behavior that just looks incredibly
intelligent. When the robot gets to a
tricky part of a task, like actually
grasping a weirdly shaped object, the
different planners might disagree a
little bit. So the robot automatically
becomes more cautious. It'll only
execute a few steps before stopping to
replan. But then when it's doing
something easy like just moving its arm
through empty space, all the planners
will be in total agreement. In that
case, the robot gets a boost of
confidence and executes a much longer
sequence of actions, moving quickly and
efficiently. It literally adapts its own
caution level based on its internal
sense of certainty. It's amazing. And
this adaptive behavior isn't just
smarter, it is way faster. So much
faster, in fact, that this dynamic
approach can boost the robot's
throughput by up to 2.5 times. That
means it can get its jobs done in a
fraction of the time. And the craziest
part, it achieves this speed up while
still performing better than the old
slower method. It's that rare, beautiful
win-win scenario. It's both better and
faster. Look, simulations are fantastic,
but the ultimate test, the final boss,
is always the real world. So, the team
got a physical seven degree of freedom
robot arm, set it up with a gripper and
some cameras, and gave it a mix of
tasks. Some were designed to test that
short-term precision, like carefully
putting a piece of bread into a bowl,
and others were designed to test
long-term planning, like putting a pen
in a drawer and then remembering to
close it. And wouldn't you know it, the
results in the real world perfectly
matched what they saw in the
simulations.
What's so compelling here is just how
consistent it is. It didn't matter if it
was the bread task, the milk task, or
the more complex drawer task across the
board for both of the models they
tested. Just plugging in MoH made the
robot succeed more often. The paper even
notes that the MoH robot was more
decisive. It didn't hesitate as much and
its grasps were quicker and more
confident, which led to it completing
tasks faster and more reliably.
Honestly, seeing these ideas jump from
the pages of a research paper into a
real physical robot moving around and
actually manipulating things in the
world, that's the magic right there.
That's why we love breaking this stuff
down for you. If you want to join us for
the next time we see something this
cool, the best thing you can do is hit
that subscribe button. All right, we've
gone through the problem, we've gone
through the method, and we've seen the
incredible results. So, let's zoom out.
What are the main things you should walk
away with? What's the big takeaway from
this whole Mixture of Horizons paper? I
think it really boils down to three huge
contributions. First, they didn't just
have a hunch. They systematically proved
that this critical trade-off between
long-term strategy and short-term
precision was a real bottleneck for
modern robots. Second, they created
MoH, which is this beautifully simple
plug-and-play, low-cost solution that
directly attacks and solves that
trade-off. And third, they showed how
MoH makes something totally new
possible. This dynamic inference that
lets robots adapt their own behavior for
smarter, more stable, and more efficient
action. They didn't just point out a
problem. They delivered an elegant
solution that unlocked a whole new level
of capability. And that leaves us with
one last really big question to chew on.
This research is about so much more than
just a better score on a test. By giving
a robot the power to think on multiple
time scales at the same time, to
constantly weigh its immediate, precise
movements against its long-term goals,
we have to wonder, are we seeing the
very first steps towards robots that
don't just blindly execute our commands,
but can actually begin to truly
strategize? That's the idea this work
leaves us with. Thanks so much for
diving in with me.