Transcript
Jt-m3gho0_0 • V-JEPA & V-JEPA 2 Explained: The Self-Supervised Revolution in Video Understanding
Kind: captions
Language: en
Today we are diving deep into a
fascinating new model from Meta AI called V-JEPA 2. And really, it's a huge step
towards solving one of the biggest
challenges in AI. All starting with a
question that sounds simple but is
actually incredibly profound. So think
about how a baby learns about the world.
You know, they're not sitting down with
textbooks on physics or memorizing
equations. They just watch. They see a
toy fall off a high chair over and over
and over again. They push a block. They
see it slide. And through nothing but
pure observation, they start to build
this intuitive model of reality, a kind
of physical common sense that lets them
navigate the world. We totally take it
for granted. But for an AI, that's been
the holy grail. So the big question, the
mission behind V-JEPA 2 is: can we build a machine that does the same thing, one that learns how the world works not by being programmed, but simply by watching? All
right, so here's the plan for our deep
dive today. First, we're going to tackle
what's called the AI common sense problem, basically why this is so tough
for machines. Then we'll get into the
really clever solution at the heart of
this research, a totally different way
to predict things called JEPA. After
that, we'll follow the AI through its
two-part education. Phase one is where
it basically binge watches the internet.
No, seriously. And in phase two, it
learns to actually do things. Then comes
the fun part. We'll see this AI power
real world robots with some pretty
incredible new skills. And finally,
we'll zoom out and talk about what this
all means for the future. So, let's kick
things off with that fundamental
challenge. Look, we've had AI for
decades that can do amazing things,
right? It can beat grandmasters at chess.
It can fold proteins, but those are all
digital worlds with clear rules. Taking
that AI and putting it into our messy,
unpredictable physical world, that has
been ridiculously hard. To be a useful
helper, an AI needs more than just
pattern recognition. It needs an
intuitive grasp of reality, cause and
effect, how objects behave. It needs,
for lack of a better word, common sense.
You and I, we do this every second of
every day. We're all walking around with
this incredibly sophisticated simulation
of the world running in our heads. We
call it an internal world model. You see
a coffee mug sitting a little too close
to the edge of your desk. Your world
model instantly runs a simulation. It
predicts it's going to fall, it's going
to speed up, it's going to hit the
floor, and it's probably going to make a
huge mess. You don't sit there and
calculate the physics. You just know.
And that's what lets us understand
what's happening, predict what's about
to happen, and plan our actions. Giving
an AI a world model that powerful, well,
that's the ultimate goal. And hey, this
isn't some brand new idea. It's a vision
that pioneers in the field like Yann LeCun
have been talking about for a while. As
he put it, the real challenge is to get
AI to learn and act largely by
observation. And why is observation so
key? Because the world itself is the
ultimate data set. Every video on the
internet is packed with information
about physics, objects, and causality.
The secret, according to this line of
thinking, is to build models that can
tap into this gigantic unlabeled library
and figure out the structure of the
world all by themselves. You know, just
like a kid does. So, if the idea is so
simple, what's the catch? Why has this
been so hard? Well, there have been
three massive roadblocks. First, there's
a data problem. There are zillions of
videos of the world, but there's almost
no data of robots actually interacting
with the world. Getting that kind of
data is slow and expensive. Second, the
cost is insane. Early ideas for world
models tried to get the AI to predict
the future pixel by pixel. That's like
trying to paint a photorealistic movie
of what's going to happen next. It's not
just computationally impossible. It's
also a waste of time. It forces the AI
to worry about tiny unpredictable
details like the exact shimmer of light
on a chrome toaster. And third, these
models were terrible at generalizing.
You could train a robot to pick up a red
block in a lab, but if you showed it a
blue cup in a slightly different
kitchen, it would just completely fail.
It memorized a task. It didn't
understand it. So, to get around these
huge hurdles, the researchers at Meta AI
came up with a fundamentally smarter way
for an AI to learn to predict the
future. And that brings us to the core
innovation here: the Joint Embedding Predictive Architecture, or JEPA for short. And trust me, it's a total shift in
how we think about teaching machines to
see. So just so we're all on the same
page, when we say world model from here
on out, this is exactly what we mean.
It's the AI's own internal simulation of
reality. It's not programmed. It's
learned from data like video. And its
whole job is to let the AI understand
the world, predict what happens next,
and most importantly, plan how to act.
Okay, this slide right here is the
absolute key to understanding this whole
thing. On the one hand, you have the old
way, generative models. They learn about
the world by trying to predict every
single pixel. Imagine a video of a cat
walking behind a couch. A generative
model tries to predict the exact texture
of the cat's fur, the specific glint in
its eye, the pattern on the wallpaper.
It's trying to be a photorealistic
painter, but most of that stuff is
random, unpredictable, and frankly
irrelevant to understanding the cat. It's
incredibly inefficient. But then you
have JEPA. It takes a totally different
path. It doesn't care about pixels. It
works in an abstract representation
space. Basically, a mathematical space
where the meaning or the idea of the cat
is captured, not its literal pixels. The
police sketch artist analogy is perfect.
The artist doesn't draw every single
pore on a person's face. They capture
the essence, the shape of the nose, the
eyes, the high-level features that actually matter for recognition. JEPA does
exactly that, but for reality itself. So
if you remember just one thing from this
section, make it this: JEPA builds its
world model by learning to predict the
abstract idea of what's coming next, not
by trying to paint a perfect picture of
it. This forces the model to just ignore
all the visual static and focus only on
the stuff that's actually predictable
about how the world works. And that
makes it way more efficient and leads to
a much deeper understanding of things.
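Just to make that difference concrete, here's a tiny sketch in PyTorch of the two objectives side by side. To be clear, the shapes and the little linear layers are stand-ins I'm making up for illustration, not anything from Meta's actual code. The point is simply where the loss lives: in pixel space for the generative route, in representation space for JEPA.

```python
# A minimal sketch (toy shapes, stand-in linear layers) of the two objectives.
# This is not Meta's code; it's just "predict embeddings, not pixels."
import torch
import torch.nn.functional as F

B, T, C, H, W = 2, 8, 3, 64, 64         # a tiny batch of short video clips
D = 256                                  # size of the abstract representation

context = torch.randn(B, T, C, H, W)     # the part of the video the model can see
target  = torch.randn(B, T, C, H, W)     # the part it has to reason about

# Generative / pixel route: predict every pixel of the target and pay for every
# unpredictable detail (fur texture, glints, wallpaper patterns).
predicted_pixels = torch.randn(B, T, C, H, W)       # stand-in for a decoder's output
pixel_loss = F.mse_loss(predicted_pixels, target)   # loss lives in pixel space

# JEPA route: encode both pieces into abstract vectors and predict only in that space.
encoder   = torch.nn.Linear(C * H * W, D)           # stand-in for a video encoder
predictor = torch.nn.Linear(D, D)                   # stand-in for the predictor

ctx_repr    = encoder(context.flatten(2))            # abstract ideas of what's visible
target_repr = encoder(target.flatten(2)).detach()    # target idea, no gradient through it
latent_loss = F.mse_loss(predictor(ctx_repr), target_repr)   # loss in representation space

print(pixel_loss.item(), latent_loss.item())
```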
This is such a powerful concept, right?
It's the foundation for everything else
we're about to talk about. And hey, if
you love getting into the weeds on big
ideas like this in AI, this is what we
do. So, hitting that subscribe button is
the best way to make sure you don't miss
our next deep dive. Okay, so we've got
the theory down. This elegant idea of
predicting ideas, not pixels. Now, let's
see how they actually put it into
practice. The journey to create V-JEPA 2
kicks off with phase one. Think of this
as the observational learning phase, but
on a scale that is just hard to wrap
your head around. The goal here is
simple. Build that foundational
understanding of the world by having the
AI do nothing but watch videos. We are
talking about the ultimate epic binge
watch of the entire internet. So, how
does it learn anything if no one is
telling it what it's looking at? Well,
it uses this really elegant process
called self-supervised learning. It
basically makes up its own little
puzzles to solve millions of times a
second. First, it takes a video and kind
of chops it up into a grid of patches
like a mosaic. Then, and this is the
key, it just randomly hides huge chunks
of the video. Just blacks them out.
Then, a part of the model called the
encoder looks at all the visible parts
and turns them into those abstract
representations we were just talking
about. And finally, another part, the
predictor, has to solve the puzzle.
Based on what it can see, it has to
guess the abstract idea of what's in the
hidden parts. By playing this game of
digital peekaboo with itself over and
over, it's forced to learn the rules of
the world. It learns that if you see a
ball here, its representation is
probably going to be over there a second
later. It learns about gravity and
momentum without a single human ever telling it a thing.
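If you want to picture that puzzle game as a few lines of code, here's a toy version in PyTorch. Everything in it, the patch count, the mask ratio, the simple Transformer layers, is an assumption for illustration rather than the real V-JEPA training recipe, but it shows the shape of the idea: hide big chunks, encode what's visible, and predict the representation of what's hidden.

```python
# A toy sketch of the "hide chunks and guess the idea" game. Shapes, mask ratio, and
# the plain Transformer layers are illustrative assumptions, not the actual V-JEPA code.
import torch
import torch.nn.functional as F

num_patches, dim, mask_ratio = 512, 256, 0.75   # one clip chopped into 512 patch tokens

tokens = torch.randn(1, num_patches, dim)       # stand-in for the clip's patch embeddings

# Randomly hide a large fraction of the video.
perm    = torch.randperm(num_patches)
hidden  = perm[: int(mask_ratio * num_patches)]
visible = perm[int(mask_ratio * num_patches):]

encoder   = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
predictor = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

# The encoder only sees the visible patches and turns them into abstract representations.
visible_repr = encoder(tokens[:, visible])

# The predictor must guess the representations of the hidden patches from what it can see.
# (Here we crudely pool the visible context; the real predictor also gets position cues.)
query = visible_repr.mean(dim=1, keepdim=True).expand(-1, len(hidden), -1)
guess = predictor(query)

# The "answer key" is the representation of the hidden patches themselves, computed
# without gradients (in practice, by a slowly updated target encoder).
with torch.no_grad():
    answer = encoder(tokens[:, hidden])

loss = F.mse_loss(guess, answer)   # solve the puzzle in representation space, not pixels
loss.backward()
```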
And when I say it plays this game a lot, I mean a lot. V-JEPA 2 was pre-trained on over 1 million
hours of video from the internet. To put
that into perspective, for you to watch
that much video, you would have to sit
there 24/7 for more than 114 years. It
is that immense amount of data that lets
it build such a solid general
understanding of how our world really
works. And this wasn't just a random
grab bag of cat videos. The training data
was a specific mix designed to give the
model a really well-rounded education.
You can see it includes things like ego
video, which is footage from a
first-person view, like a GoPro, that
teaches it what the world looks like
when you're actually in it. And then
there are massive amounts of exo video,
which is the normal third person stuff
from YouTube that's crucial for
understanding how things interact. It
even watched tons of how-to videos. It's
this variety that gives it a 360° view
of reality. The research also really
drives home a key lesson in AI today.
Scale matters a lot. This chart is
awesome. It shows how much better the
model got with each scaling ingredient.
Just going from a smaller data set to 22
million videos gave it a solid one-point
accuracy boost. Then making the model
itself bigger, scaling it up to over a
billion parameters added another point
and a half. And finally, just letting it
train for longer on higher res video
added another 1 and a half. Every step
up gave a clear, measurable improvement.
When it comes to learning about the
world, bigger is definitely better.
Okay, so at the end of phase 1, we have
an AI that is an absolute expert
observer. It's seen 114 years of
reality. It has this deep intuitive feel
for physics, but it's completely
passive. It knows what happens when a
ball falls, but it has no idea that it
could cause a ball to fall. And that is
where phase 2 comes in. This is where
the model learns to act. It's the step
where V-JEPA 2 becomes V-JEPA 2-AC. And that AC stands for action-conditioned. And this
cooking analogy just nails the
difference. Phase one is like watching
every single episode of every cooking
show ever made. You know what all the
ingredients look like. You've seen all
the techniques. You understand how a
recipe works. You've got all this
amazing theoretical knowledge, but
you've never actually held a knife.
Phase 2 is like finally getting to spend
a few hours in a real kitchen. You
actually pick up the knife. You feel the
resistance of chopping an onion. You
feel the heat from the stove. It's where
all that passive knowledge gets grounded
in the real physical world of cause and
effect. And here's what's absolutely
nuts. After learning from over a million
hours of video, V-JEPA 2 only needed 62
hours of robot data to learn how to act.
That's it. Less than 3 days worth of
watching an unlabeled robot arm move
around. That tiny amount of hands-on
experience was all it took to connect
its vast world knowledge to the reality
of a physical body. And that points to a
super-efficient way to teach robots new skills. So, here's the clever part.
Technically, they take that super smart
video encoder from phase 1 and they
freeze it. All its knowledge is locked
in. Then, they train a new specialized
predictor. And this one gets two pieces
of information. The current state of the
world and a potential action the robot
could take. Its whole job is to answer
the question, if the world looks like
this now, and the robot does this, what
will the world's abstract idea look like
a moment later? And crucially, they
train it to predict not just the very
next step, but multiple steps into the
future. That makes its long-term
forecasts way more stable and stops tiny
prediction errors from spiraling out of control.
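Here's roughly what that phase-2 setup looks like if you squint at it as code. Again, this is a hand-wavy sketch: the encoder, the sizes, and the little MLP predictor are placeholders I'm assuming for illustration, but it captures the recipe from the paper: freeze the phase-1 encoder, then train a new predictor on (state, action) pairs and roll it out several steps so errors can't snowball.

```python
# A rough sketch of the phase-2 idea: freeze the video encoder, then train a new
# predictor that maps (current representation, robot action) -> next representation,
# rolled out several steps ahead. All shapes and modules are illustrative assumptions.
import torch
import torch.nn.functional as F

dim, action_dim, horizon = 256, 7, 4            # 7-dim actions and a 4-step rollout (assumed)

encoder = torch.nn.Linear(3 * 64 * 64, dim)     # stand-in for the pretrained video encoder
for p in encoder.parameters():
    p.requires_grad = False                     # phase-1 knowledge stays frozen / locked in

predictor = torch.nn.Sequential(                # the new action-conditioned predictor
    torch.nn.Linear(dim + action_dim, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, dim),
)

# One trajectory from the (small!) robot dataset: frames o_0..o_H and actions a_0..a_{H-1}.
frames  = torch.randn(horizon + 1, 3 * 64 * 64)
actions = torch.randn(horizon, action_dim)

with torch.no_grad():
    true_reprs = encoder(frames)                # the abstract state at every step (targets)

# Roll the predictor forward from the first state, feeding its own outputs back in,
# and penalize drift at every step so long-horizon forecasts stay stable.
state = true_reprs[0]
loss = 0.0
for t in range(horizon):
    state = predictor(torch.cat([state, actions[t]]))
    loss = loss + F.mse_loss(state, true_reprs[t + 1])

loss.backward()
```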
All right, the training is done. The model has watched the
internet. It's gotten its hands-on
experience, and now it has an action
conditioned world model. This is the
moment of truth. Can we take this
digital brain, put it in a real robot,
and have it do things it has never ever
been explicitly trained for? It's time
to put it to the test. And this is where
we get to the magic of zero-shot
control. This is the ultimate final exam
for any robotics AI. It means you take
your fully trained model, put it on a
robot in a lab it's never seen with
objects it's never touched, and you give
it a task without any new training. It
has to rely completely on its
generalized world model to figure out
what to do. This is the true test of
whether it actually understands the
world or if it just memorized a bunch of
situations. So, how does the robot
actually figure out what to do? Let's
get inside its head. It runs through
this continuous, super fast planning
loop. Step one, see. The robot looks at
the world as it is right now, and it's
given a picture of the goal, what it
should look like. Step two, imagine.
This is the amazing part. It uses its
world model to mentally play out
thousands of different action sequences.
What if I move left then down? What if I
go forward and close my hand? Step
three, calculate. For every one of those
imagined futures, it predicts the
outcome and figures out which one gets
it closest to that goal image. Step
four, select. It picks the best plan.
But here's the really clever bit. Step
five, execute. It only does the very
first step of that plan. And then step
six, repeat. It immediately throws the
rest of the plan away, looks at the
world again, and starts the entire see,
imagine, calculate loop from scratch.
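And if it helps to see those six steps as one loop, here's a bare-bones sketch of that kind of sampling-based, replan-every-step control. The encoder, predictor, candidate counts, and action sizes below are all stand-ins I'm assuming for illustration, not the paper's actual planner; what matters is the pattern: imagine many futures in representation space, score them against the representation of the goal image, and only ever execute the first action of the winner.

```python
# A bare-bones sketch of the see / imagine / calculate / select / execute / repeat loop
# (random-sampling, receding-horizon control). Encoder, predictor, and all numbers below
# are assumptions for illustration, not the paper's planner.
import torch

def plan_one_step(encode, predict, current_frame, goal_frame,
                  num_candidates=1024, horizon=5, action_dim=7):
    """Return the single best first action; the caller then replans from scratch."""
    state = encode(current_frame)                        # step 1: see the world right now
    goal  = encode(goal_frame)                           #         and the goal image

    # step 2: imagine many candidate action sequences with the world model
    actions = torch.randn(num_candidates, horizon, action_dim)
    imagined = state.expand(num_candidates, -1)
    for t in range(horizon):
        imagined = predict(imagined, actions[:, t])      # predicted abstract next state

    # step 3: calculate how close each imagined future ends up to the goal idea
    scores = -torch.linalg.vector_norm(imagined - goal, dim=-1)

    # steps 4-5: select the best plan, but execute only its very first action
    best = scores.argmax()
    return actions[best, 0]

# Toy stand-ins so the sketch runs end to end (step 6 is the outer loop below).
dim = 256
enc_layer  = torch.nn.Linear(3 * 64 * 64, dim)
pred_layer = torch.nn.Linear(dim + 7, dim)
encode  = lambda frame: enc_layer(frame)
predict = lambda s, a: pred_layer(torch.cat([s, a], dim=-1))

frame, goal = torch.randn(3 * 64 * 64), torch.randn(3 * 64 * 64)
for step in range(3):                                    # step 6: repeat the whole loop
    action = plan_one_step(encode, predict, frame, goal)
    print("execute first action of best plan:", action.shape)
    # (a real robot would now act, observe a new frame, and replan from scratch)
```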
This makes the robot incredibly adaptive
if something unexpected happens. And the
results, they are just incredible. This
table compares V-JEPA 2-AC to a top-tier
model called Octo, which learns by
imitating humans. For a simple reach
task, they both get it right every time.
No surprise there. But look what happens
when it gets harder. For grasping an
object, the imitation model succeeds
less than 8% of the time. It just can't
adapt. V-JEPA 2-AC, using its internal planning, nails it 45% of the time. And for pick and place, the gap is even bigger. The imitation model barely works, while V-JEPA 2-AC succeeds almost 73% of the time. This just shows the raw
power of planning with a real world
model versus just trying to copy what
you've seen before. But what about
planning speed? This slide is
unbelievable. It compares V-JEPA 2-AC to
Cosmos, another world model that plans
by predicting pixels. The difference is
just night and day. For Cosmos to plan a
single action takes 4 minutes. And
because it's so slow, it failed the pick
and place task every single time. V-JEPA 2-AC,
because it's planning in that efficient
abstract space, takes just 16 seconds.
It's not just better, it's monumentally
faster. This makes it a practical tool,
not just a lab experiment. These results
are a huge leap forward, and they really
point towards a future with much more
capable robots. We're about to get into
what that future might look like. So,
this is the perfect moment to remind you
to subscribe if you want to keep
following these breakthroughs with us.
Okay, so we've seen the problem. We've
seen the clever solution, the training
process, and the jaw-dropping results.
For this last part, let's just pull back
and look at the big picture. What can we
do with this? What are the limitations?
And what's the road ahead for this
amazing technology? The paper's own
conclusion really says it all. This work
is a powerful demonstration that this
recipe, learning from tons of passive
video and a little bit of physical
interaction can actually produce a world
model that is capable of real planning
in the real world. It's a huge
validation of a vision that people have
been chasing for years. So why does all
this matter for you and me? Well, the
potential is just massive. This is the
kind of foundational tech that could
finally give us the robotic assistants we've always dreamed of, ones that can
actually handle the messiness of a real
home, not just a predictable factory
floor. Or think about wearable
assistants, maybe built into a pair of
glasses, that use a world model to warn
you about traffic or help you navigate a
crowded space. This isn't just about one
specific robot. It's about building the
foundation for groundbreaking apps in
all sorts of different fields. And it's
so important to remember that this model
is not a one-trick pony. It's
state-of-the-art across a whole range of
skills. It has amazing understanding for
classifying motion. It sets a new bar
for prediction, guessing what a person's about to do. We just spent a bunch of time on its incredible ability for zero-shot planning and control. And
it's even a top performer in reasoning
about videos. It's a truly foundational
model for building more general
intelligence. Now, of course, the job
isn't done. The researchers are really
upfront about the limitations, which are
basically the next exciting research
problems to solve. The model is still a
bit sensitive to where the camera is
placed. It's also not great yet at
really long-term planning, like figuring
out all the steps to make a pot of
coffee. And right now, you have to give
it a picture of what you want. You can't
just tell it. So, the road ahead is all
about tackling these things, maybe with
models that can break big goals into
smaller steps. And the most exciting
one, connecting all this to natural
language so you can just talk to it. And
that really leaves us with one final big
question to think about. We've just seen
a model that can build a real intuitive
understanding of the world just by
watching. We've seen it use that mental
model to act and plan in totally new
situations. The next great step is
bridging that physical skill with our
language. So, I'll leave you with this
to chew on. What happens when an AI can
not only understand our world, but can
also plan and act inside it to achieve
complex goals that we describe to it in
plain simple language? Because that is
the future this research is building
towards.