Is V-JEPA the End of the LLM Era? Yann LeCun's New Vision for AI
kjsw2JTn7jY • 2026-01-01
Transcript preview
Okay, so let's just jump right in. For the last few years, the world of AI has been all about one thing, right? The large language model. But today, we are looking at something that asks a pretty wild question: what if that entire approach, the very foundation of models like ChatGPT and Gemini, is actually a dead end? Meta's chief AI scientist, the Turing Award winner Yann LeCun, just co-authored a paper that feels less like a small update and more like a quiet revolution. We're not just talking about a better model here. We're talking about a fundamentally different way to build intelligence itself, a way that cares more about understanding the world than it does about just predicting the next word. And honestly, this could change everything we thought we knew about the path to AGI.

And that really gets us to the heart of this whole deep dive. Are we standing on the edge of a massive shift? I mean, LLMs have given us some incredible, almost magical stuff. They can write code, draft legal documents, even whip up poetry. But they also have huge, fundamental weaknesses. They hallucinate. They have zero common sense. And they just don't have a real grasp of the physical world. In a way, they're brilliant mimics. So what if the future isn't about making the mimicry more perfect? What if it's about building something that doesn't have to mimic at all, because it actually understands? What if there's a new kind of AI coming that's not just an upgrade but a completely different road altogether?

This one quote is the bedrock of everything we're about to talk about. For years, while the whole world was obsessed with how well LLMs could talk, Yann LeCun has been hammering this exact point. His argument is that we've become fascinated with the exhaust fumes, the language, while completely ignoring the engine: a true internal model of how reality actually works. You know, think about how a baby learns. A baby doesn't learn about gravity by reading a textbook. It learns by dropping its spoon off the high chair a hundred times, watching what happens, and building an intuitive, non-verbal model of physics. The language to describe it comes way, way later. LeCun says that for AI to take that next big leap, it has to do the same thing: understand first, talk second.

So here's how we're going to break this down. First, we'll frame this new thing as a direct challenge to the LLM throne. Then we're going to spend some real time on the key difference between thinking and generating. After that, we'll go inside the mind of this new model, V-JEPA. We'll look at the actual data to see if it really is smaller, faster, and smarter. Then we'll zoom out to the big picture, the grand vision of a world model, because that's the real end goal here. And finally, just to keep it real, we'll look at some of the critiques and talk about where this tech is at today.

All right, let's kick it off. Section one: it's time to officially meet the challenger in this story, V-JEPA. And it's so important to get this right. This is not just another horse in the LLM race. This is a model built on a whole different philosophy, running in a completely different race. So yeah, the name is a total mouthful: Vision-Language Joint Embedding Predictive Architecture. But let's just unpack it real quick. Vision-language: okay, it connects what it sees with words. Joint embedding: that just means it learns to put images and text in the same idea space. Predictive architecture: that's about how it learns.
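To make that last phrase a bit more concrete, here is a minimal, hypothetical sketch of the joint-embedding-predictive training idea: predict the embedding of the hidden part of an input from the embedding of the visible part, so the loss lives in meaning space rather than pixel space. Every module, name, and size below is invented for illustration; this is not the paper's code.

```python
# Minimal sketch of a JEPA-style training step (illustrative, not Meta's code).
# Idea: instead of reconstructing pixels, predict the *embedding* of the
# hidden part of the input from the embedding of the visible part.
import torch
import torch.nn as nn

DIM = 256  # embedding size; arbitrary for this sketch

context_encoder = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))
target_encoder = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))
predictor = nn.Linear(DIM, DIM)  # maps context embedding -> predicted target embedding

def jepa_step(visible, hidden):
    """One training step. `visible` and `hidden` stand in for features of the
    two halves of an input (e.g. unmasked vs. masked video patches)."""
    ctx = context_encoder(visible)
    with torch.no_grad():               # target embedding is held fixed here
        tgt = target_encoder(hidden)    # (real systems often use an EMA copy)
    pred = predictor(ctx)
    # The loss lives in embedding space, not pixel space: nothing is generated.
    return ((pred - tgt) ** 2).mean()

loss = jepa_step(torch.randn(8, DIM), torch.randn(8, DIM))
loss.backward()
```

The design choice to notice is the loss: at no point does the model decode anything back into pixels or words during training.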
But honestly, the most important words here aren't even in the name. They're "non-generative." That is the key that unlocks this whole thing. See, unlike a model that's trained to just guess the next pixel or the next word, V-JEPA is trained to predict a more abstract idea of the content. Its job is to build a rich internal concept, a real understanding of what it's seeing. The words? They're just a label you can stick on that understanding.

Okay, on to section two. And this is really the core of it all. To get what V-JEPA is doing, we have to really dig into this fundamental split in AI philosophy: the difference between generating an answer and actually forming an understanding. This is the paradigm shift we're talking about. Now, what's so wild here is the process itself. Generative AI literally has to talk to think. Imagine you ask it a complex question. It predicts the most likely first word, let's say "the." Then, based on your question and that word "the," it predicts the next most likely word, maybe "answer." It just keeps going like this, word by word, token by token, until it decides it's done. It's kind of like building a bridge one plank at a time without seeing the other side of the canyon. It just trusts that its rules for placing the next plank will get it there. It doesn't actually know the full answer until it's finished saying it.

V-JEPA's approach is completely different. It looks at a whole video or image and predicts a single, holistic meaning vector. Think of it like a coordinate in a giant multi-dimensional thought space. That one vector is the understanding. It has all the rich info about the scene packed into it. The model thinks first, in silence, and then translating that thought into human language is a totally separate, optional step.

And this analogy just nails the difference. Using a generative AI is like brainstorming with someone who thinks out loud. They're exploring the idea as they talk, and sometimes they go down a weird path and say something that makes no sense. That's what we call a hallucination. It's discovery through speaking. But talking to a non-generative model is more like asking an expert for their opinion. The expert has already seen the situation, processed everything internally, and come to a stable conclusion. They aren't figuring it out on the fly. They already know. They're just waiting for you to ask for the summary. The confidence, the stability, the whole vibe, it's worlds apart. One is a stream of consciousness. The other is a settled conclusion.

So here's the bottom line. This isn't some minor technical tweak. It is a fundamental shift in how AI reasons. For years, we've been building systems that reason in tokens. Their whole world is made of little bits of language, and their thought is just the statistical connection between those bits. The JEPA approach suggests a future where AI reasons in meaning, in a deeper, more abstract space. In this new world, language isn't the thought itself. It becomes what it is for us: a user interface for a much deeper, non-linguistic understanding of the world.
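If those two styles of "thinking" still feel abstract, here is a toy contrast in code. Both models are random stand-ins; the point is purely the control flow: a generative model only has its answer once it has finished emitting it, token by token, while the embedding-predictive model produces one vector in a single pass, and decoding it into words is a separate step.

```python
# Toy contrast between the two modes of "thinking" described above.
# Both "models" are random stubs; only the control flow matters here.
import random

VOCAB = ["the", "answer", "is", "simple", "<eos>"]

def next_token_stub(prompt, so_far):
    """Stand-in for an LLM's next-token prediction."""
    return random.choice(VOCAB)

def generative_answer(prompt):
    # The generative loop: the model only "knows" its answer once it has
    # finished saying it, one token at a time.
    tokens = []
    while True:
        tok = next_token_stub(prompt, tokens)
        if tok == "<eos>" or len(tokens) > 20:
            break
        tokens.append(tok)
    return " ".join(tokens)

def embedding_answer_stub(video):
    # The JEPA-style alternative: one pass yields a single meaning vector
    # (the "thought"); turning it into words is a separate, optional step.
    meaning_vector = [random.random() for _ in range(8)]
    return meaning_vector

print(generative_answer("what is happening?"))
print(embedding_answer_stub("clip.mp4"))
```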
Okay. So, what does this new way of thinking in meaning vectors actually look like in practice? Well, in section three, we're going to try to peek inside the mind of V-JEPA as it watches a video, and we'll see how its ability to track meaning over time gives it a way more stable and coherent view of the world.

So first, let's look at the old way of doing things. A standard reactive vision model watches a video like a person with extreme short-term memory loss. It looks at frame one. Its pattern matchers go off and it shouts "hand." Then the next frame comes. It totally forgets the first one and shouts "bottle." It has no context, no memory of what happened a tenth of a second ago. Its output is just this jumpy, chaotic stream of guesses. It's not understanding an action. It's just reacting to a series of snapshots. This is why those older systems were so easy to fool, and why their descriptions felt so random. Here's a perfect analogy for it: the old model is like a cheap CCTV motion detector that just yells out the name of whatever object it thinks it sees every time a pixel changes. It's just noise.

V-JEPA, on the other hand, is built to act more like a person. When you watch a short video clip, you don't narrate every single millisecond. You don't say "hand moving, fingers extending, cylinder approaching." You just watch patiently. You put the information together over a few seconds, and then you come to a clear, high-level conclusion: ah, okay, he's picking something up. That ability to wait, watch, and synthesize is the key to moving from just seeing to actually understanding.

So how does it actually do this? Well, the model doesn't just spit out a final label. Internally, you can kind of picture its thought process as a cloud of possibilities. When a new action starts, it might have a bunch of initial, low-confidence guesses. In the demos, you see these as flickering red dots. Those are the model's first impressions. But as it sees more frames, it gathers more evidence. It sees the hand keep moving. The fingers close around the object. The object starts to lift. And as that evidence piles up, those scattered red dots of possibility start to merge and drift toward a single, stable point in that meaning space. Once its confidence is high enough, it locks in. That's the blue dot. The blue dot is a stabilized understanding, the moment the AI says, "Okay, I'm now pretty sure the action picking up a canister just happened." This gives it a real sense of time: of a before, a during, and an after. There's a rough sketch of this stabilization idea right after this section.

And hey, if you're geeking out over this breakdown of how these AI models work, and you think the shift from just pattern matching to real understanding is as cool as I do, this is exactly the kind of stuff we get into every week. So, to make sure you don't miss our next deep dive into the tech that's literally shaping our world, take a second and hit that subscribe button. It seriously helps us out, and it keeps you ahead of the curve.
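As promised, here is one hedged way to picture that red-dots-to-blue-dot behavior: average noisy per-frame guesses into a running belief, and only "lock in" once the belief stops moving. The frame encoder is a stub and the threshold is invented; nothing here comes from the paper, it just mirrors the evidence-accumulation story above.

```python
# Sketch of evidence accumulation over frames (illustrative assumptions only).
import numpy as np

def embed_frame_stub(frame):
    """Stand-in for a per-frame encoder: a noisy view of the frame's meaning."""
    return frame + np.random.normal(scale=0.3, size=frame.shape)

def watch(frames, lock_threshold=0.05):
    """Accumulate per-frame guesses into a running belief; commit once stable."""
    total, prev_belief, belief = None, None, None
    for n, frame in enumerate(frames, start=1):
        z = embed_frame_stub(frame)                  # a "red dot": one noisy guess
        total = z if total is None else total + z
        belief = total / n                           # running mean of all evidence
        if prev_belief is not None:
            drift = np.linalg.norm(belief - prev_belief)
            if drift < lock_threshold:               # belief has stopped moving
                return belief, n                     # the "blue dot": locked in
        prev_belief = belief
    return belief, None                              # never stabilized

true_meaning = np.ones(8)                # pretend every frame depicts one action
frames = [true_meaning] * 50
belief, locked_at = watch(frames)
print("locked in at frame:", locked_at)
```

Early on, each new frame moves the belief a lot (the flickering dots); as evidence piles up, each frame moves it less and less, until the drift falls under the threshold and the model commits.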
All right, let's get into section four. Now, everything we've talked about so far is a really cool theory, but in engineering and machine learning, theory is cheap. What matters are the results. So does this elegant, more humanlike way of understanding actually perform any better? Let's look at the numbers and see if it really is smaller, faster, and smarter. This is the million-dollar question, isn't it? A different way of doing things is interesting for researchers, but for the rest of us, what really matters is performance. Does thinking in meaning actually give you better, more accurate results than the old way of just thinking in words? Does this elegance actually translate to being more effective?

Well, the paper gives us some pretty clear answers. Let's take a look at the scoreboard here. The paper runs tests comparing V-JEPA to big, powerful models like CLIP. The tasks are called zero-shot or few-shot learning, which is a really important test of an AI's ability to generalize. Basically, can it describe or classify a video of something it's never been explicitly trained on before? And the results are pretty stark. On things like video captioning and classification, the older models learn really slowly. V-JEPA, on the other hand, just pulls way ahead, learning a lot more from the same amount of data and getting to a higher quality of understanding much, much faster. This is pretty strong proof that its internal meaning vectors are a more efficient way to learn about the world than just connecting pixels to words.

Now, here's the part that's really going to blow your mind. Usually, when a new model comes out and crushes the old ones, you expect it to be some giant, energy-guzzling monster. So how big is the V-JEPA model they used in these tests? It's about 1.6 billion parameters. And that number is just incredible, because it gets these better results with roughly half the number of trainable parameters of a lot of the traditional models it's up against. In the world of machine learning, that is the holy grail. It's like building a new car engine that's twice as powerful but gets twice the gas mileage. Getting way better performance from a model that's smaller, more efficient, and cheaper to train and run: that's a huge, huge win. It really suggests that this whole approach isn't just different. It's fundamentally better at getting at the essence of the data.

So this brings us to the really big idea in section five. V-JEPA's crazy efficiency and performance? That's not the end goal. It's a means to an end. This was never just about making better video classifiers. The ultimate goal here is something way more ambitious, something world-changing: creating a true world model. Yann LeCun puts the state of AI in perfect perspective with this quote. We've built AIs that are amazing in the abstract world of language. They can pass the bar exam, which is all about manipulating text. And yet we have completely failed to build AIs that can function in the physical world. A robot that can reliably clear your dinner table without breaking a plate is still science fiction. A truly self-driving car that can learn to drive with the speed and intuition of a teenager? We are not there yet. And the reason for that gap is that these physical tasks need a deep, predictive understanding of cause and effect, of physics, of how the world works. They need common sense, and that's something language models just don't have.

And that, that is the ultimate goal of all of this research. The vision is to create an AI that learns an intuitive model of physics and causality just by watching the world. Not by memorizing equations from a textbook, but by watching thousands of hours of video and learning that things fall down, that liquids spill, that you can't walk through walls. It's about building a model that can predict not the next word in a sentence, but the next few frames of a video. An AI that can mentally play out what happens next is an AI that can plan, reason, and act safely in the real world. This is the missing piece for robotics and for true autonomy.
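That phrase, "mentally play out what happens next," is the whole mechanism, so here is a hedged sketch of it under toy assumptions: a made-up linear latent dynamics function stands in for a learned world model, and a simple random-shooting planner imagines rollouts in that latent space before picking an action. Nothing here is from Meta's work; it just shows why a predictive latent model makes planning possible at all.

```python
# Sketch: planning by imagination in a latent world model (toy assumptions).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8)) * 0.1        # stand-in latent dynamics
B = rng.normal(size=(8, 2)) * 0.5        # stand-in effect of actions

def predict_next(state, action):
    """World-model step in latent space: s' = f(s, a). Learned, in practice."""
    return state + A @ state + B @ action

def plan(state, goal, horizon=5, candidates=256):
    """Random-shooting planner: imagine rollouts, keep the best first action."""
    best_cost, best_action = np.inf, None
    for _ in range(candidates):
        actions = rng.normal(size=(horizon, 2))
        s = state
        for a in actions:
            s = predict_next(s, a)        # mental simulation, no real-world step
        cost = np.linalg.norm(s - goal)   # how far the imagined future is from the goal
        if cost < best_cost:
            best_cost, best_action = cost, actions[0]
    return best_action

state, goal = np.zeros(8), np.ones(8)
print("first action to take:", plan(state, goal))
```

The point of the sketch: every candidate future is evaluated purely inside the model's head, which is exactly what a language-only system has no machinery for.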
This quote from Sonia Joseph, one of the Meta AI researchers, really gets to why this is so hard and why the JEPA approach is so promising. A useful world model doesn't need to be a perfect physics simulator on a supercomputer. I mean, it's impossible to simulate every atom in a room just to predict if a cup will fall. We humans don't do that. We work with a simplified, intuitive model of physics. We get concepts like gravity and momentum at just the right level of abstraction to make good predictions. The hope is that by training JEPAs to predict abstract ideas instead of raw pixels, they can also learn to find this efficient, abstract level of understanding, capturing the important parts of physics without getting lost in the details.

But let's bring it back down to earth for our last section. As exciting as all this sounds, V-JEPA is not a magic wand. It is not a finished product, and it definitely has its flaws. To really get the full picture, we need to look at the criticisms and understand where this model is at today on the long road to smarter AI. So, when Meta put out those demo videos, a really fair criticism started popping up on places like Reddit. People would pause the videos and point out that, hey, the real-time text descriptions were often just wrong or made no sense. And that's a totally valid observation of how it performs right now. But focusing only on that kind of misses the whole point of the research. The rebuttal here is super important. This paper is not a product launch. It's a proof of concept. The goal wasn't to release a model that's 100% accurate on day one. The goal was to show that a non-generative, predictive approach can learn more efficiently and build better representations of the world than the other guys. It's about proving that this direction is a more promising path for the future, even if this first step is a little wobbly.

And bringing you these balanced perspectives, showing you the incredible, game-changing potential while also being honest about the real-world limitations: that's what we're all about here. We think the smartest take comes from understanding both the hype and the reality. If you appreciate that kind of nuance when we talk about tech, hitting that subscribe button is the absolute best way to support what we do and make sure you get the full, honest picture on these complex topics.

So, we end on this final, huge question: is V-JEPA, and this whole JEPA philosophy, a turning point in the story of AI? Is the best path to smarter, more robust, common-sense AI not through building even bigger language models that are better at faking human text, but instead through building smaller, more efficient models that get better at developing a genuine, predictive understanding of the world? This research argues that the answer is yes. It says we need to stop teaching AI to be clever talkers and start teaching them to be keen observers of reality. And even if this first step is flawed, it might just be the most important step taken in years. It points the entire field toward a new, and maybe, just maybe, a much better destination. The real question is: are they right? What do you think?