Yann Lecun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI | Lex Fridman Podcast #416
5t1vTLU7s40 • 2024-03-07
I see the danger of this concentration of power through proprietary AI systems as a much bigger danger than everything else. What works against this is people who think that, for reasons of security, we should keep AI systems under lock and key, because it's too dangerous to put them in the hands of everybody. That would lead to a very bad future in which all of our information diet is controlled by a small number of companies through proprietary systems. I believe that people are fundamentally good, and so if AI, especially open source AI, can make them smarter, it just empowers the goodness in humans.

So I share that feeling. Okay, I think people are fundamentally good. And in fact, a lot of doomers are doomers because they don't think that people are fundamentally good.

The following is a conversation with Yann LeCun, his third time on this podcast. He is the Chief AI Scientist at Meta, professor at NYU, Turing Award winner, and one of the seminal figures in the history of artificial intelligence. He and Meta AI have been big proponents of open sourcing AI development, and have been walking the walk by open sourcing many of their biggest models, including Llama 2 and, eventually, Llama 3. Also, Yann has been an outspoken critic of those people in the AI community who warn about the looming danger and existential threat of AGI. He believes that AGI will be created one day, but it will be good. It will not escape human control, nor will it dominate and kill all humans. At this moment of rapid AI development, this happens to be a somewhat controversial position, and so it's been fun seeing Yann get into a lot of intense and fascinating discussions online, as we do in this very conversation. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Yann LeCun.
You've had some strong statements, technical statements, about the future of artificial intelligence recently, throughout your career actually, but recently as well. You've said that autoregressive LLMs are not the way we're going to make progress towards superhuman intelligence. These are the large language models like GPT-4, like Llama 2 and 3 soon, and so on. How do they work, and why are they not going to take us all the way?

For a number of reasons. The first is that there are a number of characteristics of intelligent behavior. For example, the capacity to understand the world, understand the physical world; the ability to remember and retrieve things, persistent memory; the ability to reason; and the ability to plan. Those are four essential characteristics of intelligent systems or entities, humans, animals. LLMs can do none of those, or they can only do them in a very primitive way. They don't really understand the physical world, don't really have persistent memory, can't really reason, and certainly can't plan. So if you expect a system to become intelligent without having the possibility of doing those things, you're making a mistake. That is not to say that autoregressive LLMs are not useful, they're certainly useful; or that they're not interesting; or that we can't build a whole ecosystem of applications around them, of course we can. But as a path towards human-level intelligence, they're missing essential components.

And then there is another tidbit, or fact, that I think is very interesting. Those LLMs are trained on enormous amounts of text: basically, the entirety of all publicly available text on the internet. That's typically on the order of 10^13 tokens. Each token is typically two bytes, so that's 2 x 10^13 bytes as training data. It would take you or me 170,000 years to just read through this at eight hours a day. So it seems like an enormous amount of knowledge that those systems can accumulate. But then you realize it's really not that much data. If you talk to developmental psychologists, they tell you a four-year-old has been awake for 16,000 hours in his life, and the amount of information that has reached the visual cortex of that child in four years is about 10^15 bytes. You can compute this by estimating that the optic nerve carries about 20 megabytes per second, roughly. So: 10^15 bytes for a four-year-old, versus 2 x 10^13 bytes for 170,000 years' worth of reading. What this tells you is that through sensory input we see a lot more information than we do through language, and that, despite our intuition, most of what we learn and most of our knowledge comes through our observation of and interaction with the real world, not through language. Everything that we learn in the first few years of life, and certainly everything that animals learn, has nothing to do with language.
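These figures check out as rough orders of magnitude. Here is a quick back-of-the-envelope verification; the reading speed and tokens-per-word rates below are my assumptions, while the 20 MB/s optic-nerve figure and two bytes per token come from the conversation:

```python
# Sanity check of the data-volume comparison above.
text_tokens = 1e13                     # ~all public text, per the conversation
text_bytes = text_tokens * 2           # two bytes per token -> 2e13 bytes

# Reading time: assume ~250 words/minute and ~1.3 tokens per word.
reading_hours = text_tokens / 1.3 / 250 / 60
reading_years = reading_hours / 8 / 365            # at eight hours a day
print(f"{text_bytes:.0e} bytes, ~{reading_years:,.0f} years to read")  # ~176,000

# Visual input reaching a four-year-old's cortex.
visual_bytes = 16_000 * 3600 * 20e6    # 16,000 awake hours at ~20 MB/s
print(f"{visual_bytes:.1e} bytes of visual input")  # ~1.2e15 bytes
```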
So it would be good to maybe push against some of the intuition behind what you're saying. It is true there are several orders of magnitude more data coming into the human mind, much faster, and the human mind is able to learn very quickly from that, filter the data very quickly. But somebody might argue, about your comparison between sensory data and language, that language is already very compressed: it already contains a lot more information than the bytes it takes to store it, if you compare it to visual data. There's a lot of wisdom in language, in the words and the way we stitch them together; it already contains a lot of information. So is it possible that language alone already has enough wisdom and knowledge in there to be able, from that language, to construct a world model, an understanding of the world, an understanding of the physical world that you're saying LLMs lack?

So it's a big debate among philosophers, and also cognitive scientists, whether intelligence needs to be grounded in reality. I'm clearly in the camp that, yes, intelligence cannot appear without some grounding in some reality. It doesn't need to be physical reality; it could be simulated. But the environment is just much richer than what you can express in language. Language is a very approximate representation of our percepts and our mental models, right? There are a lot of tasks that we accomplish where we manipulate a mental model of the situation at hand, and that has nothing to do with language. Everything that's physical, mechanical, whatever: when we build something, when we accomplish a motor task, you know, grabbing something, etc., we plan our action sequences, and we do this by essentially imagining the result, the outcome, of a sequence of actions. And that requires mental models that don't have much to do with language. That, I would argue, is where most of our knowledge comes from: interaction with the physical world. So a lot of my colleagues who are more interested in things like computer vision are really in that camp, that AI needs to be embodied, essentially. And then other people coming from the NLP side, or maybe some other motivation, don't necessarily agree with that. Philosophers are split as well.

And the complexity of the world is hard to imagine. It's hard to represent all the complexities that we take completely for granted in the real world, that we don't even imagine require intelligence, right? This is the old Moravec's paradox, from Hans Moravec, the pioneer of robotics, who said: how is it that with computers it seems to be easy to do high-level complex tasks like playing chess and solving integrals and doing things like that, whereas the things we take for granted that we do every day, like, I don't know, learning to drive a car, or grabbing an object, we can't do with computers? We have LLMs that can pass the bar exam, so they must be smart. But then they can't learn to drive in 20 hours like any 17-year-old, and they can't learn to clear out the dinner table and fill up the dishwasher like any 10-year-old can learn in one shot. Why is that? What are we missing? What type of learning or reasoning architecture or whatever are we missing that basically prevents us from having level-five self-driving cars and domestic robots?

Can a large language model construct a world model that does know how to drive and does know how to fill a dishwasher, but just doesn't know how to deal with visual data at this time, so it can operate in a space of concepts?
Yeah, that's what a lot of people are working on. So the short answer is no. And the more complex answer is: you can use all kinds of tricks to get an LLM to basically digest visual representations of images, or video, or audio for that matter. A classical way of doing this is: you train a vision system in some way, and we have a number of ways to train vision systems, either supervised, semi-supervised, self-supervised, all kinds of different ways, that will turn any image into a high-level representation, basically a list of tokens that are really similar to the kind of tokens that a typical LLM takes as input. And then you just feed that to the LLM, in addition to the text, and you just expect the LLM, during training, to be able to use those representations to help make decisions. There's been work along those lines for quite a long time, and now you see those systems, right? There are LLMs that have some vision extension, but they're basically hacks, in the sense that those things are not trained end-to-end to really understand the world. They're not trained with video, for example, and they don't really understand intuitive physics, at least not at the moment.
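To make the "hack" concrete, here is a condensed sketch of that pattern: a separately trained vision encoder turns an image into token-like vectors that get concatenated with the text embeddings before they reach the LLM. This is illustrative PyTorch under assumed module names and shapes, not any particular released system:

```python
import torch
import torch.nn as nn

class VisionToTokens(nn.Module):
    """Project a vision encoder's patch features into an LLM's token space."""
    def __init__(self, vision_encoder, vis_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder          # trained separately
        self.proj = nn.Linear(vis_dim, llm_dim)       # map into LLM embedding space

    def forward(self, image, text_embeds):
        feats = self.vision_encoder(image)            # (B, n_patches, vis_dim)
        vis_tokens = self.proj(feats)                 # (B, n_patches, llm_dim)
        # Prepend image "tokens" to the text embeddings; the LLM then
        # consumes the combined sequence as if it were all text.
        return torch.cat([vis_tokens, text_embeds], dim=1)
```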
So you don't think there's something special about intuitive physics, about sort of common-sense reasoning about the physical space, about physical reality, that to you is a giant leap that LLMs are just not able to do?

We're not going to be able to do this with the type of LLMs that we are working with today, and there are a number of reasons for this. But the main reason is: the way LLMs are trained is that you take a piece of text, you remove some of the words in that text, you mask them, you replace them with blank markers, and you train a gigantic neural net to predict the words that are missing. And if you build this neural net in a particular way, so that it can only look at words that are to the left of the one it's trying to predict, then what you have is a system that basically is trying to predict the next word in a text, right? So then you can feed it a text, a prompt, and ask it to predict the next word. It can never predict the next word exactly, so what it's going to do is produce a probability distribution over all the possible words in your dictionary. In fact, it doesn't predict words, it predicts tokens, which are kind of subword units, and so it's easy to handle the uncertainty in the prediction there, because there's only a finite number of possible words in the dictionary and you can just compute a distribution over them. Then what the system does is pick a word from that distribution. Of course, there's a higher chance of picking words that have a higher probability within that distribution, so you sample from that distribution to actually produce a word, and then you shift that word into the input. That allows the system to then predict the second word, and once you do this, you shift it into the input, etc. That's called autoregressive prediction, which is why those LLMs should be called autoregressive LLMs, but we just call them LLMs.
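A minimal sketch of that loop (assumed PyTorch-style interface; `model` stands for any next-token predictor returning per-position logits, not a specific released system):

```python
import torch

def generate(model, prompt_ids, n_new, temperature=1.0):
    """Autoregressive sampling: predict a distribution over the next
    token, sample from it, shift the result into the input, repeat."""
    ids = prompt_ids                                  # shape (1, seq_len)
    for _ in range(n_new):
        logits = model(ids)[:, -1, :]                 # logits for next token only
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample, don't argmax
        ids = torch.cat([ids, next_id], dim=1)        # shift the token into the input
    return ids
```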
And there is a difference between this kind of process and a process by which, before producing a word, when you and I talk, and we're both bilingual, we think about what we're going to say, and it's relatively independent of the language in which we're going to say it. When we talk about, I don't know, let's say a mathematical concept or something, the kind of thinking that we're doing, and the answer that we're planning to produce, is not linked to whether we're going to say it in French or Russian or English.

Chomsky just rolled his eyes, but I understand. So you're saying that there's a bigger abstraction, a representation, that goes before language and maps onto language?

Right. It's certainly true for a lot of the thinking that we do.

Is that obvious? You're saying your thinking is the same in French as it is in English?

Yeah, pretty much.

Pretty much? Or how flexible are you, if there's a probability distribution...

Well, it depends what kind of thinking, right? If it's producing puns, I get much better in French than English about that.

So is there an abstract representation of puns? Like, is your humor abstract? When you tweet, and your tweets are sometimes a little bit spicy, is there an abstract representation in your brain of a tweet before it maps onto English?

There is an abstract representation of imagining the reaction of a reader to that text.

Or you start with laughter and then figure out how to make that happen?

Or figure out a reaction you want to cause, and then figure out how to say it so that it causes that reaction. But that's really close to language. Think about a mathematical concept, or imagining something you want to build out of wood, or something like that. The kind of thinking you're doing has absolutely nothing to do with language, really. It's not like you necessarily have an internal monologue in any particular language; you're imagining mental models of the thing, right? If I ask you to imagine what this water bottle will look like if I rotate it 90 degrees, that has nothing to do with language. And so clearly there is a more abstract level of representation in which we do most of our thinking, and we plan what we're going to say, if the output is uttered words as opposed to the output being muscle actions. We plan our answer before we produce it.

And LLMs don't do that; they just produce one word after the other, instinctively if you want. It's a bit like subconscious actions: you're distracted, you're doing something, you're completely concentrated, and someone comes to you and asks you a question, and you kind of answer the question. You don't have time to think about the answer, but the answer is easy, so you don't need to pay attention; you sort of respond automatically. That's kind of what an LLM does, right? It doesn't think about its answer, really. It retrieves it, because it's accumulated a lot of knowledge, so it can retrieve some things, but it's going to just spit out one token after the other without planning the answer.
But you're making it sound like one-token-after-the-other, one-token-at-a-time generation is bound to be simplistic. If the world model is sufficiently sophisticated, then that one token at a time, the most likely sequence of tokens it generates, is going to be a deeply profound thing.

Okay, but then that assumes that those systems actually possess an internal world model.

So it really goes to the fundamental question: can you build a really complete world model, not complete, but one that has a deep understanding of the world?

Yeah. So, can you build this, first of all, by prediction? The answer is probably yes. Can you build it by predicting words? The answer is most probably no, because language is very poor in terms of, weak, or low-bandwidth if you want: there's just not enough information there. So building world models means observing the world, and understanding why the world is evolving the way it is. And then the extra component of a world model is something that can predict how the world is going to evolve as a consequence of an action you might take. So a world model really is: here is my idea of the state of the world at time t, here is an action I might take, what is the predicted state of the world at time t+1? Now, that state of the world does not need to represent everything about the world; it just needs to represent enough that's relevant for the planning of the action, but not necessarily all the details.
Now, here is the problem: you're not going to be able to do this with generative models. A generative model trained on video, and we've tried to do this for 10 years: you take a video, show a system a piece of video, and then ask it to predict the remainder of the video; basically, predict what's going to happen one frame at a time. Do the same thing as sort of the autoregressive LLMs do, but for video, either one frame at a time or a group of frames at a time: a large video model, if you want. The idea of doing this has been floating around for a long time, and at FAIR some colleagues and I have been trying to do it for about 10 years. And you can't really do the same trick as with LLMs, because, as I said, with LLMs you can't predict exactly which word is going to follow a sequence of words, but you can predict a distribution over words. Now, if you go to video, what you would have to do is predict a distribution over all possible frames in a video, and we don't really know how to do that properly. We do not know how to represent distributions over high-dimensional continuous spaces in ways that are useful. There lies the main issue. And the reason we can't do this is that the world is incredibly more complicated and richer in terms of information than text. Text is discrete; video is high-dimensional and continuous, with a lot of details in it.
So if I take a video of this room, and the video is a camera panning around, there is no way I can predict everything that's going to be in the room as I pan around. The system cannot predict what's going to be in the room as the camera is panning. Maybe it's going to predict that this is a room where there's a light and there's a wall and things like that; it can't predict what the painting on the wall looks like, or what the texture of the couch looks like, and certainly not the texture of the carpet. There's no way I can predict all those details.

One way possibly to handle this, which we've been working on for a long time, is to have a model that has what's called a latent variable. The latent variable is fed to a neural net, and it's supposed to represent all the information about the world that you don't perceive yet, and that you need to augment the system with for the prediction to do a good job at predicting pixels, including the fine texture of the carpet and the couch and the painting on the wall.

That has been a complete failure, essentially. We've tried lots of things: we tried straight neural nets, we tried GANs, we tried VAEs, all kinds of regularized autoencoders. We also tried those kinds of methods to learn good representations of images or video that could then be used as input to, for example, an image classification system, and that also basically failed. All the systems that attempt to predict missing parts of an image or video from a corrupted version of it, basically: take an image or a video, corrupt it or transform it in some way, then try to reconstruct the complete video or image from the corrupted version, and then hope that internally the system will develop good representations of images that you can use for object recognition, segmentation, whatever. That has been essentially a complete failure. And it works really well for text; that's the principle that is used for LLMs.

So where is the failure exactly? Is it that it's very difficult to form a good representation of an image, a good embedding of all the important information in the image? Is it in terms of the consistency of image to image to image, the images that form the video? If we do a highlight reel of all the ways you've failed, what does that look like?
Okay. So the reason this doesn't work: first of all, I have to tell you exactly what doesn't work, because there is something else that does work. The thing that does not work is training a system to learn representations of images by training it to reconstruct a good image from a corrupted version of it. That's what doesn't work. And we have a whole slew of techniques for this that are variants of denoising autoencoders, something called MAE, developed by some of my colleagues at FAIR, masked autoencoder. It's basically like LLMs or things like that, where you train the system by corrupting text, except you corrupt images: you remove patches from them, and you train a gigantic neural net to reconstruct. The features you get are not good, and you know they're not good because if you now train the same architecture, but you train it supervised, with labeled data, with textual descriptions of images, etc., you do get good representations, and the performance on recognition tasks is much better than if you do this self-supervised pre-training.

So the architecture is good?

The architecture of the encoder is good. But the fact that you train the system to reconstruct images does not lead it to learn good, generic features of images when you train it in a self-supervised way.

Self-supervised by reconstruction?

Yeah, by reconstruction.
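For contrast with what follows, here is a minimal sketch of the reconstruction-style objective being criticized (illustrative PyTorch; `encoder` and `decoder` are generic placeholder modules, not FAIR's MAE code):

```python
import torch.nn.functional as F

def reconstruction_step(encoder, decoder, image, mask):
    """Masked reconstruction: corrupt the input, then score the model
    on reproducing the missing pixels themselves."""
    corrupted = image * mask                   # zero out the masked patches
    recon = decoder(encoder(corrupted))
    # Loss in pixel space: the model must account for every detail,
    # which is the part that reportedly fails to yield good features.
    return F.mse_loss(recon * (1 - mask), image * (1 - mask))
```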
Okay, so what's the alternative? The alternative is joint embedding.

What is joint embedding? What are these architectures that you're so excited about?

Okay, so now, instead of training a system to encode the image and then training it to reconstruct the full image from a corrupted version, you take the full image, you take the corrupted or transformed version, and you run them both through encoders, which in general are identical, but not necessarily. And then you train a predictor on top of those encoders to predict the representation of the full input from the representation of the corrupted one.

Okay, so joint embedding, because you're taking the full input and the corrupted or transformed version, running them both through encoders, so you get a joint embedding. And then you're saying: can I predict the representation of the full one from the representation of the corrupted one?

Okay. And I call this a JEPA: that means Joint Embedding Predictive Architecture, because it's joint embedding, and there is this predictor that predicts the representation of the good guy from the bad guy.
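A skeletal sketch of that forward pass (illustrative PyTorch; module names are assumptions, not the released JEPA code):

```python
import torch.nn.functional as F

def jepa_forward(encoder, predictor, x_full, x_corrupt):
    """Joint embedding: encode both views, then predict the representation
    of the full input from the representation of the corrupted one."""
    s_full = encoder(x_full)        # representation of the "good guy"
    s_corr = encoder(x_corrupt)     # representation of the "bad guy"
    pred = predictor(s_corr)
    # The prediction error lives in representation space, never pixel space.
    # Minimized naively, this collapses (the encoder ignores its input), which
    # is why the contrastive or distillation tricks discussed next are needed.
    return F.mse_loss(pred, s_full)
```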
And the big question is: how do you train something like this? Until five or six years ago, we didn't have particularly good answers for how you train those things, except for one, called contrastive learning. The idea of contrastive learning is: you take a pair of images that are, again, an image and a corrupted, degraded, or transformed version of the original one, and you train the predicted representation to be the same as the representation of the original. If you only do this, the system collapses: it basically completely ignores the input and produces representations that are constant. So the contrastive methods avoid this, and those things have been around since the early 90s; I had a paper on this in 1993. You also show pairs of images that you know are different, and then you push their representations away from each other. So you say: not only should representations of things that we know are the same be the same, or similar, but representations of things that we know are different should be different. And that prevents the collapse, but it has some limitations. There's a whole bunch of techniques that have appeared over the last six or seven years that can revive this type of method, some of them from FAIR, some of them from Google and other places, but there are limitations to those contrastive methods.

What has changed in the last three or four years is that now we have methods that are non-contrastive, so they don't require those negative contrastive samples of images that we know are different. You train them only with images that are different versions, or different views, of the same thing, and you rely on some other tricks to prevent the system from collapsing. And we have a dozen different methods for this now.
So what is the fundamental difference between joint embedding architectures and LLMs? Can JEPA take us to AGI? Whereby I should say that you don't like the term AGI, and we'll probably argue; I think every single time I've talked to you, we've argued about the G in AGI.

Yes.

I get it, I get it. We'll probably continue to argue about it; it's great. You like AMI, because you like French, and ami, I guess, is "friend" in French.

Yes. And AMI stands for Advanced Machine Intelligence.

But either way, can JEPA take us towards that advanced machine intelligence?

Well, it's a first step. Okay, so first of all, what's the difference with generative architectures like LLMs? LLMs, or vision systems that are trained by reconstruction, generate the inputs, right? They generate the original input that is non-corrupted, non-transformed, so you have to predict all the pixels, and there is a huge amount of resources spent in the system on actually predicting all those pixels, all the details. In a JEPA, you're not trying to predict all the pixels; you're only trying to predict an abstract representation of the inputs, and that's much easier in many ways. So what the JEPA system, when it's being trained, is trying to do is extract as much information as possible from the input, but yet only extract information that is relatively easily predictable.
Okay, so there are a lot of things in the world that we cannot predict. For example, if you have a self-driving car driving down the street or road, there may be trees around the road, and it could be a windy day, so the leaves on the trees are moving in semi-chaotic, random ways that you can't predict, and you don't care, you don't want to predict. So what you want is for your encoder to basically eliminate all those details. It will tell you there are moving leaves, but it's not going to keep the details of exactly what's going on. So when you do the prediction in representation space, you're not going to have to predict every single pixel of every leaf. Not only is that a lot simpler, but it also allows the system to essentially learn an abstract representation of the world, where what can be modeled and predicted is preserved, and the rest is viewed as noise and eliminated by the encoder. So it kind of lifts the level of abstraction of the representation.

If you think about this, it's something we do absolutely all the time. Whenever we describe a phenomenon, we describe it at a particular level of abstraction. We don't describe every natural phenomenon in terms of quantum field theory, right? That would be impossible. So we have multiple levels of abstraction to describe what happens in the world, starting from quantum field theory, to atomic theory and molecules, chemistry, materials, all the way up to concrete objects in the real world and things like that. So we can't just model everything at the lowest level, and that's what the idea of JEPA is really about: learning abstract representations in a self-supervised manner, and you can do it hierarchically as well. That, I think, is an essential component of an intelligent system. In language, we can get away without doing this, because language is already, to some level, abstract, and has already eliminated a lot of information that is not predictable. So we can get away without lifting the abstraction level, by directly predicting words.
So joint embedding is still generative, but it's generative in this abstract representation space?

Yeah.

And you're saying we were lazy with language, because we already got the abstract representation for free, and now we have to zoom out, actually think about generally intelligent systems: we have to deal with the full mess of physical reality. And you do have to do this step of jumping from the full, rich, detailed reality to an abstract representation of that reality, based on which you can then reason and all that kind of stuff?

Right. And the thing is, those self-supervised algorithms that learn by prediction, even in representation space, learn more concepts if the input data you feed them is more redundant. The more redundancy there is in the data, the more they're able to capture some internal structure of it. And there is way more redundancy and structure in perceptual inputs, sensory input like vision, than there is in text, which is not nearly as redundant. This is back to the question you were asking a few minutes ago: language might represent more information, really, because it's already compressed. You're right about that, but that means it's also less redundant, and so self-supervision will not work as well.
Is it possible to join the self-supervised training on visual data and the self-supervised training on language data? There is a huge amount of knowledge, even though you talk down about those 10^13 tokens. Those 10^13 tokens represent the entirety, a large fraction, of what us humans have figured out: both the talk on Reddit and the contents of all the books and the articles, the full spectrum of human intellectual creation. So is it possible to join those two together?

Well, eventually, yes, but I think if we do this too early, we run the risk of being tempted to cheat. And in fact, that's what people are doing at the moment with vision-language models: we're basically cheating. We're using language as a crutch to help the deficiencies of our vision systems, to kind of learn good representations from images and video. And the problem with this is that we might improve our vision-language systems a bit, I mean our language models, by feeding them images, but we're not going to get to the level of even the intelligence, or the level of understanding of the world, of a cat or a dog, which doesn't have language. They don't have language, and they understand the world much better than any LLM. They can plan really complex actions and sort of imagine the result of a bunch of actions. How do we get machines to learn that before we combine that with language? Obviously, if we combine this with language, this is going to be a winner, but before that, we have to focus on: how do we get systems to learn how the world works?

So this kind of joint embedding predictive architecture, for you, is going to be able to learn something like common sense, something like what a cat uses to predict how to mess with its owner most optimally by knocking over a thing?
That's the hope. In fact, the techniques we're using are non-contrastive, so not only is the architecture non-generative, the learning procedures we're using are non-contrastive. We have two sets of techniques. One set is based on distillation, and there's a number of methods that use this principle: one by DeepMind called BYOL, a couple by FAIR, one called VICReg and another one called I-JEPA. And VICReg, I should say, is not a distillation method, actually, but I-JEPA and BYOL certainly are. There's another one, also called DINO, also produced at FAIR. The idea of those things is that you take the full input, let's say an image, you run it through an encoder, which produces a representation, and then you corrupt that input, or transform it, and run it through essentially what amounts to the same encoder, with some minor differences. And then you train a predictor, sometimes the predictor is very simple, sometimes it doesn't exist, but you train a predictor to predict the representation of the first, uncorrupted, input from the corrupted input. But you only train the second branch; you only train the part of the network that is fed with the corrupted input. The other network, you don't train. But since they share the same weights, when you modify the first one, it also modifies the second one. And with various tricks, you can prevent the system from collapsing, with the collapse of the type I was explaining before, where the system basically ignores the input. So that works very well; the two techniques we developed at FAIR, DINO and I-JEPA, work really well for that.
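A compressed sketch of the distillation trick just described: only the corrupted branch receives gradients, while the target branch shares, or slowly tracks, the same weights. This is BYOL/I-JEPA-flavored pseudocode under assumed module names (the teacher would start as a copy of the student), not any paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, predictor, x, corrupt, momentum=0.99):
    """One training step: predict the clean representation from the
    corrupted view; only the student branch is trained."""
    with torch.no_grad():                         # target branch gets no gradient
        target = teacher(x)                       # representation of the clean input
    pred = predictor(student(corrupt(x)))         # only this path is backpropagated
    loss = F.mse_loss(pred, target)
    loss.backward()
    # The teacher tracks the student as an exponential moving average,
    # one common way to "share the same weights with minor differences".
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)
    return loss
```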
So what kind of data are we talking about here?

There are several scenarios. One scenario is you take an image and you corrupt it by changing the cropping, for example, changing the size a little bit, maybe changing the orientation, blurring it, changing the colors, doing all kinds of horrible things to it.

But basic horrible things?

Basic horrible things that sort of degrade the quality a little bit and change the framing, crop the image. In some cases, in the case of I-JEPA, you don't need to do any of this; you just mask some parts of it. You basically remove some regions, like a big block, essentially, and then run them through the encoders and train the entire system, encoder and predictor, to predict the representation of the good one from the representation of the corrupted one. So I-JEPA doesn't need to know that it's an image, for example, because the only thing it needs to know is how to do this masking, whereas with DINO, you need to know it's an image, because you need to do things like geometric transformations and blurring and things like that, that are really image-specific.
A more recent version of this that we have is called V-JEPA. It's basically the same idea as I-JEPA, except it's applied to video. So now you take a whole video and you mask a whole chunk of it, and what we mask is actually kind of a temporal tube: a whole segment of each frame in the video, over the entire video, and that tube is statically positioned throughout the frames.

Literally a straight tube?

The tube, yeah. Typically it's 16 frames or something, and we mask the same region over the entire 16 frames; it's a different one for every video, obviously. And then, again, you train that system so as to predict the representation of the full video from the partially masked video. That works really well. It's the first system that we have that learns good representations of video, so that when you feed those representations to a supervised classifier head, it can tell you what action is taking place in the video with pretty good accuracy. It's the first time we get something of that quality.

So that's a good test that a good representation is formed. That means there's something to this.

Yeah. We also have preliminary results that seem to indicate that the representations allow our system to tell whether a video is physically possible or completely impossible, because some object disappeared, or an object suddenly jumped from one location to another, or changed shape, or something.

So it's able to capture some physics-based constraints about the reality represented in the video?

Yeah, about the appearance and the disappearance of objects.

Yeah, that's cool.
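A tiny sketch of the tube masking described (illustrative NumPy; the 16-frame figure comes from the conversation, the rest of the numbers are assumptions):

```python
import numpy as np

def tube_mask(video, top, left, h, w):
    """Zero out the same spatial block in every frame: a static
    'tube' through time. video has shape (T, H, W, C)."""
    masked = video.copy()
    masked[:, top:top + h, left:left + w, :] = 0.0   # same region, all frames
    return masked

clip = np.random.rand(16, 224, 224, 3)               # dummy 16-frame clip
corrupted = tube_mask(clip, top=64, left=64, h=96, w=96)
```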
Okay, but can this actually get us to the kind of world model that understands enough about the world to be able to drive a car?

Possibly. This is going to take a while before we get to that point, but there are systems already, you know, that are based on this idea. What you need for this is a slightly modified version of it. Imagine that you have a complete video, and what you're doing to this video is either translating it in time towards the future, so you only see the beginning of the video but you don't see the latter part of it that is in the original one, or you just mask the second half of the video, for example. Then you train a JEPA system of the type I described to predict the representation of the full video from the shifted one, but you also feed the predictor with an action. For example, the wheel is turned 10 degrees to the right, or something. So if it's a dash cam in a car and you know the angle of the wheel, you should be able to predict, to some extent, what's going to happen to the scene. You're not going to be able to predict all the details of objects that appear in the view, obviously, but at an abstract representation level, you can probably predict what's going to happen.
So now what you have is an internal model that says: here is my idea of the state of the world at time t, here is an action I'm taking, here's a prediction of the state of the world at time t+1, t plus delta t, t plus two seconds, whatever it is. If you have a model of this type, you can use it for planning. So now you can do what LLMs cannot do, which is plan what you're going to do so as to arrive at a particular outcome or satisfy a particular objective. And you can have a number of objectives. I can predict that if I have an object like this and I open my hand, it's going to fall, right? And if I push it with a particular force on the table, it's going to move. If I push the table itself, it's probably not going to move with the same force. We have this internal model of the world in our minds, which allows us to plan sequences of actions to arrive at a particular goal. So now, if you have this world model, we can imagine a sequence of actions, predict what the outcome of that sequence of actions is going to be, measure to what extent the final state satisfies a particular objective, like moving the bottle to the left of the table, and then plan a sequence of actions that will minimize this objective at runtime. We're not talking about learning; we're talking about inference time. So this is planning, really. And in optimal control, this is a very classical thing; it's called model predictive control. You have a model of the system you want to control that can predict the sequence of states corresponding to a sequence of commands, and you're planning a sequence of commands so that, according to your world model, the end state of the system will satisfy an objective that you fix. This is the way rocket trajectories have been planned since computers have been around, so since the early 60s, essentially.
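A bare-bones sketch of model predictive control with a learned world model, using random-shooting search as one of the simplest possible planners (all interfaces here are assumptions; `world_model` and `cost` are placeholders, not any released system):

```python
import numpy as np

def plan(world_model, cost, s0, horizon=10, n_candidates=256, action_dim=2):
    """Sample candidate action sequences, roll each through the world
    model s[t+1] = f(s[t], a[t]), and keep the cheapest final state."""
    best_seq, best_cost = None, np.inf
    for _ in range(n_candidates):
        actions = np.random.uniform(-1, 1, size=(horizon, action_dim))
        s = s0
        for a in actions:
            s = world_model(s, a)          # predicted next state
        c = cost(s)                        # distance of final state from the goal
        if c < best_cost:
            best_seq, best_cost = actions, c
    return best_seq                        # execute the first action, then replan
```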
So yes, for model predictive control. But you also often talk about hierarchical planning. Can hierarchical planning emerge from this somehow?

Well, no. You will have to build a specific architecture to allow for hierarchical planning. Hierarchical planning is absolutely necessary if you want to plan complex actions. If I want to go from, let's say, New York to Paris, the example I use all the time, and I'm sitting in my office at NYU, the objective that I need to minimize is my distance to Paris, at a high level, a very abstract representation of my location. I would have to decompose this into two subgoals. The first one is: go to the airport. The second one is: catch a plane to Paris. Okay, so my subgoal is now going to the airport; my objective function is my distance to the airport. How do I go to the airport? Well, I have to go into the street and hail a taxi, which you can do in New York. Okay, now I have another subgoal: go down to the street. That means going to the elevator, going down the elevator, walking out to the street. How do I go to the elevator? I have to stand up from my chair, open the door of my office, go to the elevator, push the button. How do I get up from my chair? You can imagine going down, all the way down, to basically what amounts to millisecond-by-millisecond muscle control. And obviously, you're not going to plan your entire trip from New York to Paris in terms of millisecond-by-millisecond muscle control. First, that would be incredibly expensive, but it would also be completely impossible, because you don't know all the conditions of what's going to happen: how long it's going to take to catch a taxi, or to go to the airport with traffic. You would have to know exactly the condition of everything to be able to do this planning, and you don't have the information. So you have to do this hierarchical planning, so that you can start acting and then sort of replan as you go. And nobody really knows how to do this in AI. Nobody knows how to train a system to learn the appropriate multiple levels of representation so that hierarchical planning works.
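A toy sketch of the recursive decomposition being described (purely illustrative; the `decompose` function, mapping a goal to subgoals at the next level down, is exactly the piece nobody knows how to learn):

```python
def hierarchical_plan(goal, decompose, is_primitive, depth=0, max_depth=5):
    """Recursively split a goal ("go to Paris") into subgoals ("go to
    the airport", "catch a plane") until reaching directly executable
    actions; replanning re-invokes this as conditions change."""
    if is_primitive(goal) or depth == max_depth:
        return [goal]
    plan = []
    for subgoal in decompose(goal):
        plan += hierarchical_plan(subgoal, decompose, is_primitive,
                                  depth + 1, max_depth)
    return plan
```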
Does something like that already emerge? Like, can you use a state-of-the-art LLM to get you from New York to Paris by doing exactly the kind of detailed set of questions that you just did? Which is: can you give me a list of 10 steps I need to do to get from New York to Paris? And then, for each of those steps, can you give me a list of 10 steps to make that step happen? And for each of those, a list of 10 steps to make each one of those happen, until you're moving your individual muscles? Or maybe not individual muscles, but whatever you can actually act upon using your mind.

Right, so there are a lot of questions that are sort of implied by this. The first thing is: LLMs will be able to answer some of those questions down to some level of abstraction, under the condition that they've been trained with similar scenarios in their training set.

They would be able to answer all of those questions, but some of them may be hallucinated, meaning non-factual.

Yeah, true. I mean, they will probably produce some answer, except they're not going to be able to really produce millisecond-by-millisecond muscle control of how you stand up from your chair, right? So down to some level of abstraction where things can be described by words, they might be able to give you a plan, but only under the condition that they've been trained to produce those kinds of plans. They're not going to be able to plan for situations that they never encountered before; they basically are going to have to regurgitate the template that they've been trained on.
But where, just for the example of New York to Paris, is it going to start getting into trouble? At which layer of abstraction do you think it'll start breaking? Because I can imagine almost every single part of that, an LLM will be able to answer somewhat accurately, especially when you're talking about New York and Paris, major cities.

Certainly, an LLM would be able to solve that problem if you fine-tune it for it. So I can't say that an LLM cannot do this; it can do this if you train it for it, there's no question, down to a certain level where things can be formulated in terms of words. But if you want to go down to, like, how do you climb down the stairs, or just stand up from your chair, in terms of words, you can't do it. You need the experience of the physical world, which is much higher bandwidth than what you can express in words, in human language.

So everything we've been talking about on the joint embedding space, is it possible that that's what we need for the interaction with physical reality, on the robotics front, and then the LLMs are the thing that sits on top of it for the bigger reasoning, about, yeah, the fact that I need to book a plane ticket, and I need to know how to go to the websites, and so on?

Sure. And a lot of plans that people know about, that are relatively high-level, are actually learned. Most people don't invent plans by themselves. We have some ability to do this, of course, obviously, but most plans that people use are plans that they've been trained on: they've seen other people use those plans, or they've been told how to do things, right? Take a person who's never heard of airplanes and tell them, how do you go from New York to Paris? They're probably not going to be able to deconstruct the whole plan unless they've seen examples of that before. So certainly LLMs are going to be able to do this. But then, how do you link this with the low level of actions that needs to be done? With things like JEPA, that basically lift the abstraction level of the representation without attempting to reconstruct every detail of the situation. That's what we need JEPAs for.
I would love to sort of linger on your skepticism around autoregressive LLMs. So one way I would like to test that skepticism is: everything you say makes a lot of sense, but if I apply everything you said today, and in general, to, I don't know, 10 years ago, maybe a little bit less, no, let's say three years ago, I wouldn't be able to predict the success of LLMs. So does it make sense to you that autoregressive LLMs are able to be so damn good? Can you explain your intuition? Because if I were to take your wisdom and intuition at face value, I would say there's no way autoregressive LLMs, one token at a time, would be able to do the kind of things they're doing.
No, but there's one thing that autoregressive LLMs, or LLMs in general, not just the autoregressive ones but including the BERT-style bidirectional ones, are exploiting, and it's self-supervised learning. And I've been a very, very strong advocate of self-supervised learning for many years. So those things are an incredibly impressive demonstration that self-supervised learning actually works. It didn't start with BERT, but that was really kind of a good demonstration of it. The idea that you take a piece of text, you corrupt it, and then you train some gigantic neural net to reconstruct the parts that are missing, that has produced an enormous amount of benefits. It allowed us to create systems that understand language, systems that can translate hundreds of languages in any direction, systems that are multilingual: a single system that can be trained to understand hundreds of languages and translate in any direction, and produce summaries, and then answer questions and produce text.

And then there's a special case of it, which is the autoregressive trick, where you constrain the system to not elaborate a representation of the text from looking at the entire text, but to only predict a word from the words that come before. You do this by constraining the architecture of the network, and that's how you can build an autoregressive LLM.
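The architectural constraint mentioned is usually implemented as a causal attention mask; a minimal sketch (assumed PyTorch):

```python
import torch

def causal_mask(seq_len):
    """Position i may only attend to positions <= i, i.e. the words to
    its left. This is what turns a transformer into a next-word predictor."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```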
So there was a surprise, many years ago, with what's called decoder-only LLMs: systems of this type that are just trying to produce words from the previous ones. The fact that when you scale them up, and train them on lots of data and make them really big, they tend to really understand more about language, that was kind of a surprise. And that surprise occurred quite a while back, with work from Google, Meta, OpenAI, etc., going back to the GPT kind of work, generative pre-trained transformers.

You mean like GPT-2? Like, there's a certain place where you start to realize scaling might actually keep giving us an emergent benefit?

Yeah, I mean, there was work from various places, but if you want to place it in the GPT timeline, that would be around GPT-2, yeah.

Well, I just, because you said it, you're so charismatic, and you said so many words, but self-supervised learning, yes. But again, the same intuition you're applying to saying that autoregressive LLMs cannot have a deep understanding of the world: if we just apply that same intuition, does it make sense to you that they're able to form enough of a representation of the world to be damn convincing, essentially passing the original Turing test with flying colors?

Well, we're fooled by their fluency, right? We just assume that if a system is fluent in manipulating language, then it has all the characteristics of human intelligence. But that impression is false. We're really fooled by it.

What do you think Alan Turing would say, without understanding anything, just hanging out with it?

Alan Turing would decide that the Turing test is a really bad test. Okay? This is what the AI community has decided many years ago: that the Turing test was a really bad test of intelligence.

What would Hans Moravec say about the large language models?

Hans Moravec would say the Moravec paradox still applies.

Okay, okay. You don't think he would be really impressed?

No, of course everybody would be impressed. But it's not a question of being impressed or not; it's a question of knowing what the limits of those systems are, what they can do. Again, they are impressive, they can do a lot of useful tasks.