Transcript
5t1vTLU7s40 • Yann Lecun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI | Lex Fridman Podcast #416
Kind: captions
Language: en
I see the danger of this concentration of power through proprietary AI systems as a much bigger danger than everything else. What works against this is people who think that, for reasons of security, we should keep AI systems under lock and key, because it's too dangerous to put them in the hands of everybody. That would lead to a very bad future, in which all of our information diet is controlled by a small number of companies through proprietary systems.

I believe that people are fundamentally good, and so if AI, especially open source AI, can make them smarter, it just empowers the goodness in humans.

So I share that feeling. I think people are fundamentally good, and in fact a lot of doomers are doomers because they don't think that people are fundamentally good.

The following is a conversation with Yann LeCun, his third time on this podcast. He is the chief AI scientist at Meta, professor at NYU, Turing Award winner, and one of the seminal figures in the history of artificial intelligence. He and Meta AI have been big proponents of open sourcing AI development, and have been walking the walk by open sourcing many of their biggest models, including Llama 2 and eventually Llama 3. Also, Yann has been an outspoken critic of those people in the AI community who warn about the looming danger and existential threat of AGI. He believes that AGI will be created one day, but it will be good: it will not escape human control, nor will it dominate and kill all humans. At this moment of rapid AI development, this happens to be somewhat a controversial position, and so it's been fun seeing Yann get into a lot of intense and fascinating discussions online, as we do in this very conversation. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Yann LeCun.
You've had some strong statements, technical statements, about the future of artificial intelligence, throughout your career actually, but recently as well. You've said that autoregressive LLMs are not the way we're going to make progress towards superhuman intelligence. These are the large language models like GPT-4, like Llama 2 and 3 soon, and so on. How do they work, and why are they not going to take us all the way?

For a number of reasons.
The first is that there are a number of characteristics of intelligent behavior. For example, the capacity to understand the world, understand the physical world; the ability to remember and retrieve things, persistent memory; the ability to reason; and the ability to plan. Those are four essential characteristics of intelligent systems or entities: humans, animals. LLMs can do none of those, or they can only do them in a very primitive way. They don't really understand the physical world, don't really have persistent memory, they can't really reason, and they certainly can't plan. And so if you expect a system to become intelligent without having the possibility of doing those things, you're making a mistake. That is not to say that autoregressive LLMs are not useful, they're certainly useful; or that they're not interesting, or that we can't build a whole ecosystem of applications around them, of course we can. But as a path towards human-level intelligence, they're missing essential components. And then there is another tidbit or fact that I think is very interesting.
Those LLMs are trained on enormous amounts of text, basically the entirety of all publicly available text on the internet. That's typically on the order of 10^13 tokens. Each token is typically two bytes, so that's 2×10^13 bytes as training data. It would take you or me 170,000 years to just read through this at eight hours a day. So it seems like an enormous amount of knowledge that those systems can accumulate.
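As a sanity check on those numbers: assuming a reading speed of roughly 250 words per minute and about 0.75 words per token (both assumptions of mine, not figures from the conversation), the 170,000-year claim comes out about right:

```python
tokens = 1e13                # ~10^13 tokens of training data
bytes_per_token = 2
print(f"{tokens * bytes_per_token:.0e} bytes")   # ~2e13 bytes

words = tokens * 0.75        # assumed words per token
minutes = words / 250        # assumed reading speed, words per minute
hours_per_day = 8
years = minutes / 60 / hours_per_day / 365
print(f"{years:,.0f} years of reading")          # ~171,000 years
```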
um but then you realize it's really not
that much data if you you talk to
developmental psychologist and they tell
you a four-year-old has been awake for
16,000 hours in his
life um
and the amount of information that has
uh reached the visual cortex of that
child in four
years um is about 10 to the 15 bytes and
you can compute this by estimating that
the optical nerve carry about 20 megab
megabytes per second roughly and so 10^
the 15 bytes for a four-year-old versus
2 * 10 to 13 bytes for 170,000 years
worth of reading what it tells you is
that uh through sensory input we see a
lot more information than we than we do
through
language and that despite our
intuition most of what we learn and most
of our knowledge is through our
observation and interaction with the
real world not through language
everything that we learn in the first
few years of life and uh certainly
everything that animals learn has
nothing to do with language so it would
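The same back-of-the-envelope check for the sensory figure, using the 20 MB/s optic-nerve estimate quoted above:

```python
awake_hours = 16_000                 # a four-year-old's waking hours
optic_nerve_bytes_per_s = 20e6       # ~20 MB/s, the estimate quoted above
visual_bytes = awake_hours * 3600 * optic_nerve_bytes_per_s
print(f"{visual_bytes:.1e} bytes")   # ~1.2e15, i.e. about 10^15

text_bytes = 2e13                    # all of the LLM training text
print(f"{visual_bytes / text_bytes:.0f}x more through vision")  # ~58x
```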
So it would be good to maybe push against some of the intuition behind what you're saying. It is true there's several orders of magnitude more data coming into the human mind, much faster, and the human mind is able to learn very quickly from that, filter the data very quickly. Somebody might argue, about your comparison between sensory data versus language, that language is already very compressed: it already contains a lot more information than the bytes it takes to store it, if you compare it to visual data. So there's a lot of wisdom in language, there's words and the way we stitch them together; it already contains a lot of information. So is it possible that language alone already has enough wisdom and knowledge in there to be able, from that language, to construct a world model, an understanding of the world, an understanding of the physical world that you're saying LLMs lack?
So, it's a big debate among philosophers and also cognitive scientists, whether intelligence needs to be grounded in reality. I'm clearly in the camp that yes, intelligence cannot appear without some grounding in some reality. It doesn't need to be physical reality, it could be simulated, but the environment is just much richer than what you can express in language. Language is a very approximate representation of our percepts and our mental models. I mean, there's a lot of tasks that we accomplish where we manipulate a mental model of the situation at hand, and that has nothing to do with language. Everything that's physical, mechanical, whatever: when we build something, when we accomplish a task, a model task of grabbing something, etc., we plan our action sequences, and we do this by essentially imagining the result of the outcome of a sequence of actions. That requires mental models that don't have much to do with language, and, I would argue, most of our knowledge is derived from that interaction with the physical world. So a lot of my colleagues who are more interested in things like computer vision are really in that camp, that AI needs to be embodied, essentially. And then other people, coming from the NLP side, or maybe some other motivation, don't necessarily agree with that. And philosophers are split as well.
And the complexity of the world is hard to imagine. It's hard to represent all the complexities that we take completely for granted in the real world, that we don't even imagine require intelligence, right? This is the old Moravec's paradox, from Hans Moravec, the pioneer of robotics, who said: how is it that with computers it seems to be easy to do high-level, complex tasks like playing chess and solving integrals and doing things like that, whereas the things we take for granted that we do every day, like, I don't know, learning to drive a car, or grabbing an object, we can't do with computers? We have LLMs that can pass the bar exam, so they must be smart, but then they can't learn to drive in 20 hours like any 17-year-old. They can't learn to clear out the dinner table and fill up the dishwasher like any 10-year-old can learn in one shot. Why is that? What are we missing? What type of learning or reasoning architecture or whatever are we missing that basically prevents us from having level-five self-driving cars and domestic robots?

Can a large
language model construct a world model that does know how to drive and does know how to fill a dishwasher, but just doesn't know how to deal with visual data at this time, so it can operate in a space of concepts?

So yeah, that's what a lot of people are working on. The short answer is no, and the more complex answer is: you can use all kinds of tricks to get an LLM to basically digest visual representations of images, or video, or audio for that matter. A classical way of doing this is: you train a vision system in some way, and we have a number of ways to train vision systems, either supervised, semi-supervised, self-supervised, all kinds of different ways, that will turn any image into a high-level representation, basically a list of tokens that are really similar to the kind of tokens that a typical LLM takes as an input. And then you just feed that to the LLM, in addition to the text, and you just expect the LLM, during training, to be able to use those representations to help make decisions. There's been work along those lines for quite a long time, and now you see those systems, right? I mean, there are LLMs that have some vision extension, but they're basically hacks, in the sense that those things are not trained end to end to really understand the world. They're not trained with video, for example. They don't really understand intuitive physics, at least not at the moment.

So you don't think
there's something special about intuitive physics, about sort of commonsense reasoning about the physical space, about physical reality? That, to you, is a giant leap that LLMs are just not able to do?

We're not going to be able to do this with the type of LLMs that we are working with today, and there's a number of reasons for this, but the main reason is: the way LLMs are trained is that you take a piece of text, you remove some of the words in that text, you mask them, you replace them by blank markers, and you train a gigantic neural net to predict the words that are missing. And if you build this neural net in a particular way, so that it can only look at words that are to the left of the one it's trying to predict, then what you have is a system that basically is trying to predict the next word in a text, right? So then you can feed it a text, a prompt, and you can ask it to predict the next word. It can never predict the next word exactly, and so what it's going to do is produce a probability distribution over all the possible words in your dictionary. In fact, it doesn't predict words, it predicts tokens, which are kind of subword units. And so it's easy to handle the uncertainty in the prediction there, because there's only a finite number of possible words in the dictionary, and you can just compute a distribution over them. Then what the system does is it picks a word from that distribution. Of course, there's a higher chance of picking words that have a higher probability within that distribution; so you sample from that distribution to actually produce a word, and then you shift that word into the input. And so that allows the system now to predict the second word. Once you do this, you shift it into the input, etc. That's called autoregressive prediction, which is why those LLMs should be called autoregressive LLMs, but we just call them LLMs.
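Here is a minimal sketch of the sampling loop just described. The network is a toy stand-in (all names and sizes are hypothetical); in a real LLM it would be a transformer constrained to look only at tokens to the left:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 1000     # toy vocabulary of token ids (hypothetical size)
CONTEXT = 8      # toy context window (hypothetical size)

def next_token_logits(tokens):
    # Stand-in for the trained network; a real LLM computes these scores
    # with a transformer that only attends to tokens to the left.
    h = sum(tokens[-CONTEXT:]) % VOCAB
    logits = rng.normal(size=VOCAB)
    logits[h] += 3.0   # make the toy model weakly deterministic
    return logits

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        logits = next_token_logits(tokens)
        p = np.exp(logits - logits.max())
        p /= p.sum()                    # distribution over the vocabulary
        t = rng.choice(VOCAB, p=p)      # sample one token from it
        tokens.append(int(t))           # shift it into the input, repeat
    return tokens

print(generate([1, 2, 3], 5))
```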
And there is a difference between this kind of process and a process by which, before producing a word, when you talk, when you and I talk (you and I are both bilingual), we think about what we're going to say, and it's relatively independent of the language in which we're going to say it. When we talk about, I don't know, let's say a mathematical concept or something, the kind of thinking that we're doing, and the answer that we're planning to produce, is not linked to whether we're going to say it in French or Russian or English.

Chomsky just rolled his eyes, but I understand. So you're saying that there's a bigger abstraction, a representation that goes before language and maps onto language?

Right. It's certainly true for a lot of the thinking that we do.
Is that obvious that we don't? Like, you're saying your thinking is the same in French as it is in English?

Yeah, pretty much.

Pretty much? Or is this like, how flexible are you, if there's a probability distribution?

Well, it depends what kind of thinking, right? If it's producing puns, I get much better in French than English about that.

No, but is there an abstract representation of puns? Like, is your humor abstract? When you tweet, and your tweets are sometimes a little bit spicy, is there an abstract representation in your brain of a tweet before it maps onto English?

There is an abstract representation of imagining the reaction of a reader to that text.

Or you start with laughter and then figure out how to make that happen?

Or figure out a reaction you want to cause, and then figure out how to say it so that it causes that reaction. But that's really close to language.
But think about a mathematical concept, or imagining something you want to build out of wood, or something like that. The kind of thinking you're doing has absolutely nothing to do with language, really. It's not like you necessarily have an internal monologue in any particular language; you're imagining mental models of the thing, right? I mean, if I ask you to imagine what this water bottle will look like if I rotate it 90 degrees, that has nothing to do with language. And so, clearly, there is a more abstract level of representation in which we do most of our thinking, and we plan what we're going to say, if the output is uttered words, as opposed to the output being muscle actions. We plan our answer before we produce it. LLMs don't do that; they just produce one word after the other, instinctively, if you want. It's a bit like subconscious actions, where you're distracted, you're doing something, you're completely concentrated, and someone comes to you and asks you a question, and you kind of answer the question. You don't have time to think about the answer, but the answer is easy, so you don't need to pay attention; you sort of respond automatically. That's kind of what an LLM does, right? It doesn't think about its answer, really. It retrieves it, because it's accumulated a lot of knowledge, so it can retrieve some things, but it's going to just spit out one token after the other, without planning the answer.
But you're making it sound like one token after the other, one token at a time, generation is bound to be simplistic. But if the world model is sufficiently sophisticated, then one token at a time, the most likely thing it generates is a sequence of tokens that's going to be a deeply profound thing.

OK, but then that assumes that those systems actually possess an internal world model. So it really goes to, I think, the fundamental question: can you build a really complete world model? Not complete, but one that has a deep understanding of the world.

Yeah. So, can you build this, first of all, by prediction? And the answer is, probably yes. Can you build it by predicting words? And the answer is, most probably no, because language is very poor in terms of, or weak or low-bandwidth, if you want; there's just not enough information there. So building world models means observing the world, and understanding why the world is evolving the way it is. And then the extra component of a world model is something that can predict how the world is going to evolve as a consequence of an action you might take. So a world model really is: here is my idea of the state of the world at time t, here is an action I might take; what is the predicted state of the world at time t+1? Now, that state of the world does not need to represent everything about the world; it just needs to represent enough that's relevant for this planning of the action, but not necessarily all the details.
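In symbols, the world model being described is a predictor s(t+1) = f(s(t), a(t)). A toy instance just to make the interface concrete; the linear dynamics here are invented, whereas in the approach being discussed f would be a learned network operating on abstract representations, not raw pixels:

```python
import numpy as np

# Hypothetical linear dynamics, purely illustrative.
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # state transition
B = np.array([[0.0], [0.1]])             # effect of the action

def world_model(state_t, action_t):
    """Predict the state of the world at time t+1, given the state at
    time t and an action taken at time t."""
    return A @ state_t + B @ action_t

s = np.array([0.0, 0.0])
print(world_model(s, np.array([1.0])))   # predicted state at time t+1
```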
Now, here is the problem.
You're not going to be able to do this with generative models. So, a generative model trained on video, and we've tried to do this for 10 years: you take a video, show a system a piece of video, and then ask it to predict the remainder of the video. Basically, predict what's going to happen, one frame at a time. Do the same thing as the autoregressive LLMs do, but for video, right? Either one frame at a time, or a group of frames at a time. A large video model, if you want. The idea of doing this has been floating around for a long time, and at FAIR, some colleagues and I have been trying to do this for about 10 years. And you can't really do the same trick as with LLMs, because, as I said, you can't predict exactly which word is going to follow a sequence of words, but you can predict the distribution over words. Now, if you go to video, what you would have to do is predict the distribution over all possible frames in a video, and we don't really know how to do that properly. We do not know how to represent distributions over high-dimensional continuous spaces in ways that are useful. And therein lies the main issue. The reason we can't do this is because the world is incredibly more complicated and richer, in terms of information, than text. Text is discrete; video is high-dimensional and continuous, with a lot of details. So if I take a video of this room, and the video is a camera panning around, there is no way I can predict everything that's going to be in the room as I pan around. The system cannot predict what's going to be in the room as the camera is panning. Maybe it's going to predict that this is a room where there's a light and there is a wall and things like that; it can't predict what the painting on the wall looks like, or what the texture of the couch looks like, certainly not the texture of the carpet. So there's no way I can predict all those details. One way, possibly, to handle this, which we've been working on for a long time, is to have a model that has what's called a latent variable. The latent variable is fed to a neural net, and it's supposed to represent all the information about the world that you don't perceive yet, and that you need to augment the system with, for the prediction to do a good job at predicting pixels, including the fine texture of the carpet and the couch, and the painting on the wall.
That has been a complete failure, essentially. We've tried lots of things: we tried straight neural nets, we tried GANs, we tried VAEs, all kinds of regularized autoencoders. We tried many things. We also tried those kinds of methods to learn good representations of images or video that could then be used as input to, for example, an image classification system, and that also basically failed. All the systems that attempt to predict missing parts of an image or video from a corrupted version of it, basically: take an image or a video, corrupt it or transform it in some way, and then try to reconstruct the complete video or image from the corrupted version, and then hope that, internally, the system will develop good representations of images that you can use for object recognition, segmentation, whatever. That has been essentially a complete failure. And it works really well for text; that's the principle that is used for LLMs.

So where is the failure exactly? Is it that it's very difficult to form a good representation of an image, a good embedding of all the important information in the image? Is it in terms of the consistency of image to image to image, the images that form the video? If we do a highlight reel of all the ways you failed, what does that look like?

OK, so first of all, I have to tell you exactly what doesn't work, because there is something else that does work. The thing that does not work is training a system to learn representations of images by training it to reconstruct a good image from a corrupted version of it. That's what doesn't work. And we have a whole slew of techniques for this that are variants of denoising autoencoders, something called MAE, developed by some of my colleagues at FAIR, masked autoencoder. It's basically like the LLMs, or things like this, where you train the system by corrupting text, except you corrupt images: you remove patches from it, and you train a gigantic neural net to reconstruct. The features you get are not good, and you know they're not good, because if you now train the same architecture, but you train it supervised, with labeled data, with textual descriptions of images, etc., you do get good representations, and the performance on recognition tasks is much better than if you do this self-supervised pre-training.

So the architecture is good?

The architecture of the encoder is good, but the fact that you train the system to reconstruct images does not lead it to learn good generic features of images.

When you train it in a self-supervised way?

Self-supervised by reconstruction.

Yeah, by reconstruction.

OK, so what's the
alternative? The alternative is joint embedding.

What is joint embedding? What are these architectures that you're so excited about?

OK, so now, instead of training a system to encode the image and then training it to reconstruct the full image from a corrupted version, you take the full image, you take the corrupted or transformed version, you run them both through encoders, which in general are identical but not necessarily, and then you train a predictor on top of those encoders to predict the representation of the full input from the representation of the corrupted one. So, joint embedding, because you're taking the full input and the corrupted version, or transformed version, and running them both through encoders, so you get a joint embedding. And then you're saying, can I predict the representation of the full one from the representation of the corrupted one? I call this a JEPA, which means joint embedding predictive architecture, because it's joint embedding, and there is this predictor that predicts the representation of the good guy from the bad guy.
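A minimal PyTorch sketch of the training step just described (toy sizes, hypothetical modules, not FAIR's actual implementation): encode the full and corrupted inputs, and train a predictor to map the corrupted representation to the full one.

```python
import torch
import torch.nn as nn

D = 128  # representation size (hypothetical)
enc = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, D), nn.ReLU(), nn.Linear(D, D))
pred = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))
opt = torch.optim.Adam(list(enc.parameters()) + list(pred.parameters()), lr=1e-3)

def corrupt(x):
    # Stand-in corruption: mask out a block of the "image".
    x = x.clone()
    x[:, 8:24, 8:24] = 0.0
    return x

x = torch.randn(16, 32, 32)        # a batch of toy images
s_full = enc(x)                     # representation of the full input
s_corr = enc(corrupt(x))            # representation of the corrupted input
loss = ((pred(s_corr) - s_full) ** 2).mean()  # predict the full one from the
opt.zero_grad(); loss.backward(); opt.step()  # corrupted one, in representation space
# Trained with this loss alone, the encoder can collapse to a constant
# output; preventing that is exactly what the training methods discussed
# next (contrastive, then non-contrastive) are about.
print(float(loss))
```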
And the big question is, how do you train something like this? Until five or six years ago, we didn't have particularly good answers for how you train those things, except for one, called contrastive learning.
The idea of contrastive learning is: you take a pair of images that are, again, an image and a corrupted or degraded version somehow, or a transformed version of the original one, and you train the predicted representation to be the same as that of the original. If you only do this, the system collapses: it basically completely ignores the input and produces representations that are constant. So the contrastive methods avoid this, and those things have been around since the early '90s; I had a paper on this in 1993. You also show pairs of images that you know are different, and then you push away the representations from each other. So you say, not only should representations of things that we know are the same be the same, or be similar, but representations of things that we know are different should be different. And that prevents the collapse, but it has some limitations, and there's a whole bunch of techniques that have appeared over the last six or seven years that can revive this type of method, some of them from FAIR, some of them from Google and other places, but there are limitations to those contrastive methods. What has changed in the last three or four years is that now we have methods that are non-contrastive, so they don't require those negative contrastive samples, images that we know are different. You train them only with images that are different versions, or different views, of the same thing, and you rely on some other tricks to prevent the system from collapsing. And we have a dozen different methods for this now.
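A toy version of the contrastive idea (hypothetical names and sizes): pull the representations of matching pairs together, push the representations of pairs known to be different apart.

```python
import torch

def contrastive_loss(z_a, z_b, margin=1.0):
    """z_a, z_b: batches of representations; row i of each is a positive
    pair (two views of the same image); rows i != j are negative pairs."""
    # Pull positives together.
    pos = ((z_a - z_b) ** 2).sum(dim=1).mean()
    # Push each representation away from every non-matching one (hinge).
    dists = torch.cdist(z_a, z_b)                     # pairwise distances
    neg_mask = ~torch.eye(len(z_a), dtype=torch.bool)
    neg = torch.relu(margin - dists[neg_mask]).mean()
    return pos + neg

z_a, z_b = torch.randn(8, 128), torch.randn(8, 128)
print(float(contrastive_loss(z_a, z_b)))
```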
So what is the fundamental difference between joint embedding architectures and LLMs? Can JEPA take us to AGI? Although I should say that you don't like the term AGI, and we'll probably argue; I think every single time I've talked to you, we've argued about the G in AGI.

Yes.

I get it, I get it. We'll probably continue to argue about it; it's great. You like AMI, because you like French, and ami is, I guess, 'friend' in French.

Yes.

And AMI stands for advanced machine intelligence.

Right. But either way, can JEPA take us towards that advanced machine intelligence?

Well, so it's a first step.
OK, so first of all, what's the difference with generative architectures like LLMs? LLMs, or vision systems that are trained by reconstruction, generate the inputs. They generate the original input that is non-corrupted, non-transformed, so you have to predict all the pixels, and there is a huge amount of resources spent in the system to actually predict all those pixels, all the details. In a JEPA, you're not trying to predict all the pixels; you're only trying to predict an abstract representation of the inputs, and that's much easier in many ways. So what the JEPA system, when it's being trained, is trying to do is extract as much information as possible from the input, but yet only extract information that is relatively easily predictable. There's a lot of things in the world that we cannot predict. For example, if you have a self-driving car driving down the street or road, there may be trees around the road, and it could be a windy day, so the leaves on the trees are moving in kind of semi-chaotic, random ways that you can't predict, and you don't care; you don't want to predict them. So what you want is your encoder to basically eliminate all those details. It will tell you there are moving leaves, but it's not going to keep the details of exactly what's going on. And so, when you do the prediction in representation space, you're not going to have to predict every single pixel of every leaf. Not only is that a lot simpler, but it also allows the system to essentially learn an abstract representation of the world, where what can be modeled and predicted is preserved, and the rest is viewed as noise and eliminated by the encoder. So it kind of lifts the level of abstraction of the representation. If you think about this, it's something we do absolutely all the time. Whenever we describe a phenomenon, we describe it at a particular level of abstraction. We don't always describe every natural phenomenon in terms of quantum field theory; that would be impossible. So we have multiple levels of abstraction to describe what happens in the world, starting from quantum field theory, to atomic theory and molecules, to chemistry, materials, and all the way up to concrete objects in the real world and things like that. So we can't just model everything at the lowest level, and that's what the idea of JEPA is really about: learning abstract representations in a self-supervised manner, and you can do it hierarchically as well. That, I think, is an essential component of an intelligent system. And in language, we can get away without doing this, because language is already, to some level, abstract, and has already eliminated a lot of information that is not predictable. So we can get away without lifting the abstraction level, by directly predicting words.
by directly predicting words so joint
embedding it's still generative but it's
generative in this abstract
representation space yeah and you're
saying language we were lazy with
language cuz we already got the abstract
representation for free and now we have
to zoom out actually think about
generally intelligent systems we have
to deal with a full mess of physical
reality of reality and you can't you you
do have to do this step of jumping
from uh the full Rich detailed reality
to a uh abstract representation of that
reality based on which you can then
reason and all that kind of stuff right
and the thing is those cell supervised
algorithm that that learn by
prediction even in representation space
uh they learn more uh concept if the
input data you Feit them is more
redundant the more redundancy there is
in the data the more they're able to
capture some internal structure of it
and so there there is way more
redundancy in structure in perceptual uh
inputs sensory input like like like
Vision than there is in
uh text which is not nearly as redundant
this is back to the question you were
asking a few minutes ago language might
represent more information really
because it's already compressed you're
you're right about that but that means
it's also less redundant and so self
supervision will not work as well is it
possible to join the self-supervised
training on visual data and
self-supervised training on language
data there is a huge amount of knowledge
even though you talk down about those 10^13 tokens. Those 10^13 tokens represent a large fraction of what us humans have figured out: both the talk on Reddit and the contents of all the books and the articles, the full spectrum of human intellectual creation. So is it possible to join those two together?

Well, eventually yes, but I think if we do this too early, we run the risk of being tempted to cheat. And in fact, that's what people are doing at the moment with vision-language models: we're basically cheating. We are using language as a crutch to help the deficiencies of our vision systems, to kind of learn good representations from images and video. And the problem with this is that we might improve our vision-language systems a bit, I mean, our language models, by feeding them images, but we're not going to get to the level of even the intelligence, or level of understanding of the world, of a cat or a dog, which doesn't have language. They don't have language, and they understand the world much better than any LLM. They can plan really complex actions and sort of imagine the result of a bunch of actions. How do we get machines to learn that before we combine that with language? Obviously, if we combine this with language, this is going to be a winner, but before that, we have to focus on how do we get systems to learn how the world works.
So this kind of joint embedding predictive architecture, for you, is going to be able to learn something like common sense, something like what a cat uses to predict how to mess with its owner most optimally by knocking over a thing?

That's the hope. In fact, the techniques we're using are non-contrastive, so not only is the architecture non-generative, the learning procedures we're using are non-contrastive. We have two sets of techniques. One set is based on distillation, and there's a number of methods that use this principle: one by DeepMind called BYOL, a couple by FAIR, one called VICReg and another one called I-JEPA. And VICReg, I should say, is not a distillation method, actually, but I-JEPA and BYOL certainly are. And there's another one also called DINO, also produced at FAIR. The
idea of those things is that you take the full input, let's say an image, you run it through an encoder that produces a representation, and then you corrupt that input or transform it, and run it through what amounts to essentially the same encoder, with some minor differences. And then you train a predictor, sometimes the predictor is very simple, sometimes it doesn't exist, but you train a predictor to predict the representation of the first, uncorrupted input from the corrupted input. But you only train the second branch; you only train the part of the network that is fed with the corrupted input. The other network, you don't train. But since they share the same weights, when you modify the first one, it also modifies the second one. And with various tricks, you can prevent the system from collapsing, with the collapse of the type I was explaining before, where the system basically ignores the input. So that works very well. The two techniques we developed at FAIR, DINO and I-JEPA, work really well for that.
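A minimal sketch of that branch asymmetry, assuming shared encoder weights and no gradient through the uncorrupted branch (hypothetical names; BYOL and DINO differ in important details):

```python
import torch
import torch.nn as nn

D = 64
enc = nn.Sequential(nn.Linear(256, D), nn.ReLU(), nn.Linear(D, D))
pred = nn.Linear(D, D)
opt = torch.optim.Adam(list(enc.parameters()) + list(pred.parameters()), lr=1e-3)

def corrupt(x):
    return x + 0.5 * torch.randn_like(x)   # stand-in corruption

x = torch.randn(32, 256)
with torch.no_grad():           # the branch fed the uncorrupted input is
    target = enc(x)             # NOT trained: no gradients flow through it
online = pred(enc(corrupt(x)))  # only the branch fed the corrupted input trains
loss = ((online - target) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
# Because both branches share the same weights, updating the trained
# branch also changes the untrained branch at the next step.
print(float(loss))
```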
So what kind of data are we talking about here?

So there's several scenarios. One scenario is: you take an image, you corrupt it by changing the cropping, for example, changing the size a little bit, maybe changing the orientation, blurring it, changing the colors, doing all kinds of horrible things to it.

But basic horrible things.

Basic horrible things that sort of degrade the quality a little bit and change the framing, crop the image. And in some cases, in the case of I-JEPA, you don't need to do any of this; you just mask some parts of it. You just basically remove some regions, like a big block, essentially, and then run it through the encoders and train the entire system, encoder and predictor, to predict the representation of the good one from the representation of the corrupted one. So I-JEPA doesn't need to know that it's an image, for example, because the only thing it needs to know is how to do this masking, whereas with DINO, you need to know it's an image, because you need to do things like geometric transformations and blurring and things like that, that are really image-specific.
A more recent version of this that we have is called V-JEPA. It's basically the same idea as I-JEPA, except it's applied to video. So now you take a whole video and you mask a whole chunk of it, and what we mask is actually kind of a temporal tube, so a whole segment of each frame in the video, over the entire video.

And that tube is statically positioned throughout the frames? Literally a straight tube?

The tube, yeah. Typically it's 16 frames or something, and we mask the same region over the entire 16 frames. It's a different one for every video, obviously. And then, again, we train that system so as to predict the representation of the full video from the partially masked video. That works really well. It's the first system that we have that learns good representations of video, so that when you feed those representations to a supervised classifier head, it can tell you what action is taking place in the video with pretty good accuracy. So it's the first time we get something of that quality.
formed that means there's something to
this yeah um we also preliminary result
that seem to indicate that the
representation allows us allow our
system to tell whether the video is
physically possible or completely
impossible because some object
disappeared or an object you know
suddenly jumped from one location to
another or or change shape or something
so it's able to capture some physical
con some physic based constraints about
the reality represented in the video
yeah about the appearance and The
Disappearance of objects yeah that's
That's really new. OK, but can this actually get us to this kind of world model that understands enough about the world to be able to drive a car?

Possibly. It's going to take a while before we get to that point, but there are already systems based on this idea. And what you need for this is a slightly modified version, where, imagine that you have a complete video, and what you're doing to this video is that you're either translating it in time towards the future, so you only see the beginning of the video but you don't see the latter part of it that is in the original one, or you just mask the second half of the video, for example. And then you train a JEPA system of the type I described to predict the representation of the full video from the shifted one, but you also feed the predictor with an action. For example, the wheel is turned 10 degrees to the right, or something. So if it's a dash cam in a car, and you know the angle of the wheel, you should be able to predict, to some extent, what's going to happen to what you see. You're not going to be able to predict all the details of objects that appear in the view, obviously, but at an abstract representation level, you can probably predict what's going to happen.
So now what you have is an internal model that says: here is my idea of the state of the world at time t, here is an action I'm taking, here is a prediction of the state of the world at time t+1, t plus delta t, t plus two seconds, whatever it is. If you have a model of this type, you can use it for planning. So now you can do what LLMs cannot do, which is planning what you're going to do, so as to arrive at a particular outcome or satisfy a particular objective. You can have a number of objectives. I can predict that if I have an object like this, and I open my hand, it's going to fall. And if I push it with a particular force on the table, it's going to move. If I push the table itself, it's probably not going to move with the same force. We have this internal model of the world in our mind, which allows us to plan sequences of actions to arrive at a particular goal. And so now, if you have this world model, we can imagine a sequence of actions, predict what the outcome of that sequence of actions is going to be, measure to what extent the final state satisfies a particular objective, like moving the bottle to the left of the table, and then plan a sequence of actions that will minimize this objective at runtime. We're not talking about learning; we're talking about inference time. So this is planning, really. And in optimal control, this is a very classical thing: it's called model predictive control. You have a model of the system you want to control that can predict the sequence of states corresponding to a sequence of commands, and you're planning a sequence of commands so that, according to your world model, the end state of the system will satisfy an objective that you fix. This is the way rocket trajectories have been planned since computers have been around, so since the early '60s, essentially.
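A toy instance of the model-predictive-control loop just described, reusing the invented linear world model from the earlier sketch and a simple random-shooting planner (real planners are more sophisticated; this is just the shape of the computation):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])      # hypothetical dynamics
B = np.array([[0.0], [0.1]])

def world_model(s, a):                       # s(t+1) = f(s(t), a(t))
    return A @ s + B @ a

def rollout(s, actions):
    for a in actions:                        # imagine a sequence of actions
        s = world_model(s, a)
    return s                                 # predicted final state

def plan(s0, goal, horizon=20, n_candidates=256):
    """Random-shooting MPC: imagine candidate action sequences, predict
    their outcomes with the world model, and keep the sequence whose
    final state best satisfies the objective (distance to the goal)."""
    best, best_cost = None, np.inf
    for _ in range(n_candidates):
        actions = rng.uniform(-1, 1, size=(horizon, 1))
        cost = np.linalg.norm(rollout(s0, actions) - goal)
        if cost < best_cost:
            best, best_cost = actions, cost
    return best, best_cost

actions, cost = plan(np.zeros(2), np.array([1.0, 0.0]))
print(cost)   # distance of the predicted final state from the goal
```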
So yes, for model predictive control. But you also often talk about hierarchical planning. Can hierarchical planning emerge from this somehow?

Well, no. You will have to build a specific architecture to allow for hierarchical planning. Hierarchical planning is absolutely necessary if you want to plan complex actions. If I want to go from, let's say, New York to Paris, this is the example I use all the time, and I'm sitting in my office at NYU, my objective that I need to minimize is my distance to Paris, at a high level, a very abstract representation of my location. I would have to decompose this into two subgoals. First one: go to the airport. Second one: catch a plane to Paris. OK, so my subgoal is now going to the airport; my objective function is my distance to the airport. How do I go to the airport? Well, I have to go in the street and hail a taxi, which you can do in New York. OK, now I have another subgoal: go down to the street. Well, that means going to the elevator, going down the elevator, walking out to the street. How do I go to the elevator? I have to stand up from my chair, open the door of my office, go to the elevator, push the button. How do I get up from my chair? You can imagine going down, all the way down, to basically what amounts to millisecond-by-millisecond muscle control. And obviously, you're not going to plan your entire trip from New York to Paris in terms of millisecond-by-millisecond muscle control. First, that would be incredibly expensive, but it would also be completely impossible, because you don't know all the conditions of what's going to happen: how long it's going to take to catch a taxi, or to go to the airport with traffic. You would have to know exactly the condition of everything to be able to do this planning, and you don't have the information. So you have to do this hierarchical planning, so that you can start acting and then sort of replan as you go. And nobody really knows how to do this in AI. Nobody knows how to train a system to learn the appropriate multiple levels of representation so that hierarchical planning works.
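The New York-to-Paris decomposition above, written out by hand as a toy goal tree; the point of the passage is precisely that hand-coding this is easy, while learning the levels of representation is the unsolved part:

```python
# The decomposition from the example, hand-coded as a nested goal tree.
plan = {
    "go to Paris": [
        {"go to the airport": [
            {"go down to the street": [
                "stand up from chair",
                "open office door",
                "go to the elevator",
                "push the button",
            ]},
            "hail a taxi",
        ]},
        "catch a plane to Paris",
    ],
}

def leaves(node):
    """Flatten the tree into the lowest-level actions, in order."""
    if isinstance(node, str):
        return [node]
    if isinstance(node, dict):
        node = [v for vs in node.values() for v in vs]
    return [leaf for child in node for leaf in leaves(child)]

print(leaves(plan))
```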
Does something like that already emerge? So, like, can you use a state-of-the-art LLM to get you from New York to Paris by doing exactly the kind of detailed set of questions that you just did? Which is: can you give me a list of 10 steps I need to do to get from New York to Paris; and then, for each of those steps, can you give me a list of 10 steps for how I make that step happen; and for each of those steps, can you give me a list of 10 steps to make each one of those, until you're moving your individual muscles? Or maybe not, whatever you can actually act upon using your mind.

Right. So there's a lot of questions that are sort of implied by this.
The first thing is: LLMs will be able to answer some of those questions, down to some level of abstraction, under the condition that they've been trained with similar scenarios in their training set.

They would be able to answer all of those questions, but some of them may be hallucinated, meaning non-factual.

Yeah, true. I mean, they will probably produce some answer, except they're not going to be able to really produce the millisecond-by-millisecond muscle control of how you stand up from your chair, right? So, down to some level of abstraction, where things can be described by words, they might be able to give you a plan, but only under the condition that they've been trained to produce those kinds of plans. They're not going to be able to plan for situations that they never encountered before. They basically are going to have to regurgitate the template that they've been trained on.
But where, just for the example of New York to Paris, is it going to start getting into trouble? Like, at which layer of abstraction? Because I can imagine almost every single part of that, an LLM will be able to answer somewhat accurately, especially when you're talking about New York and Paris, major cities.

So, I mean, certainly an LLM would be able to solve that problem if you fine-tune it for it. And so I can't say that an LLM cannot do this; it can do this if you train it for it, there's no question, down to a certain level, where things can be formulated in terms of words. But if you want to go down to, like, how do you climb down the stairs, or just stand up from your chair, in terms of words, you can't do it. You need the experience of the physical world, which is much higher bandwidth than what you can express in words, in human language.
So, everything we've been talking about on the joint embedding space, is it possible that that's what we need for, like, the interaction with physical reality, on the robotics front, and then the LLMs are the thing that sits on top of it, for the bigger reasoning about, yeah, the fact that I need to book a plane ticket, and I need to know how to go to the websites, and so on?

Sure. And a lot of plans that people know about, that are relatively high-level, are actually learned. Most people don't invent plans by themselves. We have some ability to do this, of course, obviously, but most plans that people use are plans that they've been trained on, like they've seen other people use those plans, or they've been told how to do things, right? You can't invent plans. Like, take a person who's never heard of airplanes and tell them, how do you go from New York to Paris? They're probably not going to be able to deconstruct the whole plan, unless they've seen examples of that before. So certainly, LLMs are going to be able to do this. But then, how do you link this to the low level of actions? That needs to be done with things like JEPA, that basically lift the abstraction level of the representation without attempting to reconstruct every detail of the situation. That's why we need JEPAs.
I would love to sort of linger on your skepticism around autoregressive LLMs. So, one way I would like to test that skepticism is: everything you say makes a lot of sense, but if I apply everything you said today, and in general, to, I don't know, 10 years ago, maybe a little bit less, no, let's say three years ago, I wouldn't be able to predict the success of LLMs. So does it make sense to you that autoregressive LLMs are able to be so damn good?

Yes.

Can you explain your intuition? Because if I were to take your wisdom and intuition at face value, I would say there's no way autoregressive LLMs, one token at a time, would be able to do the kind of things they're doing.

No, there's
one thing that autoregressive LLMs, or LLMs in general, not just the autoregressive ones, but including the BERT-style bidirectional ones, are exploiting, and it's self-supervised learning. And I've been a very strong advocate of self-supervised learning for many years. So those things are an incredibly impressive demonstration that self-supervised learning actually works. The idea, you know, it didn't start with BERT, but it was really kind of a good demonstration with this. So, the idea that you take a piece of text, you corrupt it, and then you train some gigantic neural net to reconstruct the parts that are missing, that has produced an enormous amount of benefits. It allowed us to create systems that understand language, systems that can translate hundreds of languages in any direction, systems that are multilingual, so it's a single system that can be trained to understand hundreds of languages and translate in any direction, and produce summaries, and then answer questions and produce text. And then there's a special case of it, which is the autoregressive trick, where you constrain the system to not elaborate a representation of the text from looking at the entire text, but only predict a word from the words that come before. And you do this by constraining the architecture of the network, and that's what you can build an autoregressive LLM from. So there was a surprise, many years ago, with what's called decoder-only LLMs, so systems of this type that are just trying to produce words from the previous ones. And the fact that when you scale them up, they tend to really kind of understand more about language: when you train them on lots of data and you make them really big, that was kind of a surprise. And that surprise occurred quite a while back, with work from Google, Meta, OpenAI, etc., going back to the GPT kind of work, generative pre-trained transformers.

You mean, like, GPT-2? Like, there's a certain place where you start to realize scaling might actually keep giving us an emergent benefit?

Yeah, I mean, there was work from various places, but if you want to place it in the GPT timeline, that would be around GPT-2, yeah.

Well, I just, because you
said it, you're so charismatic, and you said so many words. But, self-supervised learning, yes. But again, the same intuition you're applying to saying that autoregressive LLMs cannot have a deep understanding of the world: if we just apply that same intuition, does it make sense to you that they're able to form enough of a representation of the world to be damn convincing, essentially passing the original Turing test with flying colors?

Well, we're
fooled by their fluency, right? We just assume that if a system is fluent in manipulating language, then it has all the characteristics of human intelligence. But that impression is false. We're really fooled by it.

What do you think Alan Turing would say? Without understanding anything, just hanging out with it.

Alan Turing would decide that the Turing test is a really bad test. OK. This is what the AI community has decided many years ago, that the Turing test was a really bad test of intelligence.

What would Hans Moravec say about the large language models?

Hans Moravec would say that Moravec's paradox still applies.

OK. You don't think he
would be really impressed?

No, of course everybody would be impressed. But it's not a question of being impressed or not; it's a question of knowing what the limits of those systems are. Again, they are impressive, they can do a lot of useful things, there's a whole industry that is being built around them, they're going to make progress. But there is a lot of things they cannot do, and we have to realize what they cannot do, and then figure out how we get there. And I'm seeing this from basically 10 years of research on the idea of self-supervised learning, actually that's going back more than 10 years, but the idea of self-supervised learning: basically, capturing the internal structure of a set of inputs without training the system for any particular task, learning representations.
You know, the conference I co-founded 14 years ago is called the International Conference on Learning Representations. That's the entire issue that deep learning is dealing with, and it's been my obsession for almost 40 years now. So learning representations is really the thing. For the longest time, we could only do this with supervised learning, and then we started working on what we used to call unsupervised learning, and sort of revived the idea of unsupervised learning in the early 2000s, with Yoshua Bengio and Geoff Hinton. Then we discovered that supervised learning actually works pretty well if you can collect enough data, and so the whole idea of unsupervised, self-supervised learning kind of took a backseat for a bit. And then I kind of tried to revive it in a big way, starting in 2014, basically, when we started FAIR, really pushing for finding new methods to do self-supervised learning, both for text and for images and for video and audio. And some of that work has been incredibly successful. I mean, the reason why we have multilingual translation systems, things to do content moderation on Meta, for example on Facebook, that are multilingual, that understand whether a piece of text is hate speech or not, or something, is due to that progress using self-supervised learning for NLP, combining this with transformer architectures and blah blah blah. But that's the big success of self-supervised learning. We had similar success in speech recognition, a system called wav2vec, which is also a joint embedding architecture, by the way, trained with contrastive learning. And that system can also produce speech recognition systems that are multilingual, with mostly unlabeled data and only a few minutes of labeled data needed to actually do speech recognition. That's amazing. We have systems now, based on those combinations of ideas, that can do real-time translation of hundreds of languages into each other, speech to speech.

Speech to speech, even including, which is fascinating, languages that don't have written forms.

That's right.

They're spoken only.

That's right. We don't go through text; it goes directly from speech to speech, using an internal representation of kind of speech units that are discrete. It's called textless NLP; we used to call it this way.
So yeah, I mean, incredible success there. And then, for 10 years, we tried to apply this idea to learning representations of images, by training a system to predict videos, learning intuitive physics by training a system to predict what's going to happen in the video. And we tried and tried and failed and failed, with generative models, with models that predict pixels. We could not get them to learn good representations of images; we could not get them to learn good representations of videos. And we tried many times; we published lots of papers on it. They kind of sort of worked, but not really great. It started working when we abandoned the idea of predicting every pixel, and basically just did the joint embedding and predicted in representation space. That works. So there's ample evidence that we're not going to be able to learn good representations of the real world using generative models. So I'm telling people: everybody is talking about generative AI; if you're really interested in human-level AI, abandon the idea of generative AI.

OK. But you really think
it's possible to get far with the joint embedding representation? So, like, there's commonsense reasoning, and then there's high-level reasoning. I feel like those are two... The kind of reasoning that LLMs are able to do, OK, let me not use the word reasoning, but the kind of stuff that LLMs are able to do, seems fundamentally different than the commonsense reasoning we use to navigate the world. It seems like we're going to need both. Would you be able to get, with the JEPA type of approach looking at video, would you be able to learn, let's see, how to get from New York to Paris, or how to understand the state of politics in the world today? These are things where various humans generate a lot of language and opinions on, in the space of language, but don't visually represent in any clearly compressible way.

Right. Well, there's a lot of situations that might be difficult for a purely language-based system to know. Like, OK, you can probably learn, from reading text, the entirety of the publicly available text in the world, that I cannot get from New York to Paris by snapping my fingers; that's not going to work, right?

Yes.

But there's probably more complex scenarios of this type, which an LLM may never have encountered, and may not be able to determine whether they're possible or not. So that link from the low level to the high level: the thing is that the high level that language expresses is based on the common experience of the low level, which LLMs currently do not have. When we talk to each other, we know we have a common experience of the world; a lot of it is similar. And LLMs don't have that.

But, see, there,
it's present you and I have a common
experience of the world in terms of the
physics of how gravity works and stuff
like this and
that common knowledge of the world I
feel like is there in the language we
don't explicitly express it but if you
have a huge amount of text you're going
to get this stuff that's between the
lines you're going to you're going in
order to um form a consistent world mod
you're going to have to understand how
gravity works even if you don't have an
explicit explanation of gravity so even
though in the case of gravity there is
explicit explanations of gravity and
wiia but uh you're like the stuff that
we think of as common sense reasoning I
feel like to generate language correctly
you're going to have to figure that out
now you could say as you have there's
not enough text okay so what you don't
think so no I agree with what you just
said, which is that to be able to do high-level common sense, to have high-level common sense, you need to have the low-level common sense to build on top of, yeah, but that's not there, that's not there in LLMs. LLMs are purely trained from text. So then the other
statement you made: I would not agree with the fact that
implicit in all languages in the world
is the underlying reality there's a lot
about underlying reality which is not
expressed in language is that obvious to
you yeah
totally. So, like, all the conversations we have... what, okay, there's the dark web, meaning, whatever, the private conversations like DMs and stuff like this, which is much, much larger probably than what's available, than what LLMs are trained on.
You don't need to communicate the stuff that is common. But the humor, all of it? No, you do. Like, you don't need
to but it comes through through like you
like if I accidentally uh knock this
over you'll probably make fun of me and
in the content of you making fun of me will be an explanation of the fact
that cups fall and then you know gravity
Works in this way and then you you'll
have some very vague information about
what kind of things explode when they
hit the ground and then maybe you'll
make a joke about entropy or something
like this, and you'll never be able
to reconstruct this again like okay you
make a a little joke like this and
there'll be trillions of other jokes and
from the jokes you can piece together
the fact that gravity works and mugs can
break and all this kind of stuff you
don't need to see. It'll be very inefficient; it's easier, like, to not knock the thing over. Yeah, but I feel like it would be
there if you have enough of that data. I just think that most of the information of this type that we have accumulated when we were babies is just not present in text, in any description, essentially; and the sensory data is a much richer source for getting that kind of understanding. I
mean, that's the 16,000 hours of wake time of a four-year-old and, you know, 10^15 bytes going through vision,
just Vision right there is a
similar uh bandwidth you know of touch
and uh a little less through audio and
then text, language, doesn't come in until, like, you know, a year into life. And by the time you are nine years old,
you've learned about gravity, you know about inertia, you know about stability, you know about the distinction between animate and inanimate objects. You know, by 18
months you know about like uh why people
want to do things and you help them if
they can't you know I mean there's a lot
of things that you learn mostly by
observation really uh not even through
interaction. In the first few months of life, babies don't really have any
influence on the world they can only
observe, right, and you accumulate, like, a gigantic amount of knowledge just from that. So that's what we're missing from current AI
systems I think in one of your slides
you have this nice plot that is one of
the ways you show that llms are limited
I wonder if you could talk about
hallucinations from your
perspective: why hallucinations happen in large language models, and to what degree is that a fundamental
flaw of large language models right so
because of the autoregressive prediction, every time an LLM produces a token or word, there is some level of probability for that word to take you out of the set of reasonable answers. And if you assume, which is a very strong assumption, that those errors are independent across a sequence of tokens being produced, mhm, what that means is that every time you produce a token, the probability that you stay within the set of correct answers decreases, and it decreases
exponentially. So there's a strong, like you said, assumption there: that if there's a non-zero probability of making a mistake, which there appears to be, then there's going to be a kind of drift. Yeah, and that drift is exponential; it's like errors accumulate, right. So the probability that an answer would be nonsensical increases exponentially with the number of tokens.
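A minimal Python sketch of this compounding-error argument, under the independence assumption stated above; the per-token error rate of 0.01 is an arbitrary illustration, not a measured value:

    # If each token independently has probability e of leaving the set of
    # acceptable answers, the chance the whole answer stays acceptable
    # decays exponentially with its length n.
    def p_answer_still_correct(e: float, n: int) -> float:
        return (1.0 - e) ** n

    for n in (10, 100, 1000):
        print(n, p_answer_still_correct(0.01, n))
    # 10 -> ~0.904, 100 -> ~0.366, 1000 -> ~0.00004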
Is that obvious to you, by the way? Like, well, mathematically speaking maybe, but isn't there a kind of gravitational pull towards the truth, because on average hopefully the truth is well represented in the training set? No, it's basically
a struggle against uh the curse of
dimensionality so the way you can
correct for this is that you fine-tune
the
system by having it produce answers for
all kinds of questions that people might
come up with, mhm. And people are people, so a lot of the questions that they have are very similar to each other, so you can probably cover, you know, 80% or whatever of the questions that people will ask by, you know, collecting data,
and then um and then you fine tune the
system to produce good answers for all
of those things and it's probably going
to be able to learn that because it's
got a lot of capacity to to learn um but
then there is you know the enormous set
of prompts that you have not covered
during training and that set is enormous
like within the set of all possible
prompts the proportion of prompts that
have
been used for training is absolutely tiny; it's a tiny, tiny, tiny subset of all possible prompts. And so
the system will behave properly on the prompts that it's been either pre-trained on or fine-tuned on, but then there is an
entire space of things that it cannot
possibly have been trained on because
it's just that the number is gigantic. So whatever training the system has been subject to, to produce appropriate answers, you can break it by finding a prompt that will be outside of the set of prompts it has been trained on, or things that are similar, and then it will just produce complete
nonsense. When you say prompt, do you mean that exact prompt, or do you mean a prompt that's in many parts very different? Like, is it easy to ask a question or to say a thing that hasn't been said before on the internet?
I mean people have come up with uh
things where like you you put a
essentially a random sequence of
characters in The Prompt and that's
enough to kind of throw the system uh
into a mode where you know it it's going
to answer something completely different
than it would have answered without this
so that's a way to jailbreak the system, basically get it to, you know, go outside of its conditioning, right. So that
that's a very clear demonstration of it
but of
course, you know, that goes outside of what it is designed to do, right. If you actually stitch together reasonably grammatical sentences, is it that easy to break it? Yeah,
some people have done things like: you write a sentence in English, or you ask a question in English, and it produces a perfectly fine answer, and then you just substitute a few words by the same words in another language, and all of a sudden the answer is complete nonsense. Yeah. So I guess
what I'm saying is like which fraction
of prompts that humans are likely to
generate are going to break the system
so the problem is that there is a long tail, yes. This is an issue that a lot of people have realized, you know, in social networks and stuff like that, which is: there's a very, very long tail of things that people will ask, and you can fine-tune the system for the 80% or whatever of the things that most people will ask, and then this long tail is so large that you're not going to be able to fine-tune the system for all the conditions. And in the end the
system ends up being kind of a giant lookup table, right, essentially, which is not really what you want. You want systems that can reason, systems that can plan.
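A toy Python illustration of that long-tail point, assuming (purely for illustration) that prompt popularity follows a Zipf distribution over a million prompt types:

    import numpy as np

    # The "head" needed to cover 80% of traffic is modest, but the
    # uncovered tail of prompt types remains enormous.
    ranks = np.arange(1, 1_000_001)
    freq = 1.0 / ranks                      # Zipf-like popularity
    share = np.cumsum(freq) / freq.sum()
    head = int(np.searchsorted(share, 0.80)) + 1
    print(head, len(ranks) - head)          # covered head vs. remaining tail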
So the type of reasoning that
takes place in LLMs is very, very primitive, and the reason you can tell it's primitive is because the amount of computation that is spent per token produced is constant. So if you ask a question, and that question has an answer in a given number of tokens, the amount of computation devoted to computing that answer can be exactly estimated. It's, like, you know,
the size of the prediction network, with its 36 layers or 92 layers or whatever it is, multiplied by the number of tokens; that's it. And so essentially it doesn't matter if the question being asked is simple to answer, complicated to answer, or impossible to answer because it's undecidable or something: the amount of computation the system will be able to devote to the answer is constant, or is proportional to the number of tokens produced in the answer.
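A hedged sketch of that fixed-compute point in Python; the layer count and per-layer cost below are illustrative stand-ins, not measurements of any real model:

    # For a standard transformer decoder, generation cost is roughly
    # layers x (cost per layer per token) x tokens; question difficulty
    # appears nowhere in the formula.
    def decode_flops(n_layers: int, flops_per_layer_per_token: float,
                     n_tokens: int) -> float:
        return n_layers * flops_per_layer_per_token * n_tokens

    easy = decode_flops(92, 1e9, 20)   # trivial question, 20-token answer
    hard = decode_flops(92, 1e9, 20)   # undecidable question, same length
    assert easy == hard                # difficulty never enters into it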
Right, this is not the way we work. The way we reason
is that when we're faced with a complex
problem or a complex question we spend
more time trying to solve it and answer
it right because it's more difficult
there's a prediction element, there's an iterative element where you're, like, adjusting your understanding of a thing by going over it over and over, there's a hierarchical element, and so
on. Does this mean that it's a fundamental flaw of LLMs, or does it mean that... there's more parts to that question. Now you're just behaving like an LLM: immediately answering. No, that it's just the low-level world model, on top of which we can then build some of these kinds of mechanisms, like you said: persistent long-term memory, or reasoning, and so on. But we need that world model that comes from language. Maybe it is not so difficult to build this kind of reasoning system on top of a well-constructed world model? Okay,
whether it's difficult or not, the near future will tell, because a lot of people are working on reasoning and planning abilities for dialog systems. I mean, even if we
restrict ourselves to
language uh just having the ability to
plan your answer before you
answer uh in terms that are not
necessarily linked with the language
you're going to use to produce the
answer right so this idea of this mental
model that allows you to plan what
you're going to say before you say it
um that is very important I think
there's going to be a lot of systems
over the next few years that are going
to have this capability but the
blueprint of those systems would be
extremely different from autoregressive
LMS so so
um it's the same difference as the
difference between what psychology is
called system one and system two in
humans right so system one is the type
of task that you can accomplish without
like deliberately consciously think
about how you do them you just do them
you've done them enough that you can
just do it subconsciously right without
thinking about them if you're an
experience driver you can drive without
really thinking about it and you can
talk to someone at the same time or
listen to the radio
right um if you are a very experienced
chess player you can play against a
non-experienced chess player without
really thinking either you just
recognize the pattern and you play MH
right. That's system 1: all the things that you do instinctively without really having to deliberately plan and think about it. And then there are all the tasks where you need to plan. So if you are a not-too-experienced chess player, or you are experienced and you play against another experienced chess player, you think about all kinds of options,
right. You think about it for a while, right, and you're much better if you have time to think about it than you are if you play blitz with limited time. So this type of deliberate planning, which uses your internal world model, that's system 2; this is what LLMs currently cannot do. So
how how do we get them to do this right
how do we build a system that can do
this kind of uh planning that or
reasoning that devotes more resources to
complex problems than to simple problems
and it's not going to be Auto regressive
prediction of tokens it's going to be
more something akin to inference of latent variables in, you know, what used to be called probabilistic models, or graphical models, and things of that type.
So basically the principle is like this: the prompt is like an observed variable, mhm, and what the model does is that it's basically a measure; it can measure to what extent an answer is a good answer for a prompt.
okay so think of it as some gigantic
neural net but it's got only one output
and that output is a scalar number which
is let's say zero if the answer is a
good answer for the question and a large
number if the answer is not a good
answer for the question. Imagine you had this model; if you had such a model, you could use it to produce good answers. The way you would do it is, you know, produce the prompt and then search through the space of possible answers for one that minimizes that number. That's called an energy-based model.
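A minimal Python sketch of that idea; the dummy energy function and the tiny candidate list are invented placeholders, standing in for a trained scorer and a real search:

    # E(prompt, answer) ~ 0 for good answers, large otherwise; producing
    # an answer means searching for the minimizer.
    def energy(prompt: str, answer: str) -> float:
        return 0.0 if answer.endswith(".") else 10.0   # stand-in scorer

    def produce_answer(prompt: str, candidates: list[str]) -> str:
        return min(candidates, key=lambda a: energy(prompt, a))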
But that energy-based model would need the model constructed by the LLM. Well, so really what you would need to do would be to not search over possible strings of text that minimize that energy; what you would do is do this in abstract representation space,
so in sort of the space of abstract thoughts you would elaborate a thought, right, using this process of minimizing the output of your model, okay, which is just a scalar. It's an optimization process, right. So now the way the system produces its answer is through optimization, by, you know, minimizing an objective function, basically, right.
And this is, we're talking about inference, we're not talking about training, right; the system has been trained already. So now we have an abstract representation of the thought of the answer, the representation of the answer; we feed that to basically an autoregressive decoder, which can be very simple, that turns this into a text that expresses this thought. Okay, so that, in my opinion, is the blueprint of future dialog systems: they will think about their answer, plan their answer by optimization before turning it into
text. And that is Turing complete? Can you
explain exactly what the optimization
problem there is like what's the
objective function? Just linger on it; you kind of briefly described it, but over what space are you optimizing? The space of representations? Those abstract representations. Abstract representations; so you have
an abstract representation inside the
system you have a prompt The Prompt goes
through an encoder produces a
representation perhaps goes through a
predictor that predicts a representation
of the answer of the proper answer but
that representation may not be a good answer, because there might be some complicated reasoning you need to do, right. So then you have another process that takes the representation of the answer and modifies it so as to
minimize uh a cost function that
measures to what extent the answer is a
good answer for the question. Now we sort of ignore for a moment the issue of how you train that system to measure whether an answer is a good answer for a question; but suppose such a system could be created, what's the process, this kind of search-like process? It's an optimization
process. You can do this if the entire system is differentiable: that scalar output is the result of, you know, running the answer, the representation of the answer, through some neural net, mhm, and then by gradient descent, by back-propagating gradients, you can figure out how to modify the representation of the answer so as to minimize that. So that's still gradient-based? It's gradient-based inference. So now you have a representation of the answer in abstract space, and now you can turn it into text, right.
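A toy sketch of gradient-based inference, assuming a simple differentiable quadratic energy; the matrix stands in for a network that has already been trained:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(16, 16)) / 4.0   # frozen, "already trained" net
    x = rng.normal(size=16)               # encoded prompt

    z = np.zeros(16)                      # representation of the answer
    for _ in range(500):
        grad = 2.0 * W.T @ (W @ z - x)    # dE/dz for E(z) = ||W z - x||^2
        z -= 0.05 * grad                  # refine z by gradient descent,
                                          # then decode z to text afterwards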
And the cool thing about this is that the representation now can be optimized through gradient descent, and it is independent of the language in which you're going to express the answer, right. So you're operating in the abstract representation. I mean, this
goes back to the joint embedding, right: that it is better to work in the space of, I don't know, to romanticize the notion, the space of concepts, versus, yeah, the space of concrete sensory information,
right. Okay, but can this do something like reasoning, which is what we're talking about? Well, not really, only in a very simple way. I mean, basically you can think of those things as doing the kind of optimization I was talking about, except they optimize in a discrete space, which is the space of possible sequences of tokens, and they do this optimization in a horribly inefficient way, which is: generate a lot of hypotheses and then select the best ones. And that's
incredibly wasteful in terms of computation, because you basically have to run your LLM for, like, every generated sequence, and it's incredibly wasteful. So it's much better to do an optimization in continuous space where you can do gradient descent, as opposed to, like, generate tons of things and then select the best: you just iteratively refine your answer to go towards the best, right. That's much more efficient. You can only do this in continuous spaces with differentiable
functions. You're talking about the reasoning, like, the ability to think deeply, or to reason deeply. How do you know what is an answer that's better or worse, based
on deep reasoning? Right, so then we're asking the question of, conceptually, how do you train an energy-based model. Right, so an energy-based model is a function with a
scalar output just a
number you give it two inputs X and Y
and it tells you whether Y is compatible with X or not. X you observe: let's say it's a prompt, an image, a video, whatever; and Y is a proposal for an answer, a continuation of the video, you know, whatever. And it tells you whether Y is compatible with X, and the way it tells you that Y is compatible with X is that the output of that function would be zero if Y is compatible with X; it would be a positive number, non-zero, if Y is not compatible with X. Okay, how do you
train a system like this? At a completely general level, you show it pairs of X
and Y that are compatible a question and
the corresponding answer and you train
the parameters of the big neural net inside it to produce zero. Okay, now that doesn't completely
work because the system might decide
well I'm just going to say zero for
everything so now you have to have a
process to make sure that for a wrong Y the energy would be larger than zero.
And there you have two options. One is contrastive methods: so a contrastive method is, you show an X and a bad Y, and you tell the system, well, you know, give a high energy to this; like, push up the energy, right: change the weights in the neural net that computes the energy so that it goes up. So that's contrastive methods. The
problem with this is, if the space of Y is large, the number of such contrastive samples you're going to have to show is gigantic.
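A hedged sketch of one such contrastive update in Python; a hinge of this shape is one common choice, not the only one:

    # Push E(x, y_good) toward zero; push E(x, y_bad) above a margin.
    def contrastive_loss(e_good: float, e_bad: float,
                         margin: float = 1.0) -> float:
        return e_good + max(0.0, margin - e_bad)

    # Minimizing this over the energy net's weights handles one mismatched
    # y at a time, which is why a huge space of y's needs a huge number of
    # contrastive samples.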
But people do this; they do this when you train a system with RLHF:
basically what you're training is what's
called a reward model which is basically
an objective function that tells you
whether an answer is good or bad, and that's basically exactly what this is. So we already do this to some extent;
we're just not using it for inference
we're just using it for training um uh
There is another set of methods which are non-contrastive, and I prefer those. And those non-contrastive methods basically say:
okay, the energy function needs to have low energy on pairs of X and Y that are compatible, that come from your training set; how do you make sure that the energy is going to be higher everywhere else? And the way you do this is by having a regularizer, a criterion, a term
in your cost function that basically
minimizes the volume of space that can
take low
energy. And there are all kinds of different specific ways to do this, depending on the architecture, but that's the basic principle: if you push down the energy function for particular regions in the X-Y space, it will automatically go up in other places, because there's only a limited volume of space that can take low energy, okay, by the construction of the system or by the regularizing function.
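A hedged sketch of a regularized, non-contrastive loss, loosely in the spirit of variance-style anti-collapse criteria (e.g. VICReg); the exact term below is illustrative:

    import numpy as np

    # Only compatible pairs are shown; the regularizer keeps the
    # representations spread out so low energy cannot cover everything.
    def regularized_loss(e_good: float, reps: np.ndarray,
                         floor: float = 1.0) -> float:
        variance = reps.var(axis=0)                      # per-dimension spread
        anti_collapse = np.maximum(0.0, floor - variance).mean()
        return e_good + anti_collapse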
We've been talking very generally, but what is
a good X and a good Y what is a good
representation
of X and Y because we've been talking
about language and if you just take
language directly that presumably is not
good so there has to be some kind of
abstract representation of
ideas. Yeah, so, I mean, you can do this with language directly, by just, you know: X is a text and Y is a continuation
of that text, yes, or X is a question and Y is the answer. But you're saying that's not going to take it; I mean, that's going to do what LLMs are doing. Well, no, it depends on how
the internal structure of the system is built. If the internal structure of the system is built in such a way that inside of this system there is a latent variable, call it Z, that you can manipulate so as to minimize the output energy, then that Z can be viewed as a representation of a good answer that you can translate into a Y that is a good answer.
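A minimal sketch of that pipeline, with every component an assumed placeholder:

    # Encode the prompt, pick the latent z that minimizes the energy,
    # then decode z into text.
    def respond(prompt, encode, energy, optimize, decode):
        x = encode(prompt)                    # observed variable
        z = optimize(lambda z: energy(x, z))  # inference as minimization
        return decode(z)                      # turn the latent into a Y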
So this kind of system could be trained in a very similar way, but you have to have a way of
preventing collapse, of ensuring that, you know, there is high energy for things you don't train it on. And currently, in LLMs, it's very implicit; it's done in a way that people don't realize is being done, but it is being done. It's due to the fact that
when you give a high probability to a word, automatically you give low probability to other words, because you only have a finite amount of probability to go around, right; they have to sum to one. So when you minimize the cross-entropy or whatever, when you train your LLM to predict the next
word, you're increasing the probability your system will give to the correct word, but you're also decreasing the probability it will give to the incorrect words. Now, indirectly, that gives a high probability to sequences of words that are good and a low probability to sequences of words that are bad, but it's very indirect, mhm. And it's not obvious why this actually works at all, because you're not doing it on a joint probability of all the symbols in a sequence; you sort of factorize that probability in terms of conditional probabilities over successive tokens.
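A small numeric check of that normalization effect, using an arbitrary three-token vocabulary:

    import numpy as np

    # Probabilities sum to one, so raising the correct token's share
    # must lower every other token's share.
    logits = np.array([2.0, 0.5, 0.1])
    p = np.exp(logits) / np.exp(logits).sum()
    logits[0] += 1.0                      # a step favoring the correct token
    q = np.exp(logits) / np.exp(logits).sum()
    assert q[0] > p[0] and q[1] < p[1] and q[2] < p[2]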
So how do you do this for visual data? So we've been doing this with the JEPA architectures, basically, the joint embedding. So there, the compatibility between two things is, you know: here's an image or a video, here's a corrupted, shifted, or transformed version of that image or video, or a masked version. Okay, and then
the energy of the system is the prediction error of the representation: the predicted representation of the good thing versus the actual representation of the good thing, right. So you run the corrupted image through the system, predict the representation of the good, uncorrupted input, and then compute the prediction error; that's the energy of
the system. So this system will tell you, you know, if this is a good image and this is a corrupted version: it will give you zero energy if those two things are such that one of them is effectively a corrupted version of the other, and give you a high energy if the two images are completely different. And hopefully that whole process gives you a really nice compressed representation of reality, of visual reality; and we know it does, because then we use those representations as input to a classification system, and that system works really nicely.
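A toy version of that JEPA-style energy in Python; the random encoder and predictor are placeholders standing in for trained networks:

    import numpy as np

    rng = np.random.default_rng(0)
    enc = rng.normal(size=(8, 32)) / np.sqrt(32)    # encoder (toy)
    pred = rng.normal(size=(8, 8)) / np.sqrt(8)     # predictor (toy)

    def jepa_energy(x_clean: np.ndarray, x_corrupt: np.ndarray) -> float:
        s_target = enc @ x_clean               # representation of clean input
        s_pred = pred @ (enc @ x_corrupt)      # predicted from corrupted view
        return float(((s_pred - s_target) ** 2).sum())   # prediction error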
Okay, well, so to summarize: you recommend, in a spicy way that only Yann LeCun can, that we abandon generative models in favor of joint-embedding architectures? Yes. Abandon autoregressive generation? Yes. Abandon, pro...
this feels like a court testimony uh
abandon probabilistic models in favor of
energy based models as we talked about
abandon contrastive methods in favor of
regularized methods and uh let me ask
you about this: you've been, for a while, a critic of reinforcement learning. Yes. So the last recommendation is that we abandon RL in favor of model predictive control, as you were talking about, and only use RL when planning doesn't yield the predicted outcome, and we use RL in that case to adjust the world model or the critic. Yes. So
you mentioned RLHF, reinforcement learning with human feedback; why do you still hate reinforcement learning? I don't hate reinforcement learning, and I think it should not be abandoned completely, but I think its
use should be minimized because it's
incredibly inefficient in terms of
samples. And so the proper way to train a system is to first just have it learn good representations of the world, and world models, from mostly observation, maybe a little bit of interactions, and then be steered based on that; if the representations are good, then the adjustments should be minimal. Yeah. Now, there's two things you can do: if you've learned a world model, you can use the world model to plan a sequence of actions to arrive at a particular objective.
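A hedged sketch of that kind of planning (model-predictive control) in Python; world_model, objective, and the candidate plans are all assumed inputs:

    # Roll candidate action sequences through the learned world model,
    # score them against the objective, execute the best first action,
    # then replan from the new observation.
    def plan_first_action(world_model, objective, state, candidate_plans):
        def cost(plan):
            s, total = state, 0.0
            for action in plan:
                s = world_model(s, action)   # predicted next state
                total += objective(s)        # distance from the goal
            return total
        best = min(candidate_plans, key=cost)
        return best[0]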
You don't need RL, unless the way you measure whether you succeed might be inexact: your idea of, you know, whether you're going to fall off your bike might be wrong, or the person you're fighting in MMA is going to do one thing and then does something else.
So there's two ways you can be wrong: either your objective
function does not reflect the actual
objective function you want to optimize
or your world model is
inaccurate, right: the prediction you were making about what was going to happen in the world is inaccurate. So if you want to adjust your world model while you are operating in the world, or your objective function, that is basically in the realm of RL; this is what RL deals with, to some extent, right. So, adjusting your world model:
the way to adjust your world model, even in advance, is to explore parts of the space where you know that your world model is inaccurate. That's called curiosity, basically, or play, right: when you play, you kind of explore parts of the state space that, you know, you don't want to do for real, because it might be dangerous, but you can adjust your world model without killing yourself, basically. So that's what you want to use RL for:
when it comes time to learning a particular task, you already have all the good representations, you already have your world model, but you need to adjust it for the situation at hand; that's when you use RL. Why do you
think RLHF works so well, this reinforcement learning with human feedback? Why did it have such a transformational effect on large language models? What had the transformational effect is human feedback; there's many ways to use it, and some of it is just purely supervised, actually; it's not really reinforcement learning. So it's the HF? It's the HF, yeah. And then there are ways to use
human feedback, right. So you can ask humans to rate answers, multiple answers that are produced by the model, and then what you do is you train an objective function to predict that rating, and then you can use that objective function to predict, you know, whether an answer is good, and you can back-propagate gradients through this to fine-tune your system so that it only produces highly rated answers. Okay, so that's
one way. So that's, like, in RL terms, training what's called a reward model, right: something that is, you know, basically a small neural net that estimates to what extent an answer is good, right. It's very similar to the objective I was talking about earlier for planning, except now it's not used for planning; it's used for fine-tuning your system.
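A minimal sketch of such a reward model, with invented features and ratings standing in for real (prompt, answer) data:

    import numpy as np

    rng = np.random.default_rng(0)
    feats = rng.normal(size=(100, 8))       # features of (prompt, answer) pairs
    ratings = rng.uniform(0, 1, size=100)   # human ratings of those answers
    w, *_ = np.linalg.lstsq(feats, ratings, rcond=None)  # fit a linear scorer

    def reward(answer_features: np.ndarray) -> float:
        # Differentiable score, so gradients can flow back into a generator.
        return float(answer_features @ w)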
I think it would be much more efficient to use it for planning, but currently it's used to fine-tune the parameters of the system. Now,
there's several ways to do this um you
know, some of them are supervised: you just, you know, ask a human, like, what is a good answer for this, right, then you just type the answer. I mean, there's lots of ways that those systems are being adjusted. Now, a lot of people have been
very critical of the recently released
Google's Gemini
1.5 for, essentially, in my words I could say, being super woke, woke in the negative connotation of that word. There are
some almost hilariously absurd things that it does, like it modifies history, like generating images of a Black George Washington, or, perhaps more seriously, something that you commented on on Twitter, which is refusing to comment on, or generate images of, or even descriptions of, Tiananmen Square or the Tank Man, one of the most legendary protest images in history. And of course these
images are highly censored by the
Chinese government and therefore
everybody starts asking questions of, what is the process of designing these LLMs, what is the role of censorship in these, all that kind of stuff. So you
commented on Twitter saying that open
source is the answer yeah essentially so
um can you
explain? I actually made that comment on just about every social network I can, and I've made that point multiple times in various forums.
Here's my point of view on this.
people can complain that AI systems are
biased and they generally are biased by
the distribution of the training data
that they've been trained on um that
reflects biases in
society um and that is potentially
offensive to some
people or potentially not and and some
techniques to
debias then become offensive to some
people um because of you know historical
uh incorrectness and things like that
um and so you can ask the question you
can ask two questions the first question
is is it possible to produce an AI
system that is not biased and the answer
is absolutely not and it's not because
of
technological uh challenges although
there are technological challenges to
that it's
because bias is in the eye of the
beholder um different people may have
different ideas about what constitutes
bias. You know, for a lot of things, I mean, there are facts that are, you know, indisputable, but there are a lot of opinions, or things that can be expressed in different ways, and so
you cannot have an unbiased system
that's just an
impossibility. And so what's the answer to this? The answer is the same answer that we found in liberal democracy about the press: the press needs to be free and diverse. We have free speech for a good reason: it's because we don't want all of our information to come from a unique source, because that's
opposite to the whole idea of democracy
and uh you know progress of ideas and
even science, right. In science, people have to argue for different opinions, and science makes progress when people disagree and they come up with an answer and, you know, a consensus forms, right. And
it's true in all democracies around the
world so there is a
future, which is already happening, where every single one of our interactions with the digital world will be mediated by AI systems, AI assistants, right. We're going to have smart glasses; you can already buy them from Meta, the Ray-Ban Meta, where, you know, you can talk to them, and they are connected with an LLM, and
you can get answers on any question you
have. Or you can be looking at a monument, and there is a camera in the system, in the glasses; you can ask it, like, what can you tell me about this building or this monument? You can be looking at a menu in a foreign language, and the thing will translate it for you, or it can do real-time translation if we speak different
languages so a lot of our interactions
with the digital world are going to be
mediated by those systems in the near
future
um you know increasingly the search
engines that we're going to use are not
going to be search engines; they're going to be dialog systems that you just ask a question, and it will answer and then point you to perhaps the appropriate references for it. But here is the thing:
we cannot afford those systems to come
from a handful of companies on the west
coast of the
US because those systems will constitute
the repository of all human knowledge
and we cannot have that be controlled by
a small number of people right it has to
be diverse for the same reason the Press
has to be
diverse. So how do we get a diverse set of AI assistants? It's very expensive and difficult to train a base model, right, a base LLM, at the moment; you know, in the future it might be something different, but at the moment that's an LLM. So only a few companies can do this properly. And
if some of those top systems are open source, anybody can use them, anybody can fine-tune them. If we put in place some systems that allow any group of people, whether they are individual citizens, groups of citizens, government organizations, NGOs, companies, whatever, to take those
open source
uh systems AI systems and fine-tune them
for their own purpose on their own
data then we're going to have a very
large diversity of uh different AI
systems that are specialized for all of
those things right so I tell you I
talked to the French government quite a
bit, and the French government will not accept that the digital diet of all their citizens be controlled by three
companies on the west coast of the US
that's just not acceptable it's a danger
to democracy regardless of how
well-intentioned those companies are
right um and so uh and it's also a
danger to local culture to values to
language, right. I was talking with the founder of Infosys in India; he's funding a project to fine-tune Llama 2, the open source model produced by Meta, so that Llama 2 speaks all 22 official languages of India. It's very important for people in India. I was talking to a former colleague of mine, Moustapha, who used to be a scientist at FAIR and then moved back to Africa, created a research lab for Google in Africa, and now has a new startup, Kera. And what
he's trying to do is basically have an LLM that speaks the local languages in Senegal, so that people can have access to medical information, because they don't have access to doctors; there's a very small number of doctors per capita in Senegal. I mean, you can't have any of
this unless you have open source
platforms so with open source platforms
you can have AI systems that are not only diverse in terms of political opinions or things of that type, but in terms of language, culture, value systems, political opinions, technical abilities in various
domains and you can have an industry an
ecosystem of companies that fine-tune
those open source systems for vertical
applications in industry, right. You have, I don't know, a publisher that has thousands of books, and they want to build a system that allows a customer to just ask a question about the content of any of their books; you need to train on their proprietary data, right. You have a company, we have one within Meta, it's called Metamate, and it's basically an LLM that can answer any question about internal stuff about the company. Very useful. A
lot of companies want this right a lot
of companies want this not just for
their employees but also for their
customers to take care of the customers
so the only way you're going to have an
AI industry the only way you're going to
have ai systems that are not uniquely
biased is if you have open source platforms on top of which any group can build specialized systems.
So the inevitable direction of history is that the vast majority of AI systems will be built on top of open source platforms. So that's a
beautiful vision. So, meaning, a company like Meta or Google or so on should take only minimal fine-tuning steps after building the foundation pre-trained model, as few steps as possible, basically. Can Meta afford to do that? No. So, I don't know if you know this, but companies are supposed to make money somehow, and open source is, like,
giving it away. I don't know, Mark made a video, Mark Zuckerberg, a very sexy video, talking about 350,000 Nvidia H100s. Yeah, the math of that, just for the GPUs, that's 100 billion, plus the infrastructure for training everything. So I'm no business guy, but how do you make money on that?
The vision you paint is a really powerful one, but how is it possible to make money? Okay, so you have several business models, right. The business model that Meta is built around is: you offer a service, and the financing of that service is either through ads or
through business customers so for
example if you have an llm that uh you
know can help a mom and pop pizza place
um by you know talking to their
customers through WhatsApp and so the
customers can just order a pizza and the
system will just, you know, ask them, like, what toppings do you want or what sides, blah blah blah; the business will pay
for that okay that's a
model
And otherwise, you know, if it's a system that is on the more kind of classical services side, it can be ad-supported, or, you know, there's several
models but the point
is uh if you have a big enough um
potential customer base and you need to
build that
system anyway for
them it doesn't hurt you to actually
distribute it in open source again I'm
no business guy but if you release the
open source model then other people can
do the same kind of
task and compete on it, basically provide fine-tuned models for businesses. Is the bet that Meta is making, by the way, I'm a huge fan of all this, but is the bet that Meta is making, like, we'll do a better job of it? Well, no. The bet
is more: we already have a huge user base and customer base. Ah, right, right. So it's going to be useful to them; whatever we offer them is going to be useful, and there is a way to derive revenue from this. And it doesn't hurt that, you know, we provide that system, or the base model, right, the foundation model, in open source,
for others to build applications on top
of it too. If those applications turn out to be useful for our customers, we can just buy it from them. It could be that they will improve the platform; in fact, we see this already. I mean, there are, you know, literally millions of downloads of Llama 2 and thousands of
people who have you know provided ideas
about how to make it better um so you
know this this clearly accelerates
progress to make the system available to
a sort of a a wide community of people
and and there is literally thousands of
businesses who are building applications
with it so um
so our ability to meta's ability to
derive revenue from this technology is
not impaired
uh by the distribution of it of based
models in open source the fundamental
criticism that Gemini is getting is that
as you point out on the west coast just
to just to clarify we're currently in
the east coast where I would suppose
meta AI headquarters would
be so there uh strong words about the
West Coast but uh I guess the issue that
happens is I think it's fair to say that
most tech people have
a political affiliation with the left
wing; they lean left. And so the problem that people are criticizing Gemini with is that, in that debiasing process that you mentioned, their ideological lean becomes obvious. Is this something that could be escaped? You're saying open source is the only way. Have you witnessed this kind of ideological lean that makes
engineering difficult? No, I don't think the issue has to do with the political leaning of the people designing those systems; it has to do with the acceptability, or political leanings, of their customer base, their audience, right. So a big company
cannot afford to offend too many people
so they're going to make sure that
whatever product they put out is safe
whatever that means
And it's very possible to overdo it, and it's also impossible to do it properly for everyone; you're not going to satisfy everyone. So that's what I said before: you cannot have a system that is unbiased and is perceived as unbiased by everyone. It's going to be, you know, you
push it in one way one set of people are
going to see it as biased and then you
push it the other way, and another set of people is going to see it as biased.
and then in addition to this there's the
issue of if you push the system perhaps
a little too far in One Direction it's
going to be non-factual, right; you're going to have, you know, Black Nazi soldiers in the... We should mention: image generation of Black Nazi soldiers, which is not factually accurate, right, and can be offensive for some people as well, right.
so
uh so you know it's going to be
impossible to kind of produce systems
that are unbiased for everyone so the
only solution that I see is diversity
and diversity in the full meaning of that word, diversity in every possible
way. Yeah. Marc Andreessen just tweeted today; let me do a TL;DR. The conclusion is: only startups and open source can avoid the issue that he's highlighting with big tech. He's asking, can big tech actually field generative AI products? One: ever-escalating demands from internal activists, employee mobs, crazed executives, broken boards, pressure groups, extremist regulators, government agencies, the press, in quotes, experts, and everything, corrupting the output. Two:
constant risk of generating a bad answer, or drawing a bad picture, or rendering a bad video; who knows what it is going to say or do at any moment. Three: legal exposure, product liability, slander, election law, many other things and so on, anything that makes Congress mad. Four: continuous attempts to tighten the grip on acceptable output degrade the model, like how good it actually is in terms of being usable and pleasant to use and effective and all that kind of stuff. And five: publicity of bad text, images, video actually puts those examples into the training data for the next version, and so on.
so he just highlights how difficult this
is from all kinds of people being
unhappy as you said you can't create a
system that makes everybody happy. Yes. So if you're going to do the fine-tuning yourself and keep it closed source, essentially the problem there is then trying to minimize the number of people who are going to be unhappy, yes, and you're saying that's almost impossible to do, right, and the better way is to do open source. Basically, yeah. I mean, Mark is
right about a number of things that he lists that indeed scare large companies: you know, certainly congressional investigations is one of them, legal liability, you know, making things that get people to hurt themselves or hurt others. Big companies are really careful about not producing things of this type, because, you know, they don't want to hurt anyone, first of all, and then second, they want to preserve their business. So it's essentially
impossible for systems like this that
can inevitably formulate political
opinions and you know opinions about
various things that may be political or
not but that people may disagree about
about you know moral issues and you know
um things about like questions about
religion and things like that right or
or cultural issues that people from
different communities would disagree
with in the first place um so there's
only kind of a relatively small number of things that people will sort of agree on, you know, basic principles; but beyond that, if you want those systems to be useful, they will necessarily have to offend a number of people,
inevitably. And so open source is just better, and then diversity is better, right. And open source enables diversity. That's right, open source enables diversity. It's going to be a fascinating world where, if it's true that the open source world, if Meta leads the way and creates this kind of open source foundation model world, there's going to be, like, governments will have a fine-tuned model, and yeah, and then potentially, you know, people that vote left and right will have their own model and preference and be able to choose, and
it will potentially divide us even more
but that's on us humans we get to figure
out basically the technology enables
humans to human more effectively and all
the difficult ethical questions that humans raise, it'll just leave it up to us to figure out. Yeah, I mean,
there are some limits to, you know, the same way there are limits to free speech, there has to be some limit to the kind of stuff that those systems might be authorized to produce, you know, some guard
rails so I mean that's one thing I've
been interested in which is uh in the
type of architecture that we were
discussing before where the output of a
system is the result of an inference to
satisfy an objective that objective can
include guard
rails and uh we can put guard rails in
open source systems I mean if we
eventually have systems that are built
with this blueprint uh we can put guard
rails uh in those systems that guarantee
that there is sort of a minimum set of
guardrails that make the system non-dangerous and non-toxic, etc., you know,
basic things that everybody would agree
on. And then, you know, the fine-tuning that people will add, or the additional guardrails that people will add, will kind of cater to their community, whatever it is. And yeah, the fine-tuning will be more about the gray areas of what is hate speech, what is dangerous, and all that kind of stuff. I mean, you have different value systems. Value systems, I mean, like. But still,
even with the objectives of how to build
a bioweapon for example I think
something you've commented on or at
least there's a
paper where a collection of researchers
is trying to understand the social
impacts of these
LLMs, and I guess one threshold that's nice is, like: does the LLM make it any easier than a search would, like a Google search would? Right, so the increasing
number of studies on this seems to point
to the fact that it doesn't help so
having an llm doesn't help you design or
build a bioweapon or a chemical weapon
if you already have access to uh you
know a search engine and a library uh
and so the sort of increased information you get, or the ease with which you get it, doesn't really help you.
um that's the first thing the second
thing is it's one thing to have a list
of instructions of how to make a
chemical weapon for example or bioweapon
it's another thing to actually build it
and it's much harder than you might think, and an LLM will not help you with that. In fact, you know, nobody in the
world not even like you know countries
use bioweapons because most of the time
they have no idea how to protect their
own populations against it so um so it's
too dangerous actually to kind of ever
use um and it's in fact banned by uh
International treaties um chemical
weapons is different it's also banned by
treaties U but um uh but it's the same
problem it's difficult to use in
situations that doesn't turn against the
perpetrators but we could ask you on
musk like I can I can give you a very
precise list of instructions of how you
build a rocket engine M and even if you
have a team of 50 Engineers that of re
experienc building it you're still going
to have to blow up a dozen of them
before you get when that
works um and you know it's the same with
uh you know chemical weapons or biow
weapons or things like this it requires
expertise you know in the in the real
world that n is not going to help you
with and it requires even the common
sense expertise that we've been talking
about which is how to take uh
language-based instructions and
materialize them in the physical world
requires a lot of knowledge that's not
in the instructions yeah exactly a lot
of biologists have posted on this
actually in response to those things
saying like do you realize how hard it
is to actually do the lab work; you know, this is not trivial. Yeah, and that's where Hans Moravec comes to light once again. Just to linger on Llama: you know, Mark announced
that Llama 3 is coming out eventually; I don't think there's a release date. But what are you most excited about? First of all, Llama 2 that's already out there, and maybe the future Llama 3, 4, 5, 6, 10, just the future of open source under Meta. Well, a number of things. So
there's going to be, like, various versions of Llama that are, you know, improvements of previous Llamas: bigger, better, multimodal, things like that. And
then in future Generations systems that
are capable of planning that really
understand how the world Works um maybe
are trained from video so they have some
World model maybe you know capable of
the type of reasoning and planning I was
talking about earlier. Like, how long is that going to take? Like, when is the research that's being done in that direction going to sort of feed into the product line, if you will, of Llama? I don't know, I can't tell you, and there's, you
know a few breakthroughs that we have to
basically uh go through before we can
get there but you'll be able to monitor
our progress because we publish our
research right so you know if last week
we published the Via work which is sort
of a first step towards Training Systems
from video um and then the next step is
going to be World models based on on
kind of this type of idea training
training from video there similar work
at at Deep Mind also and um the taking
place people and also at UC brookley on
uh World models from video a lot of
people are working on this I think a lot
of good ideas are coming are appearing
my bet is that those systems are going
to be Jep alike they're not going to be
gener generative models um and uh we'll
see what the future will tell um there's
really good work at uh um a gentleman
called danar Hafner who is not Deep Mind
who who's worked on kind of models of
this type that learn representations and
then use them for planning or learning
tasks by reinforcement running um and a
lot of work at brookley by Peter iil S
leine bunch of other people of that type
uh I'm collaborating with actually in
the context of some grants with my NYU
hat um
and then collaborations also through
meta because the the lab at brookley is
associated with meta in some way so with
fair so I I think uh it's very exciting
you know I I think I'm super excited
about I I haven't been that excited
about like the direction of machine
learning and AI you know since uh you
know 10 years ago when Fairway started
and before that um 30 years ago when we
working on 35 on on com Nets and and and
the early days of neural net so um I'm
super excited because I see a path
towards potentially human level
intelligence uh with you know systems
that can understand the world remember
plan, reason. There is some set of ideas to make progress there that might have a chance of working, and I'm
really excited about this. What I like is that, you know, somewhere we get onto, like, a good direction and perhaps succeed before my brain turns to a white sauce, or before I need to retire. Yeah, yeah. Are you also excited by, is it beautiful to you, just the amount of GPUs involved, sort of
the whole training process on this much
compute it's just zooming out just
looking at Earth and humans together
have built these Computing devices
and are able to train this one
brain, and then we open-source it, like giving birth to this open-source brain trained on this gigantic compute system. There's just the details of how to train on that, how to build the infrastructure and the hardware, the cooling, all of this kind of stuff. Or is most of your excitement still in the theory aspect of it, meaning, like, the software?
well I used to be a hardware guy many
years ago yes yes that's decades ago
Hardware has improved a little bit
changed a little bit. Yeah, I mean, certainly scale is necessary but not sufficient. Absolutely. So we certainly need computation; I mean, we're still far, in terms of compute
power uh from you know what we would
need to match the compute power of the
human brain um you know this may occur
in the next couple decades but um but
we're still some ways away and certainly
in terms of power efficiency were really
far um so a lot of progress to make in
uh in in in hardware and you know right
now a lot of progress is is is not I
mean there's a bit coming from Silicon
technology but a lot of it coming from
architectural Innovation and quite a bit
coming from uh like more efficient ways
of you know implementing the
architectures that have become popular
basically combination of Transformers
and cets right and uh so you know
there's still some ways to go
until uh we're going to saturate we're
going to have to come up with like new
new principles new fabrication
technology new uh basic
components um perhaps you know based on
sort of different principles than those
classical digital semas interesting so
you
think in order to build Ami M me we need
we potentially might need some Hardware
Innovation too well if you want to make
it um ubiquitous yeah certainly because
we're going to have to reduce the you
know comput the power consumption a GPU
today right is half a kilowatt to a
kilowatt human brain is about 25
wats uh and the GPU is way below the
power of human brain you need you know
something like a 100,000 or million to
match it so uh so you know we're off by
huge Factor
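A back-of-envelope check of that power gap, using the figures as stated:

    gpu_watts = 700            # "half a kilowatt to a kilowatt" per GPU
    brain_watts = 25
    gpus_to_match = 100_000    # low end of the 100,000-to-a-million estimate
    print(gpu_watts * gpus_to_match / brain_watts)   # ~2.8 million x the brain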
You often say that
AGI is not coming soon meaning like not
this year not the next few years
potentially farther away what's your
basic intuition behind that so first of
all it's not going to be an event right
the idea somehow which you know is
popularized by science fiction and
Hollywood that you know somehow somebody
is going to discover the secret the
secret to a gii or human level AI or Ami
whatever you want to call it and then
you know turn on a machine and then we
have a gii that's just not going to
happen it's not going to be an
event it's going to be gradual
progress are we going to have systems
that can learn from video how the world
works and learn good World presentations
yeah before we get them to the scale and
performance that we observe in humans
it's going to take quite a while it's
not going to happen in one day. Are we going to get systems that can have large amounts of associative memory so they can remember stuff? Yeah, but same: it's not going to happen
tomorrow. I mean, there are some basic techniques that need to be developed; we have a lot of them, but, like, you know, to get this to work together in a full system is another story. Are we going to have systems that can reason and plan, perhaps along the lines of the objective-driven AI architectures that I described before? Yeah, but before we get this to work, you know, properly, it's going to take a while. And before we get all those things to
work together and then on top of this
have systems that can learn like
hierarchical planning hierarchical
representations systems that can be
configured for a lot of different situations at hand the way the human brain can, you know, all of this is going to take, you know, at least a decade and probably much more, because there are a lot of problems that we're not seeing right now, that we have not encountered, and so we don't know if there is an easy solution within this framework.
So, you know, it's not just around the corner. I mean, I've been hearing people for the last 12, 15 years claiming that, you know, AGI is just around the corner, and being systematically wrong, and I knew they were wrong when they were saying it; I called their...
Why do you think people have been calling it? First of all, I mean, from the beginning, from the birth of the term artificial intelligence, there has been an eternal optimism that's perhaps unlike other technologies. Is it the Moravec paradox, is that the explanation for why people are so optimistic about AGI? I don't think it's just the Moravec paradox; the Moravec paradox is a consequence of realizing that the world is not as easy as we think. So first of
all um intelligence is not a linear
thing that you can measure with a scalar
with a single number. You know, can you say that humans are smarter than orangutans? In some ways, yes, but in some ways orangutans are smarter than humans, in a lot of domains that allow them to survive in the forest, for example. So IQ is a very limited measure of intelligence. Intelligence is bigger than what IQ, for example, measures. Well,
IQ can measure you know approximately
something for humans, mhm, but because humans kind of, you know, come in a relatively uniform form, right. Right. But it only measures one type
of uh ability that you know may be
relevant for some test but not
others
and uh but then if you talking about
other intelligent entities for which
the you know the the basic things that
are easy to them is very
different then it doesn't mean anything
so intelligence is a collection of
skills and an ability to acquire new
skills efficiently mhm
right and the collection of skills that
an need intelligent particular
intelligent entity possess or is capable
of learning quickly is different from
the collection skills of another one and
because it's a multi-dimensional thing
the set of skills is high dimensional
space you can't measure you can compare
you cannot compare two things as to
whether one is more intelligent than the
other it's
multi-dimensional so you push back
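(A toy way to see why "more intelligent" is at best a partial order; a minimal sketch in which the skill dimensions and scores are made up purely for illustration.)

```python
# Toy illustration: intelligence as a vector of skills. One entity "dominates"
# another only if it is at least as good on every skill and strictly better on
# at least one; otherwise the two are simply incomparable, not ranked.
skills_human = {"language": 0.9, "tool_use": 0.9, "forest_survival": 0.2}
skills_orangutan = {"language": 0.1, "tool_use": 0.5, "forest_survival": 0.95}

def dominates(a: dict, b: dict) -> bool:
    """Pareto dominance over a shared set of skill dimensions."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)

print(dominates(skills_human, skills_orangutan))   # False
print(dominates(skills_orangutan, skills_human))   # False -> incomparable
```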
So you push back against what are called AI doomers a lot. Can you explain their perspective and why you think they're wrong?

Okay, so AI doomers imagine all kinds of catastrophe scenarios of how AI could escape our control and basically kill us all. And that relies on a whole bunch of assumptions that are mostly false. The first assumption is that the emergence of superintelligence is going to be an event: at some point we're going to figure out the secret, and we'll turn on a machine that is superintelligent, and because we've never done it before, it's going to take over the world and kill us all. That is false. It's not going to be an event. We're going to have systems that have all the characteristics of human-level intelligence, but whose level of intelligence would be like a cat, or a parrot, maybe. And then we're going to work our way up to make those things more intelligent, and as we make them more intelligent, we're also going to put some guardrails in them and learn how to put in guardrails so they behave properly. And we're not going to do this with just one; it's not going to be one effort. There are going to be lots of different people doing this, and some of them are going to succeed at making intelligent systems that are controllable and safe and have the right guardrails. And if some other one goes rogue, then we can use the good ones to go against the rogue ones. So it's going to be my smart AI police against your rogue AI. It's not going to be like we're going to be exposed to a single rogue AI that's going to kill us all. That's just not happening.
Now, there is another fallacy, which is the idea that because a system is intelligent, it necessarily wants to take over. And there are several arguments that make people scared of this, which I think are completely false as well. One of them is that, in nature, it seems that the more intelligent species end up dominating the others, and even sometimes extinguishing them, sometimes by design, sometimes just by mistake. And so there is a sort of thinking by which you say, well, if AI systems are more intelligent than us, surely they're going to eliminate us, if not by design then simply because they don't care about us. And that's just preposterous, for a number of reasons. The first reason is that they're not going to be a species. They're not going to be a species that competes with us. They're not going to have the desire to dominate, because the desire to dominate is something that has to be hardwired into an intelligent system. It is hardwired in humans; it is hardwired in baboons, in chimpanzees, in wolves. Not in orangutans. This desire to dominate, or submit, or attain status in other ways is specific to social species. Non-social species like orangutans don't have it, and they are almost as smart as we are.
And to you, there's no significant incentive for humans to encode that into AI systems, and to the degree they do, there will be AIs that punish them for it, that outcompete them.

Well, there are all kinds of incentives to make AI systems submissive to humans. I mean, this is the way we're going to build them. And so then people say, oh, but look at LLMs, LLMs are not controllable. And they're right: LLMs are not controllable. But objective-driven AI, that is, systems that derive their answers by optimization of an objective, means they have to optimize this objective, and that objective can include guardrails. One guardrail is: obey humans. Another guardrail is: don't obey humans if it's hurting other humans.

I've heard that before somewhere, I don't remember where.

Yes, maybe in a book.
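(A minimal sketch of the "guardrail inside the objective" idea; the task, the safety rule, and the brute-force search below are illustrative toys, not a real objective-driven architecture.)

```python
# Toy sketch of objective-driven action selection: the guardrail enters as a
# penalty term inside the objective the system must minimize.
def task_cost(action: float) -> float:
    return (action - 5.0) ** 2                  # hypothetical task: get close to 5.0

def guardrail_cost(action: float) -> float:
    return 1e6 * max(0.0, action - 3.0) ** 2    # hypothetical rule: stay at or below 3.0

def objective(action: float) -> float:
    return task_cost(action) + guardrail_cost(action)

# Crude grid search over candidate actions; a real system would plan by
# gradient-based optimization through a world model.
candidates = [a / 100 for a in range(-1000, 1001)]
best_action = min(candidates, key=objective)
print(f"chosen action = {best_action:.2f}")  # ~3.00: best task value that stays safe
```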
But speaking of that book, could there be unintended consequences from all of this?

Of course. This is not a simple problem. Designing those guardrails so that the system behaves properly is not going to be a simple issue for which there is a silver bullet, for which you have a mathematical proof that the system can be safe. It's going to be a very progressive, iterative design process, where we put guardrails in place in such a way that the system behaves properly, and sometimes it's going to do something that was unexpected, because a guardrail wasn't right, and we're going to correct it so that it does it right. The idea, somehow, that we can't get it slightly wrong, because if we get it slightly wrong we all die, is ridiculous. We're just going to go progressively.
to be the the analogy I've used many
times is um is uh turbojet design um how
how did we figure out how to make
turbojet so unbelievably reliable right
uh I mean those are like you know
incredibly complex uh pieces of Hardware
that run at really high temperatures for
you know 20 20 hours at a time sometimes
and we can you know fly halfway around
the world with a on a two
engine
uh jetliner at near the speed of sound
like how incredible is this it's just
unbelievable right
and did we do this because we invented
like a general principle of how to make
Turbo Jet safe no we it took decades to
kind of fine-tune the design of those
systems so that they they were safe is
there a separate uh group Within General
Electric or snma or whatever that is
specialized in turo jet safety no it's
the design is all about safety because a
better Turbo Jet is also a safer Turbo
Jet so um a more reliable one it's the
same for AI like do you do you need you
know specific Provisions to make AI safe
no you need to make better AI systems
and they will be safe because they are
designed to be more
useful uh and more controllable so let's
So let's imagine an AI system that's able to be incredibly convincing and can convince you of anything. I can at least imagine such a system, and I can see such a system being weapon-like, because it can control people's minds; we're pretty gullible, we want to believe things. You could have an AI system that exploits that, and you could see governments using it as a weapon. So if you imagine such a system, is there any parallel to something like nuclear weapons?

No.

Why not? Why is that technology different? So you're saying there's going to be gradual development?

Yeah. I mean, it might be rapid, but it will be iterative, and then we'll be able to respond, and so on.
So that AI system designed by Vladimir Putin or whatever, or his minions, is going to be trying to talk to every American to convince them to vote for whoever pleases Putin, or to rile people up against each other, as they've been trying to do. But it's not going to be talking to you; it's going to be talking to your AI assistant, which is going to be as smart as theirs. Because, as I said, in the future every single one of your interactions with the digital world will be mediated by your AI assistant. So the first thing you're going to ask is: is this a scam? Is this thing telling me the truth? It's not even going to be able to get to you, because it's only going to talk to your AI assistant, and your AI assistant is going to be like a spam filter. You're not even seeing the spam email, right? It's automatically put in a folder that you never see. It's going to be the same thing: the AI system that tries to convince you of something is going to be talking to an assistant which is going to be at least as smart as it is, and which is going to say, this is spam. It's not even going to bring it to your attention.
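(A toy sketch of that gatekeeper behavior; the keyword heuristic and message strings are invented stand-ins for whatever learned model a real assistant would use.)

```python
# Toy sketch of an AI assistant mediating incoming messages, spam-filter
# style: messages are scored before the user ever sees them.
from typing import Optional

SUSPICIOUS = ("vote for", "wire money", "act now")

def trust_score(message: str) -> float:
    """Hypothetical scorer: 0.0 = likely manipulation, 1.0 = looks fine."""
    return 0.0 if any(s in message.lower() for s in SUSPICIOUS) else 1.0

def mediate(message: str, threshold: float = 0.5) -> Optional[str]:
    """Only surface messages the assistant judges trustworthy."""
    return message if trust_score(message) >= threshold else None

print(mediate("Lunch at noon?"))           # surfaced to the user
print(mediate("Act now: vote for X!"))     # None -> filtered like spam
```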
So to you it's very difficult for any one AI system to take such a big leap ahead, to where it can convince even the other AI systems. There's always going to be this kind of race where nobody's way ahead.

That's the history of the world. The history of the world is that whenever there is progress someplace, there is a countermeasure. It's a cat-and-mouse game.

Mostly, yes. But this is why nuclear weapons are so interesting, because that was such a powerful weapon that it mattered who got it first. You could imagine Hitler, Stalin, Mao getting the weapon first, and that having a different kind of impact on the world than the United States getting the weapon first. But for you, with nuclear weapons as the comparison, you don't imagine a breakthrough discovery and then a Manhattan Project-like effort for AI?

No. As I said, it's not going to be an event. It's going to be continuous progress, and whenever one breakthrough occurs, it's going to be widely disseminated really quickly.
Probably first within industry. I mean, this is not a domain where government or military organizations are particularly innovative; they're in fact way behind. So this is going to come from industry, and this kind of information disseminates extremely quickly. We've seen this over the last few years; even take AlphaGo, which was reproduced within three months, even without particularly detailed information.

Yeah, this is an industry that's not good at secrecy.

No. But even if there is just the fact that you know something is possible, it makes you realize that it's worth investing the time to actually do it. You may be the second person to do it, but you'll do it. And the same is true for all the innovations: self-supervised learning, transformers, decoder-only architectures, LLMs. For those things, you don't need to know exactly the details of how they work to know that they're possible, because they're deployed and then they get reproduced. And then people who work for those companies move; they go from one company to another, and the information disseminates. What makes the success of the US tech industry, and Silicon Valley in particular, is exactly that: information circulates really, really quickly and disseminates very quickly, and so the whole region is ahead because of that circulation of information.
So maybe just to linger on the psychology of AI doomers: you give, in the classic Yann LeCun way, a pretty good example of what happens when a new technology comes to be. You say: an engineer announces, "I invented this new thing. I call it a ballpen." And then the Twittersphere responds: "OMG, people could write horrible things with it, like misinformation, propaganda, hate speech. Ban it now!" Then the writing doomers come in, akin to the AI doomers: "Imagine if everyone can get a ballpen. This could destroy society. There should be a law against using ballpens to write hate speech. Regulate ballpens now!" And then the pencil industry mogul says: "Yeah, ballpens are very dangerous. Unlike pencil writing, which is erasable, ballpen writing stays forever. Government should require a license for pen manufacturers." I mean, this does seem to be part of human psychology when it comes up against new technology. What deep insights can you speak to about this?
Well, there is a natural fear of new technology and of the impact it can have on society, and people have a kind of instinctive reaction to the world they know being threatened by major transformations, whether cultural phenomena or technological revolutions. They fear for their culture, they fear for their jobs, they fear for the future of their children and their way of life. So any change is feared, and you see this throughout history: any technological revolution or cultural phenomenon was always accompanied by groups or reactions in the media that basically attributed all the current problems of society to that particular change. Electricity was going to kill everyone at some point; the train was going to be a horrible thing because you can't breathe past 50 kilometers an hour. And there's a wonderful website called the Pessimists Archive, which has all those newspaper clips of all the horrible things people imagined would arrive because of either a technological innovation or a cultural phenomenon. There are wonderful examples of jazz or comic books being blamed for unemployment, or for young people not wanting to work anymore, and things like that. And that has existed for centuries. These are knee-jerk reactions. The question is: do we embrace change, or do we resist it? And what are the real dangers, as opposed to the imagined ones?
So people worry about, I think, one thing with big tech, something we've been talking about over and over but worth mentioning again: they worry about how powerful AI will be, and they worry about it being in the hands of one centralized power, just a handful of companies in central control. That's the skepticism toward big tech: these companies can make a huge amount of money and control this technology, and by doing so take advantage of, abuse, the little guy in society.

Well, that's exactly why we need open-source platforms.

Yeah, I just wanted to nail the point home more and more.

Yes.
So let me ask you: like I said, you do get a little bit flavorful on the internet. Joscha Bach tweeted something that you LOL'd at, in reference to HAL 9000: "I appreciate your argument, and I fully understand your frustration, but whether the pod bay doors should be opened or closed is a complex and nuanced issue." So you're at the head of Meta AI, and this is something that really worries me: that our AI overlords will speak down to us with corporate speak of this nature, and you sort of resist that with your way of being. Is this something you can comment on, about working at a big company: how you can avoid the over-fearing, I suppose, the harm created through excess caution?
Yeah. Again, I think the answer to this is open-source platforms, and then enabling a widely diverse set of people to build AI assistants that represent the diversity of cultures, opinions, languages, and value systems across the world, so that you're not bound to be brainwashed by a particular way of thinking because of a single AI entity. I think it's a really, really important question for society, and the problem I'm seeing, which is why I've been so vocal, and sometimes a little sardonic, about it...

Never stop, never stop, Yann. We love it.

...is that I see the danger of this concentration of power through proprietary AI systems as a much bigger danger than everything else. If we really want diversity of opinion, in this future where we'll all be interacting through AI systems, we need those systems to be diverse, for the preservation of diversity of ideas, of creeds, of political opinions and whatever, and for the preservation of democracy. And what works against this is people who think that, for reasons of security, we should keep AI systems under lock and key, because it's too dangerous to put them in the hands of everybody, because they could be used by terrorists or something. That would lead to a potentially very bad future in which all of our information diet is controlled by a small number of companies through proprietary systems.
Do you trust humans with this technology, to build systems that are on the whole good for humanity?

Isn't that what democracy and free speech are all about?

I think so.

Do you trust institutions to do the right thing? Do you trust people to do the right thing? Yeah, there are bad people who are going to do bad things, but they're not going to have superior technology to the good people. So then it's going to be my good AI against your bad AI. I mean, there are the examples we were just talking about: maybe some rogue country will build an AI system that's going to try to convince everybody to go into a civil war or something, or to elect a favorable ruler. But then they will have to get past our AI systems.

An AI system with a strong Russian accent will be trying to convince our people, and it won't put any articles in its sentences.

Well, it'll be at the very least absurdly comedic.
Okay. So, since we talked about the physical reality, I'd love to ask about your vision of the future with robots in this physical reality. Many of the kinds of intelligence you've been speaking about would empower robots to be more effective collaborators with us humans. Since Tesla's Optimus team has been showing off some progress on humanoid robots, I think it has really reinvigorated the whole industry, which I think Boston Dynamics has been leading for a very, very long time. So now there are all kinds of companies: Figure AI, obviously Boston Dynamics, Unitree...

Unitree.

There are a lot of them. It's great.

I mean, I love it.

So do you think there will be millions of humanoid robots walking around soon?

Not soon, but it's going to happen. The next decade, I think, is going to be really interesting for robots. The emergence of the robotics industry has been in the waiting for 10, 20 years without really emerging, other than for kind of preprogrammed behavior and things like that. And the main issue is, again, Moravec's paradox: how do we get those systems to understand how the world works and plan actions? We can do it for really specialized tasks, and the way Boston Dynamics goes about it is basically with a lot of handcrafted dynamical models and careful planning in advance, which is very classical robotics with a lot of innovation and a little bit of perception. But they still can't build a domestic robot, right? And we're still some distance away from completely autonomous level-5 driving, and we're certainly very far away from having level-5 autonomous driving by a system that can train itself by driving 20 hours, like any 17-year-old. So until we have, again, world models, systems that can train themselves to understand how the world works, we're not going to have significant progress in robotics.
So a lot of the people working on robotic hardware at the moment are betting, or banking, on the fact that AI is going to make sufficient progress toward that, and they're hoping to discover a product in it too. Before you have a really strong world model, there will be an almost-strong world model, and people are trying to find a product in a clumsy robot, I suppose, not a perfectly efficient robot. So there's the factory setting, where humanoid robots can help automate some aspects of the factory; I think that's a crazy difficult task because of all the safety requirements and that kind of thing. I think the home is more interesting. But then you start to think... I think you mentioned loading the dishwasher, right?

Yeah. I mean, there's cleaning the house, clearing the table after a meal, washing the dishes, cooking: all the tasks that in principle could be automated but are actually incredibly sophisticated, really complicated.

But even just basic navigation around an unknown space full of uncertainty...

That sort of works. You can sort of do this now; navigation is fine.

Well, navigation in a way that's compelling to us humans is a different thing.

Yeah, it's not going to be necessarily... I mean, we do have demos, because there is a so-called embodied AI group at FAIR, and they've been using commercial robots rather than building their own. And you can tell a robot dog, go to the fridge, and it can actually open the fridge, and it can probably pick up a can in the fridge and things like that, and bring it to you. So it can navigate and it can grab objects, as long as it's been trained to recognize them, which vision systems do pretty well nowadays. But it's not a completely general robot that would be sophisticated enough to do things like clearing up the dinner table.
Yeah. To me that's an exciting future: getting humanoid robots, robots in general, into the home more and more, because that gets humans to directly interact with AI systems in the physical space, and in so doing it allows us to philosophically and psychologically explore our relationships with robots. It can be really, really interesting. So I hope you make progress on the whole JEPA thing soon.

Well, I hope things work out as planned. Again, we've been working on this idea of self-supervised learning from video for 10 years, and we only made significant progress in the last two or three.

And actually you've mentioned that there are a lot of interesting breakthroughs that can happen without having access to a lot of compute.

Yeah.

So if you're interested in doing a PhD in this kind of stuff, there are still a lot of possibilities for innovative work. So what advice would you give to an undergrad who's looking to go to grad school and do a PhD?
already uh this idea of how do you train
a world model by
observation and you don't have to train
necessarily on gigantic data sets
or I mean you could out to be necessary
to actually train on large data sets to
have emerging properties like like we
have with LMS but I think there's a lot
of good ideas that can be done
without necessarily scaling up then
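(A minimal sketch of the "world model by observation" idea in its simplest possible form: fitting a next-state predictor from passively observed trajectories. The linear toy system below stands in for video; actual research along these lines, e.g. JEPA-style models, predicts in a learned representation space instead.)

```python
# Toy version of "learning a world model by observation": recover unknown
# dynamics purely from watching state trajectories, with no actions or labels.
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [-0.2, 0.95]])   # unknown dynamics to recover

# Passively observe a trajectory s_0, s_1, ... driven by A_true plus noise.
states = [rng.normal(size=2)]
for _ in range(500):
    states.append(A_true @ states[-1] + 0.01 * rng.normal(size=2))
S = np.array(states)
X, Y = S[:-1], S[1:]

# Least-squares fit of a linear world model: s_{t+1} ~ s_t @ A_hat.
A_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.round(A_hat.T, 2))   # should come out close to A_true
```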
Then there is: how do you do planning with a learned world model? If the world the system evolves in is not the physical world but, let's say, the world of the internet, or some sort of world where an action consists in doing a search in a search engine, or interrogating a database, or running a simulation, or calling a calculator, or solving a differential equation, how do you get a system to actually plan a sequence of actions that produces the solution to a problem? So the question of planning is not just a question of planning physical actions; it could be planning actions to use tools, for a dialogue system or for any kind of intelligent system. And there's some work on this, but not a huge amount: some work at FAIR, one called Toolformer, which was a couple of years ago, and some more recent work on planning. But I don't think we have a good solution for any of that.
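(A minimal sketch of what planning over tool actions could look like: breadth-first search over action sequences against a stand-in world model. Everything here, from the action set to the transition table, is a made-up toy, not FAIR's method.)

```python
# Toy planner: search over sequences of abstract "tool" actions to reach a
# goal state. The world model here is a hand-written dict; in the research
# program discussed above it would be learned.
from collections import deque

# Hypothetical transition model: (state, action) -> next state.
WORLD_MODEL = {
    ("start", "search_web"): "facts_found",
    ("facts_found", "call_calculator"): "numbers_computed",
    ("numbers_computed", "write_answer"): "answered",
}
ACTIONS = ["search_web", "call_calculator", "write_answer"]

def plan(start: str, goal: str, max_depth: int = 5):
    """Breadth-first search for the shortest action sequence reaching goal."""
    queue = deque([(start, [])])
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        if len(actions) < max_depth:
            for a in ACTIONS:
                nxt = WORLD_MODEL.get((state, a))
                if nxt is not None:
                    queue.append((nxt, actions + [a]))
    return None

print(plan("start", "answered"))
# ['search_web', 'call_calculator', 'write_answer']
```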
Then there is the question of hierarchical planning. The example I mentioned, of planning a trip from New York to Paris, is hierarchical, but almost every action that we take involves hierarchical planning in some sense. And we really have absolutely no idea how to do this. There's zero demonstration of hierarchical planning in AI where the various levels of representation that are necessary have been learned. We can do two-level hierarchical planning when we design the two levels. So, for example, you have a dog-like legged robot, and you want it to go from the living room to the kitchen: you can plan a path that avoids the obstacles, and then you can send this to a lower-level planner that figures out how to move the legs to follow that trajectory. So that works, but that two-level planning is designed by hand: we specify what the proper levels of abstraction are, and what representation each level of abstraction has to work with. How do you learn this? How do you learn that hierarchical representation of action plans? With convolutional nets and deep learning, we can train a system to learn hierarchical representations of percepts. What is the equivalent when what you're trying to represent is action plans?

For action plans, yeah.
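(A minimal sketch of the hand-designed two-level setup just described: a high-level grid planner that avoids obstacles, handing waypoints to a stub low-level controller. The grid and the controller are illustrative toys; the open question above is how to learn such levels rather than write them.)

```python
# Hand-designed two-level planning: we, not the robot, chose the two levels
# of abstraction (grid cells up top, footsteps below).
from heapq import heappush, heappop

GRID = [  # 0 = free, 1 = obstacle; living room at (0,0), kitchen at (3,3)
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]

def high_level_path(start, goal):
    """Dijkstra over grid cells: the hand-designed upper level."""
    frontier, seen = [(0, start, [start])], set()
    while frontier:
        cost, cell, path = heappop(frontier)
        if cell == goal:
            return path
        if cell in seen:
            continue
        seen.add(cell)
        r, c = cell
        for nr, nc in [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]:
            if 0 <= nr < 4 and 0 <= nc < 4 and GRID[nr][nc] == 0:
                heappush(frontier, (cost + 1, (nr, nc), path + [(nr, nc)]))
    return None

def low_level_steps(a, b):
    """Stub for the lower-level controller that would turn a waypoint
    transition into leg motions; a real robot solves this separately."""
    return f"step from {a} to {b}"

waypoints = high_level_path((0, 0), (3, 3))
for a, b in zip(waypoints, waypoints[1:]):
    print(low_level_steps(a, b))
```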
So you want, basically, a robot dog or humanoid robot that turns itself on and travels from New York to Paris all by itself, for example.

Right. It might have some trouble at the TSA, but yeah.

No, but even doing something fairly simple, like a household task.

Sure.

Like cooking or something. There's a lot involved. It's a super complex task, and once again, we take it for granted.
What hope do you have for the future of humanity? We're talking about so many exciting technologies, so many exciting possibilities. What gives you hope when you look out over the next 10, 20, 50, 100 years? If you look at social media, there are wars going on, there's division, there's hatred, all this kind of stuff that's also part of humanity. But amidst all that, what gives you hope?
We can make humanity smarter with AI. I mean, AI basically will amplify human intelligence. It's as if every one of us had a staff of smart AI assistants. They might be smarter than us, and they'll do our bidding, perhaps executing tasks in ways that are much better than we could do ourselves, because they'll be smarter than us. And so it's like everyone would be the boss of a staff of super smart virtual people. So we shouldn't feel threatened by this, any more than we should feel threatened by being the manager of a group of people, some of whom are more intelligent than us. I certainly have a lot of experience with this, of having people working with me who are smarter than me. That's actually a wonderful thing. So having machines that are smarter than us, that assist us in all of our tasks and daily lives, whether professional or personal, would I think be an absolutely wonderful thing, because intelligence is the commodity that is most in demand. That's really what I mean: all the mistakes that humanity makes are because of lack of intelligence, really, or lack of knowledge, which is related. So making people smarter can only be better, for the same reason that public education is a good thing, and books are a good thing, and the internet is also a good thing, intrinsically. And even social networks are a good thing, if you run them properly; it's difficult, but you can. Because they help the communication of information and knowledge and the transmission of knowledge. So AI is going to make humanity smarter.
And the analogy I've been using is that perhaps an equivalent event in the history of humanity, to what might be brought about by AI, is the invention of the printing press. It made everybody smarter. The fact that people could have access to books, and that books were a lot cheaper than they were before, meant a lot more people had an incentive to learn to read, which wasn't the case before, and people became smarter. It enabled the Enlightenment; there wouldn't have been an Enlightenment without the printing press. It enabled philosophy, rationalism, the escape from religious doctrine, democracy, science. And certainly without it there wouldn't have been the American Revolution or the French Revolution, and we would still be under feudal regimes, perhaps. And so it completely transformed the world, because people became smarter and learned about things. Now, it also created 200 years of essentially religious conflict in Europe, because the first thing that people read was the Bible, and they realized that perhaps there was a different interpretation of the Bible than what the priests were telling them. And so that created the Protestant movement and created the rift; and in fact the Catholic Church didn't like the idea of the printing press, but they had no choice. So it had some bad effects and some good effects. I don't think anyone today would say that the invention of the printing press had an overall negative effect, despite the fact that it created 200 years of religious conflict in Europe.
Now, compare this, and I thought I was very proud of myself for coming up with this analogy, but I realized someone else came up with the same idea before me: compare this with what happened in the Ottoman Empire. The Ottoman Empire banned the printing press for 200 years. And it didn't ban it for all languages, only for Arabic; you could actually print books in Latin or Hebrew or whatever in the Ottoman Empire, just not in Arabic. And I thought it was because the rulers just wanted to preserve control over the population and the dogma, the religious dogma and everything. But after talking with the UAE minister of AI, Omar, he told me no, there was another reason. And the other reason was to preserve the corporation of calligraphers: there's an art form of writing those beautiful Arabic poems or religious texts, and it was a very powerful corporation of scribes, basically, that ran a big chunk of the empire, and you couldn't put them out of business. So they banned the printing press in part to protect that business.
Now, what's the analogy for AI today? Who are we protecting by banning AI? Who are the people asking that AI be regulated to protect their jobs? Of course, there is a real question of what the effect of a technological transformation like AI will be on the job market and the labor market. There are economists who are much more expert at this than I am, but when I talk to them, they tell us: we're not going to run out of jobs; this is not going to cause mass unemployment; this is just going to be a gradual shift across different professions. The professions that are going to be hot 10 or 15 years from now, we have no idea today what they're going to be. In the same way, if we go back 20 years in the past, who could have thought 20 years ago that the hottest job even 5 or 10 years ago would be mobile app developer? Smartphones hadn't been invented.

Most of the jobs of the future might be in the metaverse.

Well, it could be.

But the point is you can't possibly predict. But you're right, I mean, you've made a lot of strong points.
And I believe that people are fundamentally good, and so if AI, especially open-source AI, can make them smarter, it just empowers the goodness in humans.

So I share that feeling. Okay, I think people are fundamentally good. And in fact, a lot of doomers are doomers because they don't think that people are fundamentally good, and they either don't trust people or they don't trust institutions to do the right thing so that people behave properly.

Well, I think both you and I believe in humanity, and I think I speak for a lot of people in saying thank you for pushing the open-source movement, pushing toward making research in AI open and available to people, and also open-sourcing the models themselves. So thank you for that. And thank you for speaking your mind in such colorful and beautiful ways on the internet; I hope you never stop. You're one of the most fun people I know and get to be a fan of. So thank you for speaking to me once again, and thank you for being you.

Thank you, Lex.
Thanks for listening to this conversation with Yann LeCun. To support this podcast, please check out our sponsors in the description. And now, let me leave you with some words from Arthur C. Clarke: "The only way to discover the limits of the possible is to go beyond them into the impossible." Thank you for listening, and hope to see you next time.