Transcript
aGBLRlLe7X8 • Oriol Vinyals: Deep Learning and Artificial General Intelligence | Lex Fridman Podcast #306
/home/itcorpmy/itcorp.my.id/harry/yt_channel/out/lexfridman/.shards/text-0001.zst#text/0653_aGBLRlLe7X8.txt
Kind: captions
Language: en
at which point is the neural network
a being
versus a tool
the following is a conversation with
oriol vinyals his second time on the
podcast oriol is the research director
and deep learning lead at deepmind and
one of the most brilliant thinkers and
researchers in the history of artificial
intelligence
this is the lex fridman podcast to
support it please check out our sponsors
in the description and now dear friends
here's oriol vinyals
you are one of the most brilliant
researchers in the history of ai working
across all kinds of modalities probably
the one common theme is it's always
sequences of data uh so that we're
talking about languages images even
biology and uh games as we talked about
last time so
you're a good person to ask this
in your lifetime will we be able to
build an ai system that's able to
replace me as the interviewer
in this conversation
in terms of ability to ask questions
that are compelling to somebody
listening and then
further question is
are we close
will we be able to build a system that
replaces you
as the interviewee
in order to create a compelling
conversation how far away are we do you
think it's a good question um i think
partly i would say do we want that i
i really like when we start now with
very powerful models interacting with
them
and thinking of them more closer to us
the question is if you remove the human
side of the conversation is that an
interesting you know is that an
interesting artifact and i would say
probably not i've seen for instance um
last time we spoke we were
talking about starcraft um and creating
you know agents that play games involves
self-play but ultimately what people
care about was
how does this agent behave when the
opposite side is is a human
so
without a doubt we will probably be more
empowered by ai um maybe you can
source some questions from an ai system
i mean that even today i would say it's
quite plausible that with your
creativity you might actually find very
interesting questions that you can
filter we call this cherry picking
sometimes in the field of language um
and likewise if i had now the tools on
my side i could say look you're asking
this interesting question
from this answer i like the words chosen
by this particular system that created a
few words
completely replacing it feels
not exactly exciting to me um although
in my lifetime i think well i mean given
the trajectory i think it's possible
that perhaps there could be interesting
um maybe self-play interviews as you
you're suggesting that would look look
or sound kind of quite interesting and
probably would advocate or you could
learn a topic through listening to one
of these interviews at a basic level at
least so you said it doesn't seem
exciting to you but what if exciting is
part of the objective function the thing
is optimized over so you can there's
probably a huge amount of data
of humans if you look correctly of
humans communicating online and there's
probably ways to measure the degree of
you know as they talk about engagement
so you can probably optimize for the
questions that most
created engaging conversations in the
past so actually if you strictly use the
word exciting
there is probably
a way to create an optimally exciting
conversations
that are involved ai systems at least
one side is ai yeah that makes sense i
think
maybe looping back a bit to to games and
the game industry when you design
algorithms um you're thinking about
winning as the objective right or the
reward function but in fact when we
discuss this with blizzard the creators
of starcraft in this case i think
what's exciting fun um if you could
measure that and optimize for that
that's probably why we play video games
or why we interact or listen or look at
cat videos or whatever on the internet
so it's true that modeling reward beyond
the obvious reward functions we're used
to in reinforcement learning is
definitely very exciting and again there
is some progress actually into um a
particular aspect of ai which is quite
critical which is um for instance is a
conversation that or is the information
truthful right so you could start trying
to evaluate um these from
excerpts from the internet right that has
lots of information and then if you can
learn a function automated ideally so
you can also optimize it more easily
then you could actually have
conversations that optimize for
non-obvious things such as excitement
so yeah that's quite possible and then i
would say in that case it would
definitely be fun a fun exercise and
quite unique to have at least one side
that is fully driven by an excitement
reward function um but obviously
there would be still quite a lot of
humanity in the system both from who
who is building the system of course and
also
ultimately if we think of labeling for
excitement that those labels must come
from us because it's just
hard to
have a computational measure of
excitement as far as i understand
there's no such thing
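The learned-reward idea from a few turns earlier, fitting a function to human labels of excitement and then scoring candidates against it, can be sketched roughly as below. The features, labels, and averaging rule are all illustrative stand-ins, not any real DeepMind system:

```python
def fit_reward_model(labeled_examples):
    """Average per-feature contribution from (features, label) pairs --
    a toy stand-in for training a real learned reward model."""
    n = len(labeled_examples[0][0])
    w = [0.0] * n
    for feats, label in labeled_examples:
        for i, f in enumerate(feats):
            w[i] += f * label / len(labeled_examples)
    return w

def score(w, feats):
    """Score a candidate reply under the learned reward."""
    return sum(wi * fi for wi, fi in zip(w, feats))

# hypothetical features: [asks_a_question, mentions_cats]
# labels: human-rated excitement in [0, 1]
data = [([1, 0], 0.2), ([0, 1], 0.9), ([1, 1], 1.0)]
w = fit_reward_model(data)

# once the reward is automated, candidates can be ranked by it
candidates = {"boring reply": [0, 0], "cat question": [1, 1]}
best = max(candidates, key=lambda c: score(w, candidates[c]))
print(best)
```

The point of the sketch is the pipeline, not the model: human labels train a scoring function once, and after that the function can be queried cheaply to optimize for a non-obvious objective like excitement.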
you mentioned truth also i would
actually
venture to say that excitement is easier
to label than truth
or is perhaps uh has lower consequences
of failure
but there is
perhaps
the humanness that you mentioned that's
perhaps part of a thing that could be
labeled and that could mean
an ai system that's doing dialogue
that's doing conversations
should be
flawed for example
like that's the thing you optimize for
which is uh have inherent contradictions
by design have flaws by design
maybe it also needs to have a strong
sense of identity
so it has a backstory it told itself
that it sticks to it has memories
not in terms of the how the system is
designed but it's able to tell stories
about its past
it's
able to have
um mortality and fear of mortality in
the following way that it has an
identity
and like if it says something stupid and
gets cancelled on twitter that's the end
of that system so it's not like you get
to rebrand yourself that system is
that's it so maybe that the the
high-stakes nature of it because like
you can't say anything stupid now oriol
because
uh you'll be canceled on twitter and
that there's there's stakes to that and
that i think part of the reason that
makes it uh
interesting and then you have a
perspective like you've built up over
time that you stick with and then people
can disagree with you so holding that
perspective strongly
holding sort of a maybe a controversial
at least a strong opinion all of those
elements it feels like they can be
learned because it feels like there's a
lot of data
on the internet of people having an
opinion
and then combine that with a metric of
excitement you can start to create
something that as opposed to trying to
optimize for uh
sort of
grammatical clarity and truthfulness
the the factual
consistency over many sentences you're
optimized for
the humanness
and there's obviously data for humanness
on the internet
so i wonder
i wonder if there's a future where
that's part
or i mean i i i sometimes wonder that
about myself i'm a huge fan of podcasts
and i listen to some podcasts and i
think like what is interesting about
this what is compelling
uh the same way you watch other games
like you said watch play starcraft or
have magnus carlsen play chess
so i'm not a chess player so but it's
still interesting to me and what is that
that's the
uh the stakes of it maybe um the end of
a domination of a series of wins i don't
know there's all those elements
somehow connect to a compelling
conversation and i wonder how hard is
that to replace because ultimately all
of that connects the initial proposition
of how to test
whether an ai is intelligent or not with
the turing test
which i guess my question comes from a
place of the spirit of that test
yes um i actually recall i was just
listening to our first podcast where we
discussed turing tests um so
i would say
from a
neural network you know ai builder
perspective um there's
you know usually you try to map many of
these interesting topics you discuss to
to benchmarks and then also to actual
architectures on the how these systems
are currently built how they learn what
data they learn from what are they
learning right we're talking about
weights of a mathematical function and
then looking at the current state of the
game maybe
what do we
need leaps forward to get to the
ultimate stage of all these experiences
um lifetime experience of fears like
words that currently
we're barely seeing um progress
just because what's happening today is
you take
all these human interactions um it's a
large vast variety of human
interactions online and then you're
distilling these
sequences right going back to my passion
like sequences of words letters um
images sound there's more modalities
here to be to be at play and then you're
trying to
just learn a function that
maximizes the likelihood of
seeing all these um through a neural
network um now
i think there's a few
places where the way we currently train
these models we would clearly like them to be
able to develop the kinds of
capabilities you say i'll tell you
maybe a couple one is
the lifetime of an agent or a model
so you
you learn from this data offline right
so you're just passively observing and
maximizing this you know it's almost
like a mountains like a landscape of
mountains and then everywhere there's
data that humans interacted in this way
you're trying to make that higher and
then you know lower where there's no
data and then these models generally
don't
then experience themselves these they
just are observers right they're passive
observers of the data and then we're
putting them to then generate data when
we interact with them but that's very
limiting the experience they actually
experience um when they could maybe be
optimizing or further optimizing the
weights we're not even doing that so to
be clear and again mapping to
alphago and alphastar we train the model
and when we deploy it um to play against
humans or in this case interact with
humans um like language models they
don't even keep training right they're
not learning in the sense of the weights
that you've
learned from the data they don't keep
changing
now there's something a bit more
feels magical but it's understandable if
you're into neural nets which is well
they might not
learn in the strict sense of the words
the way it's changing maybe that's
mapping to how neurons interconnect and
how we learn over our lifetime but it's
true that
the context of the conversation that
takes place
when you talk to these systems it's held
in their working memory right it's
almost like um you start a computer it
has a hard drive that has a lot of
information you have access to the
internet which has probably all the
information but there's also a working
memory
where the these agents as we call them
or start calling them build upon
now this memory is very limited um i
mean right now we're talking to be
concrete about 2 000 words that we hold
and then beyond that we start forgetting
what we've seen so you can see that
there's some short-term coherence
already right with when you said i mean
it's a very interesting topic um having
sort of a mapping
um an agent to like have consistency
then you know if if you say oh what's
your name um it could remember that but
then it might forget beyond 2000 words
which is not
that long of context if we think even of
these podcast um books are much longer
so
technically speaking there's a
limitation there super exciting from
people that work on deep learning to be
working on
but
i would say we lack maybe benchmarks and
the technology to have
this lifetime
like experience of memory that keeps
building up
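The working-memory limitation described here, a budget of roughly 2,000 words with everything older forgotten, can be sketched as a sliding window. The whitespace "tokenizer" and the exact budget are simplifications for illustration:

```python
MAX_CONTEXT = 2000  # rough token budget mentioned in the conversation

def update_context(context_tokens, new_utterance):
    """Append new tokens, then keep only the most recent MAX_CONTEXT."""
    tokens = context_tokens + new_utterance.split()
    return tokens[-MAX_CONTEXT:]  # everything earlier is "forgotten"

context = []
context = update_context(context, "my name is oriol")
# a long stretch of conversation pushes the early introduction out of memory
context = update_context(context, " ".join(["what is your name"] * 1000))
print("my" in context)
```

After enough turns, the earlier tokens fall outside the window, which is exactly why the model can remember your name for a short exchange but loses it over a podcast-length or book-length conversation.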
um however the way it learns offline is
clearly very powerful right so
i you know you asked me three years ago
i would say oh we're very far i think
we've seen the power of this imitation
again
on the internet scale that has enabled
this um to
feel like at least the knowledge the
basic knowledge about the world now is
incorporated into the weights
but then this
experience is lacking and in fact as i
said we don't even train them when you
know when we're talking to them other
than
their working memory of course is
affected so that's the dynamic part but
they don't learn in the same way that
you and i have learned right when
from basically when we were born and
probably before
so lots of fascinating interesting
questions you asked there i think um
the one i mentioned is this idea of
memory and experience versus just kind
of observe the world and learn its
knowledge which i think for that i would
argue lots of recent advancements that
make me very excited about the field
and then the second
maybe issue that i see is
all these models
we train them from scratch that's
something i would have complained three
years ago or six years ago or 10 years
ago
and it feels
if we take inspiration from how we got
here how the universe evolved us
and we keep evolving it feels
that is a missing piece that we should
not be training models from scratch um
every few months that there should be
some sort of
way in which we can grow models um much
like we as a species and many other
elements in the universe build
from previous sort of iterations
and that from uh just purely neural
network perspective
even though we we would like to make it
work it's proven very hard to not
you know throw away the previous weights
right this landscape we learn from the
data and you know refresh it with a
brand new set of weights um given
maybe a
recent snapshot of these data sets we
train on etc or even a new game we're
learning so that's
that feels like something is missing
fundamentally we might find it but it's
not very clear what it will look like
there's many ideas and it's super
exciting as well yes just for people who
don't know when you approach a new
problem in machine learning
you're going to come up with an
architecture that has a a bunch of
weights and then you initialize them
somehow which
in most cases is some version of random
so that's what you mean by starting from
scratch and it seems like it's a it's a
waste
every time you
solve uh the game of go in chess
starcraft
uh protein folding like surely there's
some way to reuse the weights as we grow
this giant database of
of
of neural networks that have solved some
of the toughest problems in the world
and so
some of that is um
what are the methods
how to reuse weights
how to learn extract what's
generalizable or at least has a chance
to be
and throw away the other stuff
uh and maybe the neural network itself
should be able to tell you that
like what
um yeah how do you what ideas do you
have for better initialization of
weights maybe stepping back if we look
at the field of machine learning but
especially deep learning
right at the core of deep learning
there's this beautiful idea that is
a single algorithm can solve any task
right so
it's been proven over and over with an
increasing set of benchmarks and things
that were thought impossible that are
being cracked by this basic principle
that is
you take a neural network of
uninitialized weights so like a blank
computational brain
um then you give it
in the case of supervised learning a lot
ideally of examples of hey here is what
the input looks like and the desired
output should look like this i mean
image classification is very clear
example images to maybe one of a
thousand categories that's what imagenet
is like but many many if not all
problems can be mapped this way
and then
there's a generic recipe right that you
can use
um and this recipe with
very little change and i think that's
the core of deep learning research right
that what is the recipe that is
universal that for any new given task
i'll be able to use without thinking
without having to work very hard on the
problem at stake
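The "generic recipe" just described, a model with randomly initialized weights, a set of input/output examples, and an optimization loop, can be illustrated with a deliberately tiny stand-in: a linear model fit by gradient descent rather than a neural network, with a made-up target rule:

```python
import random

random.seed(0)
w, b = random.random(), random.random()  # blank, randomly initialized weights

# supervised examples: the hidden rule here is y = 2x + 1
data = [(x, 2 * x + 1) for x in range(-5, 6)]

lr = 0.01
for _ in range(2000):          # the generic optimization loop
    for x, y in data:
        err = (w * x + b) - y  # prediction error on one example
        w -= lr * err * x      # gradient step on the squared error
        b -= lr * err

print(round(w, 2), round(b, 2))
```

The same loop shape, initialize, show examples, nudge weights, is what scales up to image classification or language modeling; only the model and the data change.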
we have not found this recipe but
i think the field is
excited to find um fewer tweaks or tricks
that people find when they work on
important problems specific to those and
more of a general algorithm right so at
an algorithmic level i would say we have
something general already which is this
formula of training a very powerful
model a neural network on a lot of data
and in many cases
you need some specificity to the actual
problem you're solving um protein
folding being such an important problem
has some basic recipe that is learned
from before right like
transformer models graph neural networks
um ideas coming from nlp like uh you
know
something called bert that is a kind of
loss that you can put in place to help the
model uh
knowledge distillation is another
technique right so this is the formula
we still had to find some particular
things that were specific to alphafold
right that's very important because
protein folding is such a high value
problem that as humans we should solve
it no matter if we need to be a bit
specific and it's possible that some of
these learnings will apply then to the
next iteration of this recipe that deep
learners are about
but it is true that so far
the recipe is what's common but the
weights you generally throw away which
feels very sad um
although
maybe in the last especially in the last
two three years
and when we last spoke i mentioned this
area of meta-learning which is the idea
of learning to learn
that idea and some progress has been had
starting i would say mostly from gpt-3 on
the language domain only in which you
could conceive a model that is trained
once and then this model is not narrow
in that it only knows how to translate a
pair of languages or it only knows how
to assign sentiment to a sentence these
these actually
you could teach it by a prompting is
called and this prompting is essentially
just showing it a few more examples um
almost like you do show examples input
output examples algorithmically speaking
to the process of creating this model
but now you're doing it through language
which is very natural way for us to
learn from one another i tell you hey
you should do this new task i'll tell
you a bit more maybe you asked me some
questions and now you know the task
right you didn't need to retrain it from
scratch and we've seen these magical
moments almost um in this way to do
few-shot prompting through language on
language only domain and then in the
last
two years we've seen these expanded to
beyond language
adding vision adding actions and games
lots of progress to be had but this is
maybe if you ask me like about how are
we going to crack this problem this is
perhaps one way in which you have a
single model
the problem of this model is it's hard
to grow
in weights or capacity but the model is
certainly so powerful that you can teach
it some tasks right in this way that i
teach you i could teach you a new task
now if we were to pick let's say a
text-based task or a classification a
vision style task
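The prompting-as-teaching idea can be sketched as prompt construction alone: the "training" is just demonstrations placed in the context, and no weights change. The task and format below are illustrative, not any specific DeepMind system:

```python
def few_shot_prompt(task_description, examples, query):
    """Build a prompt from input->output demonstrations plus a new input."""
    lines = [task_description]
    for inp, out in examples:
        lines.append(f"Input: {inp}\nOutput: {out}")
    lines.append(f"Input: {query}\nOutput:")  # model would complete from here
    return "\n\n".join(lines)

prompt = few_shot_prompt(
    "Label the sentiment of each sentence as positive or negative.",
    [("I loved this movie.", "positive"),
     ("The food was terrible.", "negative")],
    "What a wonderful day!",
)
print(prompt)
```

A sufficiently powerful pretrained model, given this text, tends to continue the pattern, which is what makes it feel like teaching a new task through language rather than retraining from scratch.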
but it still feels like more
breakthroughs should be had but it's a
great beginning right we have a good
baseline we have an idea that this maybe
is the way we want to benchmark progress
towards agi and i think in my view
that's critical to always have a way to
benchmark the community sort of
converging to this overall which is good
to see
and then this is actually
what excites me in terms of also next
steps um for deep learning is how to
make these models more powerful how do
you train them how to grow them if they
must grow should they change their
weights as you teach it the task or not
there's some interesting questions many
to be answered yeah you've opened the
door about
to a bunch of questions i want to ask
but let's first return to the
uh to your tweet and read it like
shakespeare you wrote gato is not the
end it's the beginning and then you
wrote meow and then an emoji of a cat
uh so first two questions first can you
explain the meow and the cat emoji and
second can you explain what gato is and
how it works right indeed i mean thanks
thanks for reminding me that we're all
exposing on twitter and
permanently there yes permanently one of
the greatest ai researchers of all time
meow and cat emoji yes there you go
right so can you imagine like turing uh
tweeting
meow and cat probably he would probably
would probably so yeah the tweet
is important actually um you know i put
thought on the tweets i hope people
which part do you think okay
so there's three sentences
gato is not the end
gato is the beginning
meow cat emoji okay which is the
important part the meow no no
definitely um that it is the beginning i
mean i i probably was just explaining um
a bit
where the field is going but um let me
tell you about gato so
first the name gato
comes from maybe a sequence of releases
that deepmind had that
named uh like used animal names to name
some of their models that are based on
this idea of large sequence models
initially their only language but we're
expanding to other modalities so we had
a you know we had
gopher
chinchilla these were language only and
then more recently we released flamingo
which adds vision to the equation and
then gato which
adds vision and then also actions in the
mix right um as we discuss actually
actions um especially discrete actions
like up down left right
i just told you the actions but they're
words so you can kind of see how actions
naturally map to sequence modeling of
words which these models are very
powerful
so
gato was named after i believe i can
only recall this from memory right this you know
these things always happen with an
amazing team of researchers behind so
before the release yeah um we had a
discussion about which animal would we
pick right and i think because of the
word general agent right and and this is
a property quite unique to gato um we we
kind of were playing with the ga words
and then you know gato arose the cat
yes um and gato is obviously a spanish
version of cat i had nothing to do with
it although i'm from spain
wait sorry how do you say cat in spanish
gato oh god okay yeah no okay okay i see
i see i see you now it all makes sense
okay how do you say meow in spanish no
that's
i think you you say it the same way
but you write it uh is
m-i-a-u okay it's universal yeah all
right so then how does the thing work so
you said general is
so you said uh language
vision
and action action
how does this
can you explain what kind of neural
networks are involved what does the
training look like
maybe um
what are some beautiful ideas within
the system yeah so
maybe the basics of gato are not that
dissimilar from many works that
came before so here is where the sort of
the recipe i mean hasn't changed too
much there is a transformer model that's
just the kind of recurrent neural
network
that essentially takes a sequence of
modalities observations that could be
words could be vision or could be
actions and then
its own objective that you train it to
do when you train it is to predict what
the next
anything is and anything means what's
the next action if this sequence that
i'm showing you to train is a sequence
of actions and observations then you're
predicting what's the next action and
the next observation right so you you
think of of this really as a sequence of
bytes right so take any sequence um of
words a sequence of interleaved words
and images a sequence of um maybe um
observations that are images and moves
in atari up down left right and these
you just
think of them as bytes and you're
modeling what's the next byte gonna be
like and you might interpret that as an
action as an action and then play it in
a game or you could interpret it as a
word and then write it down if you're
chatting with the system and so on um
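The serialize-everything-into-one-token-stream framing can be sketched as below. The token-id offsets and vocabularies are made up for illustration, not Gato's actual tokenization:

```python
# one shared id space, partitioned by modality via offsets (all made up)
TEXT_OFFSET, IMAGE_OFFSET, ACTION_OFFSET = 0, 10_000, 20_000
ACTIONS = {"up": 0, "down": 1, "left": 2, "right": 3}

def tokenize_step(patch_ids, action):
    """One environment step: image-patch tokens, then an action token."""
    return [IMAGE_OFFSET + p for p in patch_ids] + [ACTION_OFFSET + ACTIONS[action]]

# an (observation, action) trajectory flattened into one token stream
episode = tokenize_step([5, 17], "up") + tokenize_step([5, 18], "left")
print(episode)

# next-token prediction: each token is predicted from the ones before it
pairs = list(zip(episode[:-1], episode[1:]))
```

Once words, patches, and actions live in one id space, a single sequence model can train on all of them with the same next-token objective, and at inference time a predicted id is interpreted back as a word or an action depending on where it falls.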
so gato basically
can be thought of as taking as inputs
images
text
video
actions
it also actually inputs some sort of
proprioception sensors from robotics
because robotics is one of the tasks
that it's been trained to do and then at
the output similarly it outputs words
actions it does not output images um
that's just by design we decided not to
go that way for now um that's also in
part why it's the beginning because
there's more to do clearly
but that's kind of what gato is
this brain that essentially you give it
any sequence of these observations and
and modalities and it outputs the next
step and then off you go you feed the
next step in and predict the
next one and so on now
it is
more than a language model because even
though you can chat with gato like you
can chat with chinchilla or flamingo um
it also
is an agent right so that's
why we call it the a of gato like the
word uh the letter a and also it's
general um it's not an agent that's been
trained to be good at only starcraft or
only atari or only go it's been trained
on a vast variety of data sets so
what makes an agent if i may interrupt
the fact that it can generate actions
yes so
when we call it i mean it's a it's a
good question right what why when do we
call a model i mean everything is a
model but what is an agent in my view is
indeed the capacity to take actions in
an environment that you then send to it
and then the environment might return
with a new observation um and then you
generate the next action this this
actually this reminds me of the question
from the side of biology what is life
which is actually a very difficult
question as well what is living
what is living when you think about life
here on this planet earth and a question
interesting to me about aliens what is
life when we visit another planet would
we be able to recognize it and this
feels like it sounds perhaps silly but i
don't think it is at which point is the
neural network
a being
versus a tool
and it feels like action ability to
modify its environment as that
fundamental leap
yeah i think it's it certainly feels
like action is a necessary condition to
to be
more alive but probably not sufficient
either um yeah so sadly consciousness
thing whatever yeah yeah we can get back
to that later but anyways going back to
the meow and gato right so
um
one of the
leaps forward and what took the team a
lot of effort and time was
um as you were asking
how has gato been trained so i told you
gato is this transformer neural network
models actions um
sequences of actions words etc
and then the way we train it is by
essentially pooling data sets
of
um observations right so it's a massive
imitation learning algorithm that
imitates obviously
what is the next word that comes next
from the usual data sets we used before
right so these these are these web scale
style data sets of people um writing you
know
on the web or chatting or whatnot right
so that's an obvious source that we use
on all language work but then
we also took a lot of agents that we
have at deepmind i mean as you know
deepmind we're quite um
you know we're quite interested in
learning um reinforcement learning and
learning agents that play in different
environments so we kind of created a
data set of these trajectories as we
call them or agent experiences so in a
way there are other agents we train for
a single-minded purpose to let's say um
you know control a 3d game environment
and navigate a maze so we had all the
experience that was created through the
one agent interacting with that
environment and we added this to the
data set right and as i said we just see
all the data all these sequences of
words or sequences of this agent
interacting with that environment
or you know agents playing atari and so
on we see this as the same kind of data
and so we mix these data sets together
and we train gato
that's the g part right it's general
because it really has mixed it it
doesn't have different brains for each
modality or each narrow task it has a
single brain it's not that big of a
brain compared to most of the neural
networks we see these days it has one
billion parameters
some models we're seeing get in the
trillions these days and certainly 100
billion feels like um
a size that is very common these days
when you train these models so the
actual
agent is relatively small but it's been
trained on on a very challenging diverse
data set not only containing all of
internet but containing all these agent
experiences playing very different
distinct environments
so this
brings us to the part of the tweet of
this is not the end is the beginning it
it feels very cool to see gato
in principle is able to control
any sort of environments um that
especially the ones that it's been
trained to do these 3d games atari games
and all sorts of robotics tasks and so
on
um but
obviously it's not as proficient as the
teachers it learned from on these
environments not obvious
it's not obvious that it wouldn't be
more proficient
it's just the current beginning part
right is that
the performance is such that it's not as
good as if it's specialized to that task
right so
it's not as good although i would argue
size matters here so the fact that i
would argue size always matters
yeah that's a different story
but but for neural networks certainly
size does matter so um it's the
beginning because it's relatively small
so obviously scaling this idea up um
might make
the
connections that exist between
you know text on the internet and
playing atari and so on more
synergistic with one another and you
might gain and that moment we didn't
quite see but obviously that's why it's
the beginning that synergy might emerge
with scale right might emerge with scale
and also i believe there's some new
research or ways in which you prepare
the data um that you might need to
sort of make it more clear to the model
that
you're not only playing atari and it's
just you start from a screen and here is
up and a screen and down maybe you can
think of playing atari as there's some
sort of context that is needed for the
agent before it starts seeing oh
this is an atari screen i'm going to
start playing
um you might require for instance to to
be told in words
hey in this sequence
that i'm showing you you're going to be
playing an atari game
so text might actually be a good driver
to
enhance the data right so then these
connections might be made more easily
right that's that's an idea that we
start seeing
in language but you know obviously
beyond this is going to be effective
right it's not like i show you a
screen and from scratch
you're supposed to learn a game there is
a lot of context we might set so there
are there might be some work needed as
well to set that context um but
anyways there's a lot of work yeah so
that context puts all the different
modalities on the same level ground
exactly provide the context best so
maybe on that point uh so there's this
task which
may not seem trivial of
tokenizing the data of converting the
data into
pieces into basic atomic elements
that then could uh
cross modalities somehow so what's
tokenization
how do you tokenize text how do you
tokenize images how do you tokenize
games and actions and robotics
tasks yeah that's a great question so
tokenization is
the entry point to actually make all the
data look like a sequence because tokens
then are just kind of these little
puzzle pieces we break down anything
into these puzzle pieces and then we
just model what does this puzzle
look like right when you you know
lay it down in a line so to speak in a
sequence
so
in gato um
the text there's a lot of work you
tokenize text usually by looking at
commonly used substrings right
so there's you know ing in english is a
very common substring so that becomes a
token um it's a quite well studied
problem tokenizing text and gato
just used the standard techniques that
have been developed from many years even
starting from n-gram models in the 1950s
and so on just for context how many
tokens like what order magnitude number
of tokens is required for a word
yeah actually what are we talking about
yeah for a word in in english right i
mean every language is very different um
the current level or granularity of
tokenization generally
means is maybe
two to five i mean i i don't know the
statistics exactly but to give you an
idea um we don't tokenize at the level
of letters then it would probably be
like i don't know what the average
length of of a word is in english but
that would be you know the minimum
set of tokens you could use so bigger
than letters smaller than words yes yes
and you could think of very common
words like the i mean that would be a
single token but very quickly you're
talking two three four tokens have
you ever tried to tokenize emojis emojis
are actually just
um
sequences of letters so maybe to you but
to me they mean so much more yeah you
can render the emoji but you you might
if you actually just yeah this is a
philosophical question is emojis an
image or a text
the way
we do these things is
they're actually mapped to
small sequences of characters yeah so
you can actually play with these models
and input emojis it will output emojis
back um which is actually quite a fun
exercise you probably can find other
tweets about this um out there um but
yeah so anyways for text it's
very clear how this is done and then in
gato
what we did for images is we
compressed images so
to speak rather than keeping every pixel
with every intensity that would mean we
have a very long sequence right like if
we were talking about 100 by 100 pixel
images that would make the sequences far
too long so what was done there is you
just use a technique that essentially
compresses an image into maybe 16 by 16
patches of pixels and then each patch is
mapped again tokenized you essentially
quantize this space into
a special word that actually maps to
that little sequence of pixels and then
you put the patches together in some
raster order and that's how you get
out or in the image that you
process but there's no
semantic
aspect to that so you're doing some kind
of you don't need to understand anything
about the image in order to tokenize it
currently no you you're only using this
notion of compression so you're trying
to
find common patterns it's like jpeg or all these
algorithms it's actually very similar at
the tokenization level all we're doing
is finding common patterns and then
making sure
in a lossy way we compress these images
given the statistics of the images that
are contained in all the data we deal
with although you could probably argue
that jpeg
does have some understanding of images
like uh
because visual information
maybe color
compressing based crudely based on color
does capture some
something important about an image
that's about its meaning not just about
some statistics
yeah i mean jpeg as i said the
algorithms actually look very similar
they use the discrete
cosine transform in jpeg
um the the approach we usually do in
machine learning when we deal with
images and we do this quantization step
is a bit more data driven so rather than
have some sort of fourier basis for how
you know frequencies appear in natural
in the natural world
we actually just use
the statistics of the images and then
quantize them based on the statistics
much like you do in words right so
common substrings are
allocated a token um and images is very
similar but there's no
connection
the token space if you think of it
the tokens are integers at the end
of the day so now like we work with
maybe we have about let's say i don't
know the exact numbers but let's say 10
000 tokens for text right certainly more
than characters because we have groups
of characters and so on so from one to
ten thousand those are representing all
the language and the words we'll see and
then images occupy the next set of
integers so they're completely
independent right so from ten thousand
one to twenty thousand those are the
tokens that represent these other
modality images
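The patch-quantization-plus-offset scheme just described can be sketched in miniature. The patch size, codebook, and offsets below are invented for illustration; Gato's real quantizer is learned, not hand-written:

```python
# Hypothetical sketch of patch-based image tokenization: split an image
# into patches, map each patch to the nearest entry of a small codebook,
# and emit one integer token per patch in raster order. Image tokens are
# offset so they never collide with the text vocabulary.

def patches(image, size):
    """Split a 2D list of pixel intensities into size x size patches."""
    h, w = len(image), len(image[0])
    for r in range(0, h, size):
        for c in range(0, w, size):  # raster order: left-to-right, top-down
            yield [image[r + i][c + j] for i in range(size) for j in range(size)]

def nearest_code(patch, codebook):
    """Index of the codebook vector closest to the patch (squared L2)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: dist(patch, codebook[k]))

IMAGE_TOKEN_OFFSET = 10_000  # image tokens live after the text vocabulary

def tokenize_image(image, codebook, size=2):
    return [IMAGE_TOKEN_OFFSET + nearest_code(p, codebook) for p in patches(image, size)]

# toy 4x4 image and a 2-entry codebook of flattened 2x2 patches
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [9, 9, 0, 0],
         [9, 9, 0, 0]]
codebook = [[0, 0, 0, 0], [9, 9, 9, 9]]
print(tokenize_image(image, codebook))  # [10000, 10001, 10001, 10000]
```

The output integers all fall in the image range, disjoint from text tokens, which is exactly the orthogonality discussed next.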
and
that is an interesting
aspect that makes it orthogonal so what
connects these concepts is the data
right once you have a data set for
instance that captions images that tells
you oh this is someone playing a frisbee
on on a green field now
the model will need to predict the
tokens from the text green field to then
the pixels and that will start making
the connections between the tokens so
these connections happen as the
algorithm learns and then the last if we
think of these integers the first few
are words the next few are images in
gato we also allocated
the the highest
order of integers to actions right which
we discretize and actions
are very diverse right in atari there's
i don't know if 17 discrete actions in
robotics um actions might be torques and
forces that we apply so we just use kind
of similar ideas to compress these
actions into tokens and then
we just
that's how we map now all the space to
this sequence of integers but they
occupy different space and what connects
them is then the learning algorithm
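The action side of this layout can be sketched the same way: continuous controls are discretized into bins that occupy their own token range. All ranges and bin counts here are illustrative, not Gato's actual vocabulary sizes:

```python
# Sketch of disjoint token ranges per modality (invented numbers):
# the model sees one flat sequence of integers, and each modality
# owns its own slice of the vocabulary.

TEXT_RANGE = range(0, 10_000)         # subword tokens
IMAGE_RANGE = range(10_000, 20_000)   # quantized patch tokens
ACTION_RANGE = range(20_000, 21_024)  # discretized actions

def discretize_torque(torque, low=-1.0, high=1.0, bins=1024):
    """Map a continuous torque in [low, high] to one of `bins` action tokens."""
    clipped = max(low, min(high, torque))
    index = int((clipped - low) / (high - low) * (bins - 1))
    return ACTION_RANGE.start + index

def is_action(token):
    return token in ACTION_RANGE

print(discretize_torque(0.0))  # 20511: the middle bin, offset into the action range
print(is_action(discretize_torque(0.73)))  # True
```

Nothing in the integer ids themselves relates a torque token to a word token; as the conversation notes, those connections only emerge from the learning algorithm seeing them together in data.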
that's where the magic happens so the
modalities are
orthogonal to each other in token space
right so in the input
everything you add you add extra tokens
right and then
you're shoving all of that into one
place yes the transformer and that
transformer that transformer
tries
to look at this gigantic token space and
tries to form some kind of
representation some kind of
unique um
wisdom
about all of these different modalities
how's that
possible are they do if you were to sort
of like put your
psychoanalysis hat on and try to
psychoanalyze this neural network
is it schizophrenic
does it try to given this very few
weights
represent multiple disjoint things and
somehow
have them not interfere with each other
or is this about building on the
um
on the joint strength on whatever is
common to all the different modalities
like what
if you were to ask is it
schizophrenic or is it
of one mind
i mean it is one mind um and it's
actually the simplest algorithm
which um in a way is how it
feels like the field
hasn't changed since backpropagation
and gradient descent were proposed for
learning neural networks so
there is obviously details on the
architecture this has evolved the
current iteration
is still the transformer which is
a powerful
sequence modeling architecture but then
the goal of this
you know setting these weights to
predict the data is essentially the same
as basically i could describe i mean we
described a few years ago for alphastar
language modeling and so on right we
take let's say an atari game um we map
it to a string of numbers that will
probably be image space and action space
interleaved and all we're gonna do is
say okay
given the numbers you know ten thousand
one ten thousand four ten thousand five
the next number that comes is twenty
thousand six which is in the action
space
and you're just
optimizing these weights via a very
simple
gradient mathematically it's
almost the most boring algorithm you
could imagine we set the weights so
that given this particular instance
these weights maximize the
probability of having seen this
particular sequence of integers for this
particular game
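Maximizing the probability of the observed sequence is the same as minimizing the summed negative log-likelihood of each next token. A toy sketch, with a closed-form bigram table standing in for the transformer (the token ids are hypothetical):

```python
# Sketch of the training objective described above: set the model so the
# observed token sequence is as probable as possible, i.e. minimize the
# summed negative log-likelihood of each next token. A bigram count
# table plays the role of the trained weights here.
import math
from collections import defaultdict

def fit_bigram(sequences):
    """Maximum-likelihood bigram probabilities (the closed-form optimum)."""
    counts, totals = defaultdict(int), defaultdict(int)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
            totals[a] += 1
    return {(a, b): c / totals[a] for (a, b), c in counts.items()}

def sequence_nll(sequence, probs):
    """Negative log-likelihood of a sequence under next-token probs."""
    return sum(-math.log(probs[(a, b)]) for a, b in zip(sequence, sequence[1:]))

# hypothetical interleaved tokens: 10_00x = image patches, 20_00x = actions
data = [[10_001, 10_004, 10_005, 20_006],
        [10_001, 10_004, 10_005, 20_006]]
probs = fit_bigram(data)
print(sequence_nll(data[0], probs))  # 0.0: the fitted model is certain
```

A transformer does the same job with shared weights and gradient descent instead of counting, but the quantity being driven down is this same negative log-likelihood.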
and then
the algorithm does this for many many
many iterations um looking at different
modalities different games right that's
the mixture of the data set we discuss
so in a way it's a very simple algorithm
and
the weights right they're all shared
right so in terms of is it focusing on
one modality or not the intermediate
weights that are converting from these
input of integers to the target integer
you're predicting next those weights
certainly are common and then the way
the tokenization happens there is there
is a special place in the neural network
which is where we map this integer like
number 1001 to a vector of real numbers
um we can optimize them
with gradient descent right the the
functions we learn are actually um
surprisingly differentiable that's why
we compute gradients so this this step
is the only one that this orthogonality
you mentioned applies so
mapping
a certain token for text or image or
actions this
each of these tokens gets its own little
vector of real numbers that represents
this if you look at the field back many
years ago people were talking about word
vectors or word embeddings
these are the same we have word vectors
or embeddings we have image vector or
embeddings and action vector of
embeddings and the beauty here is that
as you train this model if you visualize
these little vectors um it might be that
they start aligning even though
they're independent parameters there
there could be anything but then it
might be that you take the word gato or
cat which maybe is common enough that
actually has its own token and then you
take pixels that have a cat and you
might start seeing
that these vectors look like they align
right so by learning from this vast
amount of data
the model
is realizing the potential connections
between these modalities now i will say
there would be another way at least in
part to not have these
different
vectors for each different modality
for instance when i tell you about
actions in certain space
i'm defining actions by words right so
you could imagine a world in which i'm
not learning
that the action up in atari is its own
number
the action up in atari maybe is
literally the word or the sentence up
in atari right and that would mean we
now leverage much more from the language
this is not what we did here but
certainly it might make these
connections much easier to learn and
also to teach the model to correct its
own actions and so on right so all these
to to say that gato is indeed the
beginning that it is it is a radical
idea to do this this way but there's
probably a lot more to be done and the
results to be more impressive not only
through scale but also through some
new research that will come hopefully in
the years to come so just to elaborate
quickly you mean
one possible
next step
or
one of the paths that you might take
next is
doing the tokenization fundamentally as
a kind of uh
linguistic communication so like you
convert even images into language so
doing something like a crude
semantic segmentation
trying to just assign a bunch of words
to an image that
like have
almost like a dumb entity explaining as
much as you can about the the image and
so you convert that into words and then
you convert games into words and and you
provide the context and words and all of
it
and eventually
getting to a point where everybody
agrees with noam chomsky that language
is actually at the core of everything
that's it's the base layer of
intelligence and consciousness and all
that kind of stuff okay
uh you mentioned early on that it's
hard to grow what did you mean by that
because we're talking about scale might
change
uh there might be and we'll talk about
this too like there's a
emergent
there's certain things about these
neural networks that are emerging so
certain like performance we can see only
with scale and there's some kind of
threshold of scale so it
why is it hard to grow something like
this meow network
so
the meow network
is not hard to grow if you
retrain it yeah what's hard is well we
have now one billion parameters um we
train them for a while we we spend some
amount of work towards building these
these weights that are an amazing
initial brain for doing this kind of
tasks we care about
could we reuse the weights
and expand to a larger brain and that is
extraordinarily hard but also
exciting from a research perspective and
a practical point of view
right so
there's this notion of
modularity in software engineering and
we're starting to see some examples and
work that leverages modularity in fact
if we go back one step from gato to a
work that i would say trained a much
larger much more capable network called
flamingo flamingo did not deal with
actions but it definitely dealt with
images in an interesting way
kind of akin to what gato did but
slightly different technique for
tokenizing but we don't need to go into
that detail but
what flamingo also did which gato didn't
do and that just happens because these
projects you know they're different
it's a bit of
like the exploratory nature of research
which is great the research behind these
projects is also modular yes exactly um
and it has to be right we need we need
to have creativity um and sometimes you
need to protect pockets of you know
people researchers and so on but we
believe in humans yes okay and also in
particular researchers and maybe even
further you know deep mine or or other
such labs and then they act the neural
networks themselves so it's modularity
all the way down okay all the way down
so the way that we did modularity very
beautifully in flamingo is we took
chinchilla which is a language only
model not an agent if we think of
actions being necessary for agency so we
took chinchilla we took the weights of
chinchilla
and then we froze them we said these
don't change we train them to be very
good at predicting the next word is a
very good language model state of the
art at the time you release it etc etc
going to add a capability to see right we
are going to add the ability to see to
this language model so we're going to
attach
um small pieces of neural networks at
the right places in the model it's
almost like
injecting
the network with some weights and some
substructures
in the right ways right so you
need the research to say what is
effective how do you add this capability
without destroying others etc so we
created a small sub network
initialized not from random but actually
from self-supervised learning
from a model that understands vision
in general and then
we took data sets that connect the two
modalities vision and language and then
we froze the main part the largest
portion of the network which was
chinchilla that is 70 billion parameters
and then we added a few more parameters
on top train from scratch
and then some others that were
pre-trained with the
capacity to see it was not
tokenization in the way i described
for gato but it's a similar idea
and then we train the whole system parts
of it were frozen parts of it were new
and all of a sudden we developed
flamingo which is an amazing model that
is essentially i mean describing it is
a chat bot where you can also upload
images and start conversing about images
um but it's also kind of a dialogue
style um
uh chatbot so the input is images and
text and the output is text exactly
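The freezing idea behind this can be sketched in miniature: gradient updates touch only the small added module while the pretrained weight stays fixed. The numbers below are invented toys, not Flamingo's real architecture:

```python
# Miniature sketch of frozen-backbone training (illustrative toy, not
# Flamingo's actual setup): the pretrained base weight is never updated;
# gradient descent only touches the small added adapter.

def forward(x, base_w, adapter_w):
    # frozen pretrained transform, then the new trainable module on top
    return adapter_w * (base_w * x)

def train_adapter(pairs, base_w, adapter_w, lr=0.04, steps=200):
    """Gradient descent on squared error, updating ONLY adapter_w."""
    for _ in range(steps):
        grad = 0.0
        for x, y in pairs:
            pred = forward(x, base_w, adapter_w)
            grad += 2 * (pred - y) * base_w * x  # d(loss)/d(adapter_w)
        adapter_w -= lr * grad / len(pairs)
        # base_w is deliberately left untouched: it stays frozen
    return adapter_w

base_w = 2.0                       # pretend this came from pretraining
pairs = [(1.0, 6.0), (2.0, 12.0)]  # targets consistent with adapter_w = 3
adapter_w = train_adapter(pairs, base_w, adapter_w=0.0)
print(round(adapter_w, 3))  # 3.0
```

The design choice is the same at any scale: the frozen part keeps everything it learned, and only the cheap new parameters adapt to the new modality.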
um and how many parameters you said 70
billion 70 billion for chinchilla yeah
chinchilla is 70 billion and then the
ones we add on top are almost
like a way to override its
activations so that when it
sees vision it does a correct
computation of what it's seeing mapping
it back to words so to speak um that adds
an extra 10 billion parameters right so
it's total 80 billion the largest one we
released and then
you train it on
a few data sets that contain vision and
language and once you interact with the
model you start seeing that you can
upload an image and start sort of having
a dialogue about the image um which is
very similar and akin to what we saw in
language-only models the prompting
abilities that it has you can teach it a
new vision task right it does things beyond
the capabilities that in theory the data
sets um provided in themselves but
because it leverages a lot of the
language knowledge acquired from
chinchilla it actually has this few shot
learning ability and these emerging
abilities that we didn't even measure
once we were developing the model but
once developed then
as you play with the interface you can
start seeing wow okay yeah it's cool we
can upload i think one of the
tweets since we're talking about twitter
was this image of obama that is
placing a weight while someone is kind
of weighing themselves and it's kind
of a joke-style image and it's notable
because i think andrej karpathy a few
years ago said
no computer vision system can can
understand the subtlety of this joke in
this image all the things that go on and
so what we try to do and it's very
anecdotally i mean this is not a proof
that we solved this issue but
it just shows that you can upload now
this image and start conversing with the
model trying to make out if it if it
gets that there's a joke um because the
person weighing themselves
doesn't see that someone behind is
making the weight higher and so on and
so forth so it's a fascinating
capability
um and it comes from this key idea of
modularity where we took a frozen brain
and we just
added a new capability so
the question is
should we so in a way you can see even
from deepmind we have flamingo with this
modular approach which could
leverage the scale a bit more reasonably
because we didn't need to retrain a
system from scratch and on the
other hand we had gato which used the
same data sets but trained
from scratch right and so i guess
big question for the community is
should we train from scratch or should
we embrace modularity and this goes back
to modularity as a way to grow reuse
seems natural and it was very
effective certainly the next question is
if you go the way of modularity
is there a systematic way
of freezing weights and joining
different modalities
across
you know not just two or three or four
networks but hundreds of networks from
all different kinds of places maybe open
source network that looks at weather
patterns
and you shove that in somehow and then
you have networks that uh i don't know
do all kinds of to play starcraft and
play all the other video games and they
you can keep adding them in
without significant effort
like that maybe the effort scales
linearly or something like that as
opposed to like the more network you add
the more you have to worry about the
instabilities created yeah so that that
vision is beautiful i think
um there's still the question about
within single modalities like chinchilla
was reused but now if we train a next
iteration of language models are we
going to use chinchilla or not yeah how
do you swap out chinchilla right so
there's there's still big questions but
that idea is is actually really akin to
software engineering which we're not
re-implementing you know libraries from
scratch we're reusing and then building
ever more amazing things including
neural networks with software that we're
using so i think this idea of modularity
i like it i think it's here to stay and
that's also why i mentioned it's just
the beginning not the end
you mentioned meta-learning so given
this promise of gato
can we try to redefine this term
that's almost akin to consciousness
because it means different things to
different people throughout the history
of artificial intelligence but what do
you think meta-learning
is and looks like now in five years
10 years will it look like a system
like gato but scaled
what's your sense of what is
what does meta-learning look like
do you think with all the wisdom
we've learned so far yeah great
question maybe it's good to give
another data point looking backwards
rather than forward so
when we talked
um in 2019
uh
meta-learning
meant something that has changed mostly
through the
revolution of gpt-3 and beyond so what
meta-learning meant at the time um
was driven by what benchmarks people
cared about in meta-learning and the
benchmarks were about
a capability to learn about object
identities so it was very much over
fitted to vision and object
classification and the part that was
meta about it was that oh we're not just
learning a thousand categories that
imagenet tells us to learn we're gonna
learn
object categories that can be defined
when we interact with the model so
it's interesting to see the evolution
right the way the way this started was
we have a special language that was a
data set a small data set that we
prompted the model with saying hey here
is a new classification task
i'll give you one image and the name
which was an integer at the time of the
image and a different image and so on so
you have a small prompt in the form of a
data set a machine learning data set and
then you got then a system that could
then predict or classify these objects
that you just defined kind of on the fly
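This "prompt as a small dataset" setup can be sketched as assembling a textual episode from a handful of labeled examples and a query; the exact format and names below are hypothetical:

```python
# Sketch of defining a classification task on the fly via a prompt:
# a small support set of (observation, label) pairs is serialized into
# text, followed by an unlabeled query for the model to complete.
# The prompt format here is made up for illustration.

def build_few_shot_prompt(support_set, query):
    """support_set: list of (observation, label) pairs defining the task."""
    lines = ["New classification task:"]
    for observation, label in support_set:
        lines.append(f"input: {observation} -> label: {label}")
    lines.append(f"input: {query} -> label:")  # model fills in the answer
    return "\n".join(lines)

support = [("a small striped cat", 0), ("a large spotted dog", 1)]
prompt = build_few_shot_prompt(support, "a tiny tabby cat")
print(prompt)
```

Early meta-learning systems consumed such episodes through a special interface; the shift described next is that a language model can take the same episode as plain text.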
so
fast forward
it was
revealed that
language models are few-shot learners
that's the title of the paper so very
good title sometimes titles are really
good so this one is really really good
because that's that's the point of gpt3
that showed that look
sure we can focus on object
classification and what meta-learning
means within the space of
learning object categories this goes
beyond or before rather to omniglot
before imagenet and so on so there's a
few benchmarks to now all of a sudden
we're a bit unlocked from benchmarks and
through language we can define tasks
right so we're literally telling the
model some logical task or little thing
that we wanted to do
we prompted much like we did before but
now we prompt it through natural
language and then
not perfectly i mean these models have
failure modes and that's fine but
but these models are now doing a
new task right so they meta-learned um
these new capabilities now
now that's where we are now uh flamingo
expanded this to visual and language but
it basically has the same abilities you
can teach it for instance
an emergent property was that you can
take pictures of numbers and then do
arithmetic with the numbers just by
teaching it i mean when i
show you three plus six you know i want
you to output nine and you show it a
few examples and now it does that so it
went way beyond the imagenet
sort of categorization of
images where we were a bit stuck maybe
before um this
revelation moment that happened uh in
2019 i believe but it was
after we last chatted and in that way it
has solved meta-learning as it was
previously defined
yes it expanded what it meant so that's
what you say what does it mean so it's
an evolving term um but here is
maybe now looking forward looking at
what's happening um you know obviously
in the community with more modalities um
what we can expect and i would certainly
hope to see the following and this is a
pretty drastic
hope but in five years maybe we chat
again
and
we have a system right a set of weights
that
we can teach it to play starcraft
maybe not at the level of alpha star but
play starcraft a complex game we teach
it through interactions to prompting you
can certainly prompt a system that's
what gato shows to play some simple
atari games so imagine if you start
talking to a system teaching it a new
game showing it examples of you know in
this in this particular game
this user did something good maybe the
system can even play and ask you
questions say hey i played this game i
just played this game did i do well can
you teach me more so
five maybe to ten years these
capabilities
or what meta-learning means will be much
more interactive much more rich and
through domains that we used to
specialize in right so you see the
difference right we built alphastar
specialized
to play starcraft the algorithms were
general but the weights were specialized
and what what we're hoping is that we
can teach a network to play games to
play any game just using games as an
example through interacting with it
teaching it uploading the wikipedia page
of starcraft like this is
on the horizon and obviously the
details need to be filled in and
research needs to be done but that's how
i see meta-learning evolving which is
gonna be beyond prompting it's gonna be
a bit more interactive it's gonna you
know the system might tell us to give it
feedback after it maybe makes mistakes
or it loses a game um but it's
nonetheless very exciting because if you
think about this this way the benchmarks
are already there we just repurpose them
the benchmarks right so in a way
i like to map the space of
what maybe agi means to say okay like
we went to 100 percent performance
in go in chess in starcraft the next
iteration might be
20 percent performance across quote
unquote all tasks
right and even if it's not as good it's
fine we actually have ways to also
measure progress because we have those
specialized agents um and
so on so this is to me very exciting and
these next iteration models are
definitely hinting at that direction of
progress um which hopefully we can have
there are obviously some things that
could go wrong in terms of we might not
have the tools maybe transformers are
not enough and there are some
breakthroughs to come which makes the
field more exciting to people like me as
well of course
but that's if i if you ask me five to
ten years you might see these models
that start to look more like
weights that are already trained and
then
it's more about teaching or
meta-learning what you're
trying to induce in terms
of tasks and so on well beyond the
simple tasks we're now starting to see
emerge like you know small arithmetic
tasks and so on so a few questions
around that this is fascinating uh so
that kind of teaching is interactive
it's beyond prompting it's interacting
with the neural network
that's different than the training
process
so it's different than the
optimization over differentiable
uh functions this is already trained and
now you're teaching
i mean um
it's almost like akin to the brain then
the the neurons already set with their
connections on top of that you know
using that infrastructure to build up
further knowledge
okay
so
that's a really interesting distinction
that's actually not obvious from a
software engineering perspective that
there's a line to be drawn
because you always think for a neural
network to learn it has to be
trained and retrained
but maybe
prompting is a way of
teaching giving the network a little
bit of context about whatever the heck
you're trying to get it to do so you can
maybe expand this prompting capability
by making it interactive that's really
really yeah by the way this is not new if
you look way back um at different
ways to tackle even classification tasks
this comes from
long-standing literature in machine
learning um what i'm suggesting could
sound to some a bit like nearest
neighbor nearest neighbor is almost
the simplest algorithm
that does not require learning so it has
this interesting property that you don't
need to compute gradients and what
nearest neighbor does is you quote
unquote upload a data set and then
all you need to do is a way to measure
distance between points and then to
classify a new point you're just simply
computing what's the closest point in
this massive amount of data and that's
my answer so you can think of prompting
in a way as uploading not
just simple points and you know the
metric is not the distance between the
images or something simple it's
something that you compute that's much
more advanced but in a way it's very
similar right you you simply are
uploading some
knowledge to this pre-trained system in
nearest neighbor maybe the metric is
learned or not but you don't need to
further train it and then now you
immediately get a classifier
out of this right now it's just an
evolution of that concept very classical
concept in machine learning which is um
yeah just learning through what's the
closest point closest by some distance
and that's it yeah it's an evolution of
that and i will say
how i saw meta-learning when we
worked um on a few ideas in 2016 was
precisely through the lens of
nearest neighbor which is very common in
computer vision community right there's
a very active area of research about how
do you compute the distance between two
images but if you have a good distance
metric you you also have a good
classifier right all i'm saying is now
these distances and and the points are
not just images they're like
words or sequences of
words and images and actions that teach
you something new but
it might be that technique-wise
those come back and i will say it's
not necessarily true that you would
never train the weights a bit further
some techniques in meta-learning
do actually do a bit of fine-tuning as
it's called right they train the weights
a little bit when they get a new task so
as to the how or how we're gonna
achieve this
um as a deep learner i'm very skeptical
we're gonna try a few things whether
it's a bit of training adding a few
parameters thinking of this as nearest
neighbor or just
simply thinking of there's a sequence of
words it's a prefix
and that's the new classifier
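For comparison, the nearest-neighbor baseline discussed above really is this simple, no gradients at all (the data points and distance here are toy stand-ins):

```python
# Simplest form of the nearest-neighbor idea: "upload" a labeled dataset,
# then classify each new point by the label of its closest stored point.
# No training, no gradient computation.

def nearest_neighbor(dataset, query, distance):
    """dataset: list of (point, label); returns the label of the closest point."""
    _best_point, best_label = min(dataset, key=lambda pl: distance(pl[0], query))
    return best_label

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

data = [((0.0, 0.0), "cat"), ((5.0, 5.0), "dog")]
print(nearest_neighbor(data, (1.0, 0.5), euclidean))  # cat
```

Prompting generalizes this picture: the "dataset" becomes a prefix of tokens and the "distance" becomes whatever comparison the pretrained network computes internally.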
we'll see right there's there's the
beauty of research but um but what's
what's important is that is a good goal
in itself that i see as very worthwhile
pursuing for the next stages of
not only meta learning i think this is
basically
what's exciting about machine learning
period to me well the and then the
interactive aspect of that is also very
interesting the interactive version of
nearest neighbor
yeah to help you uh
pull out the
classifier from this giant thing
okay
uh is is this the way we can go in five
ten
plus years
uh from any task so sorry from many
tasks to any task
so and what does that mean like what
does it need to be actually trained on
which point is the network had enough
so what um
what does a network need to learn about
this world in order to be able to
perform any task is it just as simple as
language
image and action or do you need
some set of representative images
uh like
if you only see land images will you
know anything about underwater is that
somehow fundamentally different i don't
know those i mean those are upward
questions i would say i mean the way you
put let me maybe further your example
right if if all you see is land images
but you're reading all about land and
water worlds but in books right imagine
like
would that be enough i mean good
question we don't know but i guess maybe
you can you can you can join us if you
want in our quest to find this that's
that's precisely water world yeah yes
that's precisely i mean the beauty of
research and and
that's the
research business we're in i guess
is to figure this out and ask the right
questions and then iterate with with the
whole community
um publishing like
findings and so on uh but yeah these are
this is a question it's not the only
question but it's certainly as you ask
is is on my mind constantly right and so
we'll we'll need to wait for
maybe the let's say five years let's
hope it's it's not 10 to to see what
what are the answers um
some people will largely believe in
unsupervised or self-supervised learning
of single modalities
and then crossing them
some people might think end-to-end
learning is the answer um modularity is
maybe the answer so we don't know but
we're just definitely excited to find
out but it feels like this is the right
time and we're at the beginning of this
yeah we're finally ready to do these
kind of general
big models and agents
what do you sort of specific technical
thing about gato flamingo
chinchilla
gopher any of these that is especially
beautiful that was surprising
maybe is there something that just jumps
out at you
of course there's the general thing of
like you didn't think it was possible
and then
you realize it's possible in terms of
the generalizability across modalities
and all that kind of stuff or maybe
how small of a network relatively
speaking gato was all that kind of stuff
but is there some weird
little things that were surprising
look i
i'll give you an answer that's very
important because
maybe people
don't quite realize this but
the teams behind these efforts the
actual humans yeah that's maybe the
surprising
um you know obviously positive way so
anytime you see these these
breakthroughs i mean it's easy to map it
to a few people there's people that are
great at explaining things and so on
that's very nice but
maybe the learnings or the
method learnings that i get as a human
about this is um sure we can move
forward um
and but the surprising bit is
how
how important are all the pieces of
these projects
how do they come together so i'll give
you
uh maybe some of the ingredients of
success that are common across these um
but not the obvious ones in machine
learning i can always also give
you those but
basically
there is
engineering is critical so so very good
engineering uh because ultimately we're
collecting um data sets right so the
engineering of data and then of
deploying the models at scale um into
some compute cluster that cannot go
understated that is a huge factor of
success
and it's hard to believe that
details matter so much
we would like to believe that it's true
that there is more and more of a
standard formula as i was saying like
this recipe that works for everything
but then when you zoom into each of
these projects you realize the
devil is indeed in the details and
then the teams
have to work kind of together towards
these goals um so engineering of data
and obviously clusters and large scale
is very important and then
one that is often
not maybe nowadays it is more clear
is
benchmark progress right so we're
talking here about multiple months of
you know tens of researchers um
and people that are trying to organize
the research and so on working together
and
you don't know that you can get there i
mean this is the
beauty like if you're not risking
trying to do something that feels
impossible you're not gonna get there um
but you need the way to measure progress
so the benchmarks that you build are
critical um i've seen this beautifully
play out in many projects i mean
maybe the one i've seen it more
consistently
which means we we established the metric
actually the community did and then we
leverage that massively is alphafold
this is a project where
the data the metrics were all there and
all it took was and it's easier said
than done an amazing team working
not to try to find some incremental
improvement and publish which which is
one way to do research that is valid but
aim very high and work literally for
years
to iterate over that process and working
for years with the team i mean
it is tricky that also
happened to happen partly during a
pandemic and so on um so i think my meta
learning from all these is
the teams are critical to the success
and then if now going to the machine
learning the part that's surprising
is
um
so we like architectures like neural
networks um and
i would say this was a very rapidly
evolving field until the transformer
came so attention might indeed be all
you need which is the title also a good
title although
in hindsight it's good i don't think at
the time i thought this is a great title
for a paper but
that architecture is proving that
the dream of modeling sequences of any
bytes
there is something there that will stick
and i think these advances in
architectures in kind of how neural
networks are architected to do what
they do
um it's been hard to find one that has
been so stable and relatively has
changed very little
since it was invented
five or so years ago so that is a
surprise that
keeps recurring into other projects try
to
on a philosophical or technical level
introspect what is the magic of
attention
what is attention
there's attention in people that study
cognition so human attention i think
there's giant wars over what attention
means
how it works in the human mind so
let's look at what attention
is in a neural network
from the days of attention is all you
need but broadly do you think there's a
general principle that's that's really
powerful here yeah so a distinction
between transformers and lstms which
were what came before and you know
there was a transitional period where
you could use both in fact
when we talked about alpha star we used
transformers and lstms so it was still
the beginning of transformers they were
very powerful but lstms were still
also very powerful sequence models
so
the power of the transformer
is that it has built in
what we call an inductive bias of
attention
that makes the model when you think
of a sequence of integers right like we
discussed this before right this is the
sequence of words
um
when you have to do very hard
tasks over these words this could be
we're gonna translate a whole paragraph
or we're gonna predict the next
paragraph given ten paragraphs before
there's some
loose
intuition from how we do it as a human
that is very
nicely mimicked and replicated
structurally speaking in the transformer
which is this idea of
you're looking for something
right so sort of when you
just read a piece of text now
you're thinking what comes next
you might want to re-look at the text
or look at it from scratch i mean
literally it is because there's no
recurrence you're just thinking what
comes next and
it's almost hypothesis driven right so
if i'm thinking the next word that
i'll write is cat or dog okay um
the way the transformer works
almost philosophically is it
has these two hypotheses is it
gonna be cat or is it gonna be dog and
then
it says okay if it's cat i'm gonna look
for certain words not necessarily cat
although cat is an obvious word you
would look in the past to see whether it
makes more sense to output cat or dog
and then it does some very
deep computation over the words and
beyond right so it combines the words
and
but it has the query as we call it that
is cat
and then similarly for dog right and so
it's a very computational
way to think about
look if i'm thinking deeply about
text i need to go back to to look at all
the texts attend over it but it's not
just attention like what is guiding
the attention and that was the key
insight from an earlier paper is not
how far away is it i mean how far away
is it is important what did i
just write about that's critical but
what you wrote about
10 pages ago might also be critical
so
you're looking not positionally but
content-wise right and transformers
have this beautiful way to query for
certain content and pull it out in a
compressed way so then you can make a
more informed decision i mean that's one
way to explain transformers um but i
think it's a very
powerful inductive bias
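The content-based querying described here is, at its core, scaled dot-product attention. This is a minimal NumPy sketch for illustration only: the toy keys, values, and the cat-or-dog style query are made up, not taken from any DeepMind model.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each query scores every key
    by content, then pulls out a weighted (compressed) mix of values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # similarity of the query to each past position
    weights = softmax(scores)       # how much to attend to each position
    return weights @ V              # content pulled out in compressed form

# toy example: one query ("is it cat or dog?") over 4 past positions
rng = np.random.default_rng(0)
K = rng.normal(size=(4, 8))                  # keys for 4 past tokens
V = rng.normal(size=(4, 8))                  # values for those tokens
Q = K[2:3] + 0.1 * rng.normal(size=(1, 8))   # a query resembling token 2's content
out = attention(Q, K, V)
print(out.shape)  # (1, 8)
```

The point of the sketch is that the lookup is by content, not by position: the query matches whichever past key is most similar to it, wherever it sits in the sequence.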
there might be some details that might
change over time but
i think that is
what makes transformers so much more
powerful than the recurrent networks
that were more recency biased
which obviously works in some tasks but
it has major flaws transformer itself
has flaws
and i think the main one the main
challenge is these prompts that we
just were talking about
they can be a thousand words long but if
i'm teaching you starcraft i mean i'll
have to show you videos i have
to point you to whole wikipedia articles
about the game um we'll have to interact
probably as you play you'll ask me
questions the context required for us to
achieve
me being a good teacher to you on the
game as you would want to do it with a
model
i think goes well beyond the
current capabilities um so the question
is how do we benchmark this and
then how do we change the structure of
the architectures i think there's ideas
on both sides but
we'll have to see empirically right
obviously what ends up working
and as you talked about some of the
ideas could be you know keeping the
constraint of that length in place but
then forming like hierarchical
representations
to where you can start being much more clever
in how you use those thousand tokens
yeah that's really interesting but it
also is possible that this attention
mechanism where basically you don't
have a recency bias but you look
more generally you make it learnable
the mechanism in which way you look back
into the past you make that learnable
it's also possible we're at the very
beginning of that
because
you might become smarter and smarter in
the way
you query the past
so recent past and distant past and
maybe very very distant past so almost
like the attention mechanism
will have to
improve and evolve as good as the
tokenization mechanism so you can
represent long-term memory somehow yes
and i mean hierarchies are very i
mean it's a very nice word that sounds
appealing um there's lots of work adding
hierarchy to the memories um in practice
it does seem like we keep coming back to
the main formula or main
architecture
that sometimes tells us something
there's such a sentence that a friend of
mine told me like whether it wants to
work or not so transformer was clearly
an idea that wanted to work and then
i think there's some principles we
believe will be needed but finding the
exact details details matter so much
right that's gonna be tricky i love the
idea that there's like
you as a
human being you want you want some ideas
to work and then there's the model that
wants some ideas to work and you get to
have a conversation to see which
more likely the model will win in the
end
because it's the one you don't have to
do any work the model is the one that
has to do the work so you should listen
to the model and i really love this idea
that you talked about the humans in this
picture if i could just briefly ask um
one is you're saying
the benchmarks
about the modular humans working on this
uh the benchmarks providing a sturdy
ground of a wish to do these things that
seem impossible
they they give you
in the darkest of times give you hope
because little signs of improvement you
get yes like
somehow you're not lost if you have
metrics to measure your your improvement
and then there's other aspect
you said elsewhere and here today like
titles matter
i wonder
how much humans matter in the evolution
of all this
meaning individual humans
you know something about their
interaction something about their ideas
how much they
change the direction of all this like if
you change the humans in this picture
like is it that the model is sitting
there
and it wants some idea to
work or is it the humans or maybe the
model is providing you 20 ideas that
could work and depending on the humans
you pick they're they're going to be
able to hear some of those ideas like in
in all the because you're now directing
all of deep learning at deepmind you get
to interact with a lot of projects a lot
of brilliant researchers
um
how much variability is created by the
humans in all of this yeah i mean i
do believe humans matter a lot at the
very least at the
you know time scale of years
on when things are happening and what's
the sequencing of it right so you get to
interact with
people that i mean you mentioned this um
some people really
want some idea to work and they'll
persist um and then some other people
might be more practical like i don't
care
what idea works i care about you know
cracking protein folding yes um and
these at least these two kind of seem
opposite sides we need both and we've
clearly had both
um historically and that made certain
things happen earlier or later so
definitely humans involved in all of
this endeavor have had
i would say years of change or of
ordering how how things have happened
which breakthroughs came before which
other breakthroughs and so on so
certainly that does happen
and so
one other maybe one other axis of
distinction is
what i called and this is most commonly
used in reinforcement learning is the
exploration exploitation trade-off as
well it's not exactly what i meant
although quite related so
when you
start
trying to help others right like you
you're you're you know you're you become
a bit more of a mentor to a large group
of people be it a project or the deep
learning team or something um or even in
the community when you interact with
people in conferences and so on
um you're identifying
quickly right um some things that
are explorative or exploitative and
it's tempting to try to guide people
obviously i mean that's what makes like
our experience we bring it and we try to
shape things um sometimes wrongly and
there's many times that i've been wrong
in the past that's great
but
it would be wrong to
dismiss any sort of
the research styles that i'm
observing um and i often get asked well
you're in industry right so we do have
access to large compute scale and so on
so there's certain kinds of research i
almost feel like we need to do
responsibly and so on but it is kind of
we have the particle accelerator here so
to speak in physics so we need to use it
we need to answer the questions that we
should be answering right now for the
scientific progress but then at the same
time i look at many advances including
attention which was discovered
in montreal initially because of lack of
compute right so we were working on
sequence to sequence um with my
friends over at google brain at the time
and we were using i think eight gpus
which was somehow a lot at the time and
then i think montreal was a bit more
limited in the scale but then they
discovered this content-based attention
concept that then has obviously
triggered things like transformer not
everything obviously starts at the
transformer there's always a history that
is important to recognize because then
you can make sure that then those who
might feel now well we don't have so
much compute
you need to then
help them
optimize
the kind of research that might
actually produce amazing change perhaps
it's not
as short term as some of these
advancements or perhaps it's a different
time scale but um the people and the
diversity of the field is quite critical
to that we maintain it and at times
especially mixed a bit with hype or
other things it's it's a bit tricky to
be observing um maybe too much of the
same thinking across the board um but
the humans definitely are critical and i
can think of yeah quite a few personal
examples where also
someone told me something that had a
huge effect on some
idea and then that's why i'm saying at
least in terms of years
probably some things do happen yeah and
it's also fascinating how constraints
somehow are essential for innovation
um
and the other thing you mentioned about
engineering i have a sneaking suspicion
maybe i
over
you know my love is with engineering so
i have a sneaking suspicion that all the
genius
a large percentage of the genius is in
the tiny details of engineering
so like i think
we like to think our genius our the
genius is in the big ideas
there's i have a sneaking suspicion that
like because i've seen the genius of
details of engineering details
make the night and day
difference and i wonder if those kind of
have a ripple effect over time
so that that too so that's that's sort
of the taking the engineering
perspective that sometimes that quiet
innovation at the level of an individual
engineer or maybe at the small scale of
a few engineers can make all the
difference that scales
because
we're working on computers that are
scaled across large groups
that one engineering decision can lead
to ripple effects yes it's interesting
to think about yeah i mean engineering
there's also
kind of a historical
it might be a bit random
because if you think of the history of
how especially deep learning and neural
networks took off feels like
a bit random because gpus happened to be
there at the right time for a different
purpose which was to play video games so
even the engineering that goes into the
hardware
and it might have a time like the time
frame might be very different i mean
the gpus evolved throughout
many years when we weren't even
looking at that right so even at that
level right that revolution so to speak
um
the ripples are like
we'll see when they stop right but in
terms of thinking of why is this
happening right there's i think
that when i try to categorize it in sort
of
things that might not be so obvious i
mean clearly there's a hardware
revolution we are
surfing thanks to that um data centers
as well i mean data centers
are where like i mean at google for
instance obviously they're serving
google but there's also now thanks to
that and to have built such amazing data
centers we can train these models um
software is an important one i think
if i look at the state of how i had to
implement things to implement my ideas
how i discarded ideas because they were
too hard to implement
yeah clearly the times have
changed and thankfully we are in a much
better software position as well
and then
i mean obviously there's research that
happens at scale and more people enter
the field that's great to see but it's
almost enabled by these other things and
last but not least is also data right
curating data sets labeling data sets
these benchmarks we think about maybe
we'll want to have all the
benchmarks in one system but it's still
very valuable that someone put the
thought and the time and the vision to
build certain benchmarks we've
seen progress thanks to but
we're gonna repurpose the benchmarks
that's the beauty of atari
is like
we solved it in a way but
we use it in gato it was critical and
i'm sure there's still a
lot more to do thanks to that amazing
benchmark that someone took the time to
put even though at the time maybe
oh you have to think what's the next
you know iteration of architectures
that's what maybe the field recognizes
but we need to that's another thing we
need to balance in terms of humans
behind we need to recognize all these
aspects because they're all critical and
we tend to
yeah we tend to think of the genius the
scientists and so on but i'm glad
you're i know you have a strong engineering
background but also
i'm a lover of data and as a
pushback on the engineering comment
ultimately could be the the creators of
benchmarks who have the most impact
andrej karpathy who you mentioned has
recently been talking a lot of trash
about imagenet which he has the right to
do because of
how essential he is to the development
and the success of deep learning around
uh imagenet and you're saying that
that's actually that benchmark is
holding back the field
because i mean especially in his context
on tesla autopilot that's looking at
real world behavior of a system
it's
you you there's something fundamentally
missing about imagenet that doesn't
capture the real worldness of things
that we need to have the datasets
benchmarks that
have the impressive unpredictability the
edge cases the whatever the heck it is
that makes the real world so
difficult to operate in we need to have
benchmarks with that so
but
just to think about the impact of
imagenet as a benchmark
and
that really puts a lot of emphasis on
the importance of a benchmark both sort
of internally a deep mind and as a
community so
one is coming in from within like
how do i create a benchmark for me
to
mark and make progress and how do i make
a benchmark for the community to mark
and push
progress you have this
amazing paper you co-authored a survey
paper called emergent abilities of large
language models
has again the philosophy here that i'd
love to ask you about
what's the intuition about the phenomena
of emergence in neural networks
transformers language models
is there a magic
threshold beyond which we start to see
certain performance
and is that different from task to task
is that us humans just being poetic and
romantic or is there literally some
level of which we start to see
breakthrough performance
yeah i mean this is a property that we
start seeing
um in systems that actually
tend to be
so in machine learning traditionally
again going to benchmarks i mean if
you have some input outputs right like
that is
just a single input and a single output
you generally
um when you train these systems you see
reasonably smooth
curves when you analyze
how much the data
set size affects the performance or how
the model size affects the performance or
how long
you train the system for
affects the performance right so
you know if we think of imagenet like
the train curves look
fairly smooth and predictable in a way
um
and
i would say that's probably because of
the
it's kind of
a one hop
reasoning task right it's like here is
an input and you think for a few
milliseconds or 100 or 300 milliseconds as
a human and then you tell me yeah
there's
there's an alpaca in this image
so
in language
we are seeing benchmarks that require
more pondering and more
thought in a way right this is just kind
of you need to look for some
subtleties
that it involves
inputs that you might think of or if
even if the input is a sentence
describing a mathematical problem um
there is there is a bit more processing
required as a human and more
introspection so
i think
the
how these benchmarks work
means that there is actually a threshold
um
just going back to how transformers work
in this way of querying for the right
questions to get the right answers that
might mean that
performance becomes random
until the right question is asked by the
querying system of a transformer or of a
language model like a transformer and
then
only only then you might start seeing
performance going from random to
non-random
and
this is more empirical there's
no formalism or theory behind this yet
although it might be quite important but
we're seeing these phase transitions of
random performance until some let's
say scale of a model and then it goes
beyond that and it might be that
you need to fit
a few
low order bits of thought
before you can make progress on the
whole task and if you could measure
actually
those breakdown of the task maybe you
would see more smooth oh like yeah these
you know once once you get these and
these and these and this and these then
you start making progress in the task
but it's somehow
um a bit annoying because then
it means that certain
questions we might ask about
architectures
possibly cannot only be done at certain
scale and
one thing that
conversely i've seen great progress on
in the last couple years is this notion
of science of deep learning
and science of scale in particular right
so
on the negative is that there's some
benchmarks for which progress might need
to be measured at at minimum at a
certain scale until you see then what
details of the model matter to make that
performance better right so that's a bit
of a con but
what we've also seen is that you can
sort of empirically analyze
behavior of models at scales that are
smaller right so let's say to put an
example um we had this chinchilla paper
that revised the so-called scaling laws
of models and that whole study is done
at a reasonably small scale right maybe
hundreds of millions up to one billion
parameters and then the cool thing is
that you create some laws right some
trends right you
extract trends from data that you see
okay like it looks like the amount of
data required to train now a 10x larger
model would be this and these laws so
far these extrapolations have helped us
save compute and just get to a better
place in terms of the science of
how should we run these models at scale
how much data how much depth and all
sorts of questions we start asking
extrapolating from small scale but then
this emergence is sadly that not
everything can be extrapolated from
scale depending on the benchmark and
maybe the harder benchmarks are not so
good for extracting these laws but we
have a variety of benchmarks at least so
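The small-scale extrapolation described here can be sketched as a power-law fit in log-log space. Everything below is synthetic: the runs, the exponent, and the noise are made-up illustrations, not Chinchilla's actual numbers.

```python
import numpy as np

# assumed synthetic small-scale runs: loss follows a power law in params,
# loss = a * N^(-b) plus noise (a and b are chosen purely for illustration)
rng = np.random.default_rng(0)
params = np.logspace(8, 9, 6)                 # 100M .. 1B parameter runs
true_a, true_b = 400.0, 0.27
loss = true_a * params**(-true_b) * np.exp(rng.normal(0, 0.01, params.size))

# fit the trend in log-log space: log L = log a - b * log N
slope, intercept = np.polyfit(np.log(params), np.log(loss), 1)
a_hat, b_hat = np.exp(intercept), -slope

# extrapolate to a 10x larger model than anything we actually trained
predicted = a_hat * (10 * params[-1]) ** (-b_hat)
print(f"fitted exponent b = {b_hat:.3f}, predicted loss at 10x scale = {predicted:.3f}")
```

This is exactly the caveat raised above: the fit is an empirical trend, so the extrapolation is only trustworthy as long as the benchmark behaves smoothly in the extrapolated regime.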
i wonder
to which degree
the threshold the phase shift
scale is a function of the benchmark
some some of that some of the science
the scale might be
engineering benchmarks
where that threshold is low
sort of taking
a main benchmark
and reducing it somehow where the
essential difficulties are left but the
emergent the scale at which the
emergence happens is lower just for the
science aspect of it versus the actual
real world aspect yeah so luckily we
have quite a few benchmarks some of
which are simpler or maybe they're more
like i think people might call this
system 1 versus system 2 style um so
i think what we're not seeing luckily is
that
extrapolations from maybe slightly more
smooth or simpler benchmarks are
translating to the harder ones
but that is not to say that this
extrapolation will hit its limits and
when it does
then
how much we scale or how we scale will
sadly be a bit suboptimal until we find
better laws right um and these laws
again are very empirical laws they're
not like physical laws
although i wish
there would be better theory about these
things as well but so far i would say
empirical theory as i call it is way
ahead than actual theory of machine
learning
let me ask you
almost for fun so this is not oriol as
a deepmind person or anything to
do with deep mind or google just as a
human being and looking at these news of
a google engineer who claimed
uh
that
uh i guess the lamda language model was
sentient or
had the i still need to look into the
details of this
but
sort of
making an official report
and the claim that he believes there's
evidence
that this system is has achieved
sentience and i think
this is a really interesting case
on a human level and a psychological
level on a
technical machine learning level of how
language models transform our world and
also just philosophical level of the
role of ai systems
in um in a human world so
what did you what do you find
interesting
what's your take on all of this as
a machine learning engineer and a
researcher and also as a human being
yeah i mean a few reactions
um quite a few actually have you ever
briefly thought is this thing sentient
right so never absolutely like even with
like alpha star wait a minute what uh
sadly though i think yeah sadly i have
not um yeah i think the current
any of the current models although very
useful and very good um
yeah i think we're quite far from that
and there's kind of a converse
side story so one of my
passions is about science in general and
i think
i feel i'm a bit of like a failed
scientist that's why i came to machine
learning because you always feel and you
start seeing this that machine learning
is maybe
the science that can help other sciences
as we've seen right like you you know
it's such a powerful tool um so
thanks to that angle right that okay i
love science i mean i love
astronomy i love biology but i'm not an
expert and i decided well the thing i
can do better at is computers but
having especially when i was a bit
more involved in alphafold learning a
bit about proteins and about biology and
about
life
um
the complexity
it feels like it really is like i mean
if you start looking at
the things that are going on um
at you know at that atomic level um
and and also i mean there's there's
obviously that
we are maybe inclined to try to think of
neural networks as like the brain but
the complexities
and the amount of magic that it feels
when i mean i'm not an expert so
it naturally feels more magic but
looking at biological systems as opposed
to these
computational brains
just makes me like wow this there's such
level of complexity different still
right like orders of magnitude
complexity that um
sure these weights i mean we train them
and they do nice things but they're not
at the level
of biological
entities brains
cells
it just feels like it's just not
possible to achieve the same level of
complexity
behavior but my belief when i talk
to other beings is certainly shaped by
this amazement of biology that maybe
because i know too much i don't have
about machine learning but i certainly
feel it's very
far-fetched and far in the future to be
calling um
or to be thinking well this
mathematical function that is
differentiable is in fact
sentient and so on so there's something
on that point it's very interesting so
you know enough
about machines and enough about biology
to know that there's many orders of
magnitude of difference in
complexity but
you know how machine learning works
so the interesting question from human
beings that are interacting with the
system that don't know about the
underlying complexity
and i've seen people probably including
myself that have fallen in love with
things that are quite simple
yeah and so maybe the complexity
is one part of the picture but maybe
that's not a necessary condition for
sentience for um perception
uh or emulation of sentience right so i
mean i guess the other side of this is
that's how i feel personally i mean you
asked me about the person right um now
it's very interesting to see how other
humans feel about things right
we are like um again i'm
not as amazed about things that i feel
like this is not as magical as this
other thing because of maybe yeah how i
got to learn about it and how i see the
curve a bit more smooth because i you
know like just seen the progress of
language models since shannon in the 50s
and
actually looking at that time scale
it's not that fast progress right i
mean what we were thinking at
the time like almost 100 years ago
is not that dissimilar to what we're
doing now but at the same time yeah
obviously others my experience right
that the personal experience
i think no one should
tell others how they
should feel i mean the feelings are very
personal right so how others might feel
about the models and so on that's one
part of the story that is important to
understand for me personally as a
researcher and then
when i maybe disagree or i don't
understand or see that yeah maybe this
this is not something i think right now
is reasonable knowing all that i know
one of the other things and perhaps
partly why it's great to be talking to
you and reaching out to the world about
machine learning is hey
let's demystify a bit the
magic and try to see a bit more of the
math and the fact that literally to
create these models if we had the right
software it would be 10 lines of code um
and then just a dump of the internet so
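In the spirit of "ten lines of code plus a dump of the internet", here is a deliberately tiny sketch: a bigram counter stands in for the model and one sentence stands in for the internet. This is an illustration of the recipe's shape, not of any real training code.

```python
from collections import Counter, defaultdict

# the "internet dump" is one sentence; the "model" is a bigram counter
corpus = "the cat sat on the mat the dog sat on the rug".split()
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1          # "training": count next-word frequencies

def predict(word):
    # "inference": return the most frequent word seen after this one
    return counts[word].most_common(1)[0][0]

print(predict("sat"))   # "on" -- the only word ever seen after "sat"
```

Everything separating this toy from a real large model is scale and architecture, which is exactly the contrast being drawn with the complexity of biological systems.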
versus like then the complexity of like
the
creation of humans um from from their
inception right and also the complexity
of evolution of the whole universe to
where we are um that is feels orders of
magnitude more complex and fascinating
to me so i think
yeah maybe part of the only thing i'm
thinking about
trying to tell you is yeah i think
explaining a bit of the magic there is a
bit of magic it's good to be in love
obviously with what you do at work and
i'm certainly fascinated and surprised
quite quite often as well but i think
hopefully as
experts in biology hopefully will tell
me this is not as magic and i'm happy to
learn that um through
interactions with the larger community
we can
also have a certain level of education
that
in practice also will matter because i
mean one question is how you feel about
this but then the other very important
is
you starting to interact with this in
products and so on um it's good to
understand a bit what's going on what's
not going on and what's safe what's not
safe and so on right otherwise um the
technology will not be used properly for
good which is obviously the goal of all
of us i hope
So let me ask the next question. Do you think that, in order to solve intelligence, or to replace the Lex bot that does interviews, as we started this conversation with, the system needs to be sentient, needs to achieve something like consciousness? And do you think about what consciousness is in the human mind, and whether that could be instructive for creating AI systems?
Honestly, I think probably not, at least to reach the degree of intelligence where there's this brain that can learn, can be extremely useful, can challenge you, can teach you, can converse with you, and that you can teach to do things. I'm not sure consciousness is necessary, personally speaking. But if consciousness, or any other biological or evolutionary lesson, can be repurposed to influence our next set of algorithms, that is a great way to make progress, in the same way I tried to explain transformers a bit through how it feels we operate when we look at text. These insights are very important. There's a distinction between, on the one hand, the details of how the brain might be doing computation, where my understanding is, sure, there are neurons and some resemblance to neural networks, but we don't understand the brain in enough detail to replicate it; and, on the other, if you zoom out a bit, how our thought process works, how memory works, maybe even how evolution got us here, what exploration and exploitation are like, how these things happen. That clearly can inform algorithmic-level research, and I've seen examples of it being quite useful to guide the research, even if for the wrong reasons. So I think biology and what we know about ourselves can help a whole lot to build what we call AGI, this general, real Gato, hopefully the last step of the chain. But consciousness in particular, I don't myself think too hard about how to add it to the system. Then again, maybe my understanding of what it means is very personal; even that in itself is a long debate that I know people often have, and maybe I should learn more about it.
Yeah, and I personally notice the magic, often on a personal level, especially with physical systems like robots. I have a lot of legged robots now in Austin that I play with, and even when you program them, when they do things you didn't expect, there's an immediate anthropomorphization. You notice the magic and you start to think about things like sentience; that has to do more with effective communication and less with any of these dramatic notions. It seems like a useful part of communication. Having the perception of consciousness seems useful for us humans: we treat each other more seriously, we're able to do a nearest-neighbor shoving of that entity into our memory correctly, all that kind of stuff. It seems useful at least to fake it, even if you never make it. So maybe, yeah,
mirroring the question, and since you talk to a few people: do you think we'll need to figure something out there in order to achieve intelligence in a grander sense of the word?

Yeah, I personally believe yes, but I don't think it'll be a separate island we'll have to travel to; I think it will emerge quite naturally.

Okay, that's easier for us
then, thank you. But the reason I think it's important to think about is that, I believe, like with this Google engineer, you'll start seeing this a lot more, especially when you have AI systems that are actually interacting with human beings who don't have an engineering background, and we have to prepare for that. Because I do believe there will be a civil rights movement for robots, as silly as it is to say. There's going to be a large number of people who realize there are these intelligent entities with whom they have a deep relationship, whom they don't want to lose, who have come to be a part of their lives and mean a lot; they have a name, they have a story, they have a memory. And we'll start to ask questions about ourselves: well, this thing sure seems like it's capable of suffering, because it tells all these stories of suffering, and it doesn't want to die, and all those kinds of things. We'll have to start asking what the difference is between a human being and this thing. And then, from an engineering perspective, for DeepMind or anybody that builds systems, there might be laws in the future where you're not allowed to engineer systems with displays of sentience unless they're explicitly designed to be that, unless it's a pet. If you have a system that's just doing customer support, you're legally not allowed to display sentience. We'll start to ask ourselves that question, and that's going to be part of the software engineering process: which features do we include, and is communicating sentience one of them? It's important to start thinking about that stuff, especially given how much it captivates public attention.
Yeah, absolutely. It's definitely an important topic that we think about. And in a way, not every movie is equally on point, but science fiction in this sense has at least prepared society to start thinking about certain topics; even if it's too early to talk about them, as long as we're reasonable, it's certainly going to prepare us both for the research to come and for the many important challenges and topics that come with building an intelligent system, many of which you just mentioned. We're never going to be fully ready unless we talk about this, and unless we start, as I said, expanding the people we talk to beyond just our own researchers. In fact, at places like DeepMind, but also elsewhere, there are more interdisciplinary groups forming to start asking, and really working with us on, these questions. Obviously this is not initially what your passion is when you do your PhD, but it certainly is coming. It's fascinating; it brings me to one of my passions, which is learning, and in this sense this is a new area that, as a learning system myself, I want to keep exploring. It's great to see parts of the debate, and I've even seen a level of maturity in the conferences that deal with AI: if you compare five years ago to now, the number of workshops and so on has changed so much. It's impressive to see how often topics of safety, ethics, and so on come to the surface, which is great. And if we're too early, that's fine; it's a big field, and there are lots of people with lots of interest who will make progress. Obviously I don't believe we're too late, so in that sense I think it's great that we're doing this already. It's better to be too early than too late when it comes to superintelligent
systems.

Let me ask, speaking of sentient AIs: you gave props to your friend Ilya Sutskever for being elected a Fellow of the Royal Society, so, just as a shout-out to a fellow researcher and friend: what's the secret to the genius of Ilya Sutskever? And also, do you believe that his tweets, as you hypothesized, and Andrej Karpathy did as well, are generated by a language model?

Yeah,
so, Ilya is going to visit in a few weeks, actually, so I'll ask him in person.

But will he tell you the truth?

Yes, of course, absolutely. Ultimately, we all share paths, and there are friendships that go beyond institutions, so I hope he tells me the truth.

Well, maybe the AI system is holding him hostage somehow; maybe it has some videos he doesn't want released, so it has taken control over him.

Well, if I see him in person, then I will know. Yeah.
But having known Ilya's personality for a while: everyone on Twitter, I guess, gets a different persona, and Ilya's does not surprise me. Knowing Ilya from before social media, and before AI was so prevalent, I recognize a lot of his character, so that's something I feel good about: a friend who hasn't changed, who is still true to himself. Obviously, there is the fact that your field becomes more popular, and he is one of its main figures, having driven a lot of its advances, so I think the tricky bit here is how to balance your true self with the responsibility your words carry. In that sense, I appreciate the style and I understand it, but it has created debates around some of his tweets; maybe it's good we have those debates early, anyway. The reactions are usually polarizing, though; I think we're just seeing the reality of social media reflected on that particular set of topics he's tweeting about.

Yeah, I
mean, it's funny that you speak to this tension. He was one of the early seminal figures in the field of deep learning, and so there's a responsibility with that, but he's also, from having interacted with him quite a bit, just a brilliant thinker about ideas, as are you, and there's a tension between becoming the manager versus actually thinking through very novel ideas: the scientist versus the manager. And he's one of the great scientists of our time, which is quite interesting. People also tell me he's quite silly, which I haven't quite detected yet; in private, we'll have to see about that. Yeah.
Yeah, just on that point: Ilya has been an inspiration. Of the colleagues who I think shaped the person I am, Ilya certainly gets the top spot, or close to it. And if we go back to the question about whether individual people changed the field or not, I think Ilya's case is interesting, because he really has a deep belief in the scaling up of neural networks. There was a talk, still famous to this day, from the sequence-to-sequence paper, where he was claiming: just give me supervised data and a large neural network, and you'll solve basically all the problems. That vision was already there many years ago, so it's good to see someone who is so deeply into this style of research and who clearly has had a tremendous track record of successes. The funny bit about that talk is that we rehearsed it in a hotel room beforehand, and the original version would have been even more controversial, so maybe I'm the only person who has seen the unfiltered version. Maybe when the time comes we should revisit some of the skipped slides from Ilya's talk. But I really think deep belief in a certain style of research pays off. It's good to be practical sometimes, and I actually think Ilya and I are practical, but it's also good to have some sort of long-term belief and trajectory. Obviously there's a bit of luck involved, but if that turns out to be the right path, then you're clearly ahead and hugely influential in the field, as he has been.
Do you agree with the intuition, written about by Rich Sutton in "The Bitter Lesson," that the biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective? Do you think that intuition is ultimately correct: general methods that leverage computation, allowing the scaling of computation to do a lot of the work, so that the basic task of us humans is to design methods that are more and more general, rather than more and more specific to the tasks at hand?
I certainly think this mimics a bit of the deep learning research philosophy: on the one hand, we want to be data agnostic, we don't want to pre-process datasets, we want to see the bytes, the true data as it is, and then learn everything on top. So I very much agree with that, and I think scaling up feels, at the very least, necessary for building incredibly complex systems; it's possibly not sufficient, barring a couple of breakthroughs we still need. Rich Sutton mentioned search as part of the equation, scale and search, and search has been more mixed in my experience. From that lesson in particular, search is the trickier bit, because it is very appealing to search in domains like Go, where you have a clear reward function that lets you discard some search traces, but in some other tasks it's not very clear how you would do that. Although, recently, one of our works, which was in many ways a continuation, with a team intersecting heavily with AlphaStar, was AlphaCode, in which we actually saw the bitter lesson play out: scaling the models, plus a massive amount of search, yielded this very interesting result of human-level performance in code competitions. So I've seen examples of it mapping literally to search and scale. I'm not so convinced about the search bit, but I'm certainly convinced scale will be needed. So we need general methods, we need to test them, and maybe we need to make sure we can scale them given the hardware we have in practice. But then maybe we should also shape what the hardware looks like based on which methods might need to scale, and that's an interesting contrast with the GPU story: we got GPUs almost for free, because games were using them, but if, say, sparsity is required next, we don't have the hardware for it, although many people are building different kinds of hardware these days. So there's a bit of a hardware lottery for scale that might have an impact, at least on the timescale of years, on how fast we make progress toward a version of neural nets, or whatever comes next, that might enable truly intelligent agents.
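The "scale plus search" recipe described above for AlphaCode can be caricatured in a few lines: sample a large number of candidate programs from a generative model, then use search in the sense of filtering those samples against the problem's example tests, keeping only candidates that pass. Everything here is a hypothetical stand-in — `fake_model_sample` is a fake "model" that draws from three hard-coded candidate functions rather than a real code model:

```python
import random

def fake_model_sample(rng):
    """Stand-in for sampling a candidate program from a large code model.
    Each 'program' is just a candidate function for doubling a number."""
    return rng.choice([
        lambda x: x + x,   # correct
        lambda x: x * x,   # wrong (coincides with doubling only at x=2)
        lambda x: x + 2,   # wrong
    ])

def search_by_filtering(num_samples, example_tests, seed=0):
    """AlphaCode-style 'search': draw many samples, keep those passing the tests."""
    rng = random.Random(seed)
    survivors = []
    for _ in range(num_samples):
        candidate = fake_model_sample(rng)
        if all(candidate(x) == y for x, y in example_tests):
            survivors.append(candidate)
    return survivors

tests = [(2, 4), (3, 6)]          # example input/output pairs for "double x"
good = search_by_filtering(1000, tests)
print(len(good) > 0)              # with enough samples, correct programs survive
```

The point of the caricature is the division of labor: scale makes a correct program likely to appear somewhere among the samples, and the filtering search is what finds it.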
Do you think in your lifetime we will build an AGI system that would undeniably be a thing that achieves human-level intelligence, and goes far beyond it?
I definitely think it's possible that it will go far beyond, but I'm definitely convinced it will reach human-level intelligence. I'm only hypothesizing about the beyond because the "beyond" bit is tricky to define, especially when we look at the current formula of starting from an imitation learning standpoint. We can certainly imitate humans at language and beyond, so getting to human level through imitation feels very possible. Going beyond will require reinforcement learning and other things, and in some areas that has certainly already paid off; Go is my favorite example so far of going beyond human capabilities. But in general, I'm not sure we can define reward functions that take us from a seed of imitated, general, human-level intelligence to something beyond it. That bit is not so clear in my lifetime, but certainly human level, yes, and that in itself is already quite powerful. Going beyond, obviously we're going to try, and if we then get to superhuman discovery that advances the world, great; but even human level in general is very, very powerful.

Well, especially if human level, or slightly beyond, is integrated deeply with human society, and there are billions of agents like that.
Do you think there's a singularity moment beyond which our world will be deeply transformed by these kinds of systems? Because now you're talking about intelligent systems; this is no longer just going from horse and buggy to the car. It feels like a very different kind of shift in what it means to be a living entity on Earth. Are you afraid, are you excited, about this world?

I'm afraid that if there are a lot more of them, and we truly get there, we'll need to think about limited resources. Humanity clearly hits some limits, and there's some balance that, hopefully, the planet is biologically imposing, and we should actually try to get better at this; as we know, there are quite a few issues with too many people coexisting in a resource-limited way. For digital entities it's an interesting question. Maybe such a limit should exist, but maybe it's going to be imposed by energy availability, because these systems also consume energy; in fact, most of them are less energy-efficient than we are. But definitely, I think as a society we'll need to work together to find what would be reasonable in terms of growth, or how we coexist, if that is to happen.
I am very excited about the aspects of automation that give people who obviously don't have access to certain resources or knowledge that access. Those are the applications I'm most excited to see, and to personally work towards.

Yeah, there are going to be significant improvements in productivity and in quality of life across the whole population, which is very interesting, but I'm looking even further beyond: us becoming a multi-planetary species. And, just as a quick bet, last question: do you think, as humans become a multi-planetary species and go outside our solar system, all that kind of stuff, there will be more humans or more robots in that future world? Will humans be the quirky intelligent being of the past, or is there something deeply fundamental to human intelligence that's truly special, such that we will be part of those other planets, and not just AI systems?
I think we're all excited to build AGI to empower us, to make us more powerful as a human species. That's not to say there might not be some hybridization; this is obviously speculation, but there are companies trying, the same way medicine is making us better, and maybe there are other things yet to happen on that front. But if the ratio were not at most one to one, I would not be happy. I would hope that we are part of the equation; maybe a one-to-one ratio feels possible, and constructive, but it would not be good to have an imbalance, at least from my core beliefs, and from why I do what I do when I go to work and research what I research.
Well, this is how I know you're human, and this is how you've passed the Turing test, and you are one of the special humans. It's a huge honor that you have talked with me, and I hope we get the chance to speak again, maybe once before the singularity and once after, and see how our view of the world changes. Thank you again for talking today, and thank you for the amazing work you do. You're a shining example of a researcher and a human being in this community.

Thanks a lot, Lex. Looking forward to before the singularity, certainly, and maybe after.

Thanks for listening to this conversation with Oriol Vinyals. To support this podcast, please check out our sponsors in the description. And now, let me leave you with some words from Alan Turing: those who can imagine anything can create the impossible. Thank you for listening, and hope to see you next time.