MIT AGI: Building machines that see, learn, and think like people (Josh Tenenbaum)

7ROelYvo8f0 • 2018-02-08

Kind: captions
Language: en
today we have Josh Tenenbaum he's a
professor here at MIT leading the
computational cognitive science group
among many other topics in cognition
and intelligence he is fascinated with
the question of how human beings learn
so much from so little and how these
insights can lead us to build AI systems
that are much more efficient at learning
from data so please give Josh a warm
welcome all right thank you very much
thanks for having me excited to be part
of what looks like really quite a very
impressive lineup especially starting
after today and it's I think quite a
great opportunity to get to see
perspectives on artificial intelligence
from many of the leaders in industry and
other entities working on this this
great quest so I'm going to talk to you
about some of the work that we do in our
group but also I'm gonna try to give a
broader perspective reflective of a
number of MIT faculty especially those
who are affiliated with the Center for
brains minds and machines so you can see
up there on my affiliation academically
I'm part of brain and cognitive sciences
or course nine I'm also part of CSAIL
but I'm also part of the Center for
brains minds and machines which is an
NSF funded Center Science and Technology
Center which really stands for the
bridge between the science and the
engineering of intelligence
it literally straddles Vassar Street in
that we have CSAIL and BCS members we
also have partners at Harvard and other
academic institutions and again what we
stand for I want to try to convey some
of the specific things we're doing in
the center and where we want to go with
a vision that really is about jointly
pursuing the science the basic science
of how intelligence arises in the human
mind and brain and also the engineering
enterprise of how to build something
increasingly like human intelligence in
machines and we deeply believe that
these two projects have something to do
with each other and are best pursued
jointly now it's a really exciting time to
be doing anything related to
intelligence or certainly to AI for all
the reasons that you know brought you
all here I don't have to tell you this
we have all these ways in which AI is
kind of finally here we finally live in
the era of something like real practical
AI
or for those who've been around for a
while and have seen some of the rises
and falls you know AI is back in a big
way but from my perspective and I think
maybe this reflects you know why we
distinguish what we might call AGI from
AI we don't really have any real AI
basically we have what I like to call AI
technologies which are systems that do
things we used to think that only humans
could do and now we have machines that
do them often quite well maybe even
better than any human who's ever lived
right like a machine that plays go but
none of these systems I would say are
truly intelligent none of them have
anything like common sense none of them
have anything like the flexible
general-purpose intelligence that each
of you might use to learn every one of
these skills or tasks right each of
these systems had to be built by large
teams of engineers working together
often for a number of years often at
great cost to somebody who's willing to
pay for it and each of them just does
one thing so AlphaGo might beat the
world's best but it can't drive to the
match or even tell you what
go is it can't even tell you that go is a
game because it doesn't even know what a
game is right so what's missing
what is it that makes every one of your
brains maybe you can't beat you know the
world's best at go but any one of
you can get behind the wheel of a car I
think of this because my daughter is
gonna turn 16 tomorrow if she lived in
California she'd have a driver's license
it's a little bit down the line for us
here in Massachusetts but you know she
didn't have to be specially engineered
by billion dollar startups and you know
she got really into chess recently and
now she's taught herself chess by
playing just you know a handful of games
basically I mean she can do any one of
these activities and any one of us can
so what is it what makes up
the difference well there's many things
right I'll talk about the the focus for
us and our research and a lot of us
again in CBMM is summarized here
what drives the successes right now in
AI especially in industry okay and all
these AI technologies is many many
things but where
the progress has been made most recently
and what's getting most of the attention
is of course deep learning but other
kinds of machine learning technologies
which essentially represent the
maturation of a decades-long
effort to solve the problem of pattern
recognition that means taking data and
finding patterns in the data that tell
you something you care about like how to
label a class or how to predict some
other signal okay
and pattern recognition is great it's an
important part of intelligence and it's
reasonable to say that deep learning as a
technology has really made great strides
on pattern recognition and maybe even
you know has come close to solving the
problems of pattern recognition but
intelligence is about many other things
intelligence is about a lot more in
particular it's about modeling the world
and think about all the activities that
a human does to model the world that
go beyond just say recognizing
patterns in data but actually trying to
explain and understand what we see for
instance okay or to be able to imagine
things that we've never seen
maybe even very different from
anything we've ever seen but might want
to see and then to set those as
goals to make plans and solve problems
needed to make those things real or
thinking about learning again the you
know some kinds of learning can be
thought of as pattern recognition if
you're learning sufficient statistics or
weights in a neural net that are used
for those purposes but many activities
of learning are about building out new
models right either refining reusing
improving old models or actually
building fundamentally new models as
you experience more of the world and
then think about sharing our models
communicating our models to others
modeling their models learning from them
all these activities of modeling these
are at the heart of human intelligence
and it requires a much broader set of
tools so I want to talk about the ways
we're studying these activities of
modeling the world and say something in a
pretty non-technical way about what are
the kinds of tools that allow us to
capture these abilities now
I want to be very honest up front and to
say this is just the beginning of a
story right when you look at deep
learning successes that itself is a
story that goes back decades I'll say a
little bit about that history in a
minute but where we are now is just
looking forward to a future when we
might be able to capture these abilities
you know at a really mature engineering
scale and I would say we are far from
being able to capture all the ways
in which humans richly flexibly quickly
build models of the world at the kind of
scale that say Silicon Valley wants
either big tech companies like Google or
Microsoft or IBM or Facebook or small
startups right we can get there and I
think what what I want to talk to you
about here is one route for trying to
get there and this is the route that
CBMM stands for the idea that by reverse
engineering how intelligence works in
the human mind and brain that will give
us a route to engineering these
abilities in machines when we say
reverse engineering we're talking about
science but doing science like engineers
this is our fundamental principle that
if we approach cognitive science and
neuroscience like an engineer so
the output of our science isn't just a
description of the brain or the mind in
words but in the same terms that an
engineer would use to build an
intelligent system then that will be
both the basis for a much more rigorous
and deeply insightful science and also
for direct translation of those insights
into engineering applications
now I said before I'd talk a little about
history what I mean by that is this
again if if part of what brought you
here is deep learning and I know even if
you've never heard of deep learning
before which I'm sure is unlikely you
saw some you know a good spectrum of
that in the in the overview session last
night okay it's really interesting and
important to look back on the history of
where did techniques for deep learning
come from or reinforcement learning
those are the two tools in the in the
current machine learning arsenal that
are getting the most attention things
like backpropagation or end-to-end
stochastic gradient descent or temporal
difference learning or Q-learning
here's a few papers from the literature
you know maybe some of you have read
these original papers here's the
original paper by Rumelhart Hinton and
colleagues in which they introduced the
backpropagation algorithm for training
multi-layer perceptrons right
multi-layer neural networks here's the
original perceptron paper by Rosenblatt
which introduced the one layer version
of that architecture and the basic
perceptron learning algorithm here's the
first paper on sort of the temporal
difference learning method for
reinforcement learning from Sutton and
Barto here's the original Boltzmann
machine paper also by Hinton and
colleagues which you know again for
those who don't know that architecture
is a kind of probabilistic
undirected multi-layer perceptron or for
example before there were LSTMs if you
know about current recurrent neural
network architectures earlier much
simpler versions of the same idea were
proposed by Jeff Elman in his simple
recurrent networks the reason I want to
put up the original papers here is
for you to look at both when they were
published and where they were published
so if you look at the dates you'll see
papers going back to you know the the
80s but even the 60s or even the 1950s
and look at where they were published
most of them were published in
psychology journals so the journal
Psychological Review if you don't know
it is like the leading journal of
theoretical psychology and mathematical
psychology okay or Cognitive Science the
journal of the Cognitive Science Society
or the backprop paper was published
in Nature which is a general interest
science journal but by people who are
mostly affiliated with an Institute for
cognitive science in San Diego so what
you see here is already a long history
of scientists thinking like engineers
these are people who are in psychology
or cognitive science departments and
publishing in those places but by
formalizing even very basic insights
about how humans might learn or how you
know brains might learn in the right
kind of math that led to of course
progress on the science side but it led
to all the engineering that we see now
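
As one concrete instance of the "basic math" in those early papers, here is a minimal sketch of the classic single-layer perceptron learning rule. The toy data (logical AND), the learning rate, and the epoch limit are illustrative assumptions and are not taken from the talk or from Rosenblatt's paper.

```python
# Minimal sketch of the perceptron learning rule for a single-layer unit.
# The data set, learning rate, and epoch limit are illustrative assumptions.

def predict(weights, bias, x):
    """Return 1 if the weighted sum exceeds the threshold at zero, else 0."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, lr=1.0, epochs=20):
    """Mistake-driven training: weights change only on misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)  # -1, 0, or +1
            if error != 0:
                mistakes += 1
                # Move the decision boundary toward the misclassified sample.
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        if mistakes == 0:  # a full pass with no errors: stop early
            break
    return weights, bias


# Toy linearly separable problem: logical AND of two binary inputs.
AND_INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_LABELS = [0, 0, 0, 1]

weights, bias = train_perceptron(AND_INPUTS, AND_LABELS)
and_predictions = [predict(weights, bias, x) for x in AND_INPUTS]
print(and_predictions)
```

The rule only converges when the classes are linearly separable, which is one reason the multi-layer version trained by backpropagation mattered.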
it wasn't sufficient right we needed we
needed of course lots of innovations and
advances in computing hardware and
software systems right but this is where
the basic the basic math came from and
it came from doing science like an
engineer so what I want to talk about in
our vision is what does the future of this
look like if we were to look 50 years
into the future what would we be looking
back on now or you know over this time
scale well here's a
long-term research roadmap that reflects
some of my ambitions and some of our
center's goals and many others too right
we'd like to be able to address basic
questions fundamental questions of what
it is to be and to think like a human
questions for example of consciousness
or meaning in language or real learning
right questions like you know even
beyond the individual like questions of
culture or creativity so our big ideas
up there and for each of these there are
basic scientific questions right how do
we become aware of the world and
ourselves in it it starts with perception
but it really turns into awareness
awareness of yourself and of the world
and what we might call consciousness
right or how does a word start to have a
meaning what really is a meaning and how
does a child grasp it or how do
children actually learn what do babies'
brains actually start with are they
blank slates or do they start with some
kind of cognitive structure and then
what does real learning look like these
are just some of the questions that
we're interested in working on
or when we talk about culture we mean
how do you learn all the things you
didn't directly experience right but
that somehow you got from the
accumulation of knowledge in society
over many generations or how do you ever
think of new ideas or answers to new
questions how do you think of the new
questions themselves how do you decide
what to think about these are all key
activities of human intelligence when we
talk about how we model the world where
our models come from what we do with our
models this is what we're talking about
and if we could get machines that could
do these things well again on the bottom
row think of all the actual real
engineering payoffs now in our Center in
both my own activities and a lot of what
my group does these days and what a
number of other colleagues in the Center
for brains minds and machines do as well
as you know more broadly people
in BCS and CSAIL one place where we work
on the beginnings of these problems in
the near term this is the long term like
think 50 years okay maybe shorter maybe
longer I don't know but think well
beyond well beyond 10 years but in the
short term 5 to 10 years a lot of our
focus is around visual intelligence and
there's many reasons for that again we
can build on the successes of deep
networks and a lot of pattern
recognition and machine vision it's a
good way to put these ideas into
practice when we when we look at the
actual brain the visual system in the
brain in the human and other mammalian
brains for example is really very
clearly the best understood part of the
brain and at a circuit level it's the
part of the brain that's most inspired
current deep learning and neural network
systems but even there there's things
which we still don't really understand
like engineers so here's an example of a
basic problem in visual intelligence
that we and others in the centre are
trying to solve look around you and you
feel like there's a whole world around
you and there is a whole world around
you and you feel like your brain captures it but
the actual sense data that's
coming in through your eyes looks more
like this photograph here where you can
see there's a crowd scene but it's
mostly blurry except for a small region
of high resolution in the center so that
corresponds biologically to what part of
the image is in your fovea that's the
central region of cells in the retina
where you have really high-resolution
visual data the size of your fovea is
roughly like if you hold out your thumb
at arm's length it's a little bit bigger
than that but not much bigger right
most of the image in terms of the actual
information coming in in a bottom-up
sense to your brain is really quite
blurry
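
To put a rough number on the thumb-at-arm's-length comparison, the sketch below computes the visual angle an object subtends with the standard formula 2 * atan(size / (2 * distance)). The thumb width and arm length are assumed round figures chosen for illustration, not measurements from the talk.

```python
import math


def visual_angle_deg(size, distance):
    """Visual angle in degrees subtended by an object of width `size`
    viewed from `distance` (both in the same units)."""
    return math.degrees(2 * math.atan(size / (2 * distance)))


# Assumed, illustrative values (not from the talk).
THUMB_WIDTH_CM = 2.0
ARM_LENGTH_CM = 60.0

thumb_angle = visual_angle_deg(THUMB_WIDTH_CM, ARM_LENGTH_CM)
print(f"thumb at arm's length subtends about {thumb_angle:.1f} degrees")
```

Under these assumptions the thumb covers roughly two degrees, a small patch of the visual field, which is why only a few such glimpses are sharp at any moment.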
but somehow by looking at just one part
and then by saccading around or making a
few eye movements you get a few glimpses
each not much bigger than the size of
your thumb at arm's length
somehow you stitch that information
together into what feels like and really
is a rich representation of the whole
world around you and when I say around
you I mean literally around you so
here's another kind of demonstration um
without turning around nobody's allowed
to turn around ask yourself what's
behind you now the answer is going to be
different for different people depending
on where you're sitting right for most
of you you might think well there's I
think there's a person pretty close
behind me all right you know you're in a
crowded auditorium although you haven't
seen that person you know that they're
there right for people in the very back
row you know there isn't a person behind
you and you're conscious of being in the
back row right you might be conscious
that there's a wall right behind you but
now for the people who are in the room
not in the very back think about how far
behind you is the back like where's the
nearest wall behind you so we can get
maybe we can call out try a little
demonstration so I don't know I'm
pointing to someone there can you see
say something if you think I'm
pointing at you well I could have been
pointing at you but I'm pointing at someone
behind you okay I'll point to you yeah
I'm pointing to you all right
so how far is the nearest wall no you
can't turn around you've blown your
chance right without turning around okay
so you you were laughs okay do you see
I'm pointing to you there with the tie
okay so without turning around how far
is the nearest wall behind you that's
sorry how far five meters okay well I
mean that might be about right now other
people can turn around how about you how
far is the nearest wall behind you
ten meters okay that might be right yeah
how about here
how what do you think twenty okay see
yeah since I didn't grow up in the
metric system I barely know but yeah I
mean I mean the point is that like
each of you is
surely not exactly right but you're
certainly within an order of magnitude
and I guess if we actually tried to
measure you know you're probably my
guess is you're probably right within
you know fifty percent or less often you
know maybe just twenty percent error
okay so how do you know this I mean even
if it's not what did you say twenty
meters even if it's not twenty meters
it's probably closer to 20 meters than
it is to 5 or 10 meters and than it is
to 50 meters so how do you know this you
haven't turned around in a while right
but some part of your brain is tracking
the whole world around you right and how
many people are behind you yeah like a
few hundred right I mean I don't know if
it's 200 or 300 or but it's not a
thousand I mean I don't think so and
it's certainly not ten or 20 or 50 right
so you track these things and you use
them to plan your actions
okay so again think about how instantly
effortlessly and very reliably okay your
brain computes all these things so the
people and objects around you and it's
not just you know approximations
certainly when we're talking about
what's what's behind you in space
there's a lot of imprecision but when it
comes to reaching for things right in
front of you
very precise shape and physical property
estimates are needed to pick up and
manipulate objects and then when it
comes to people it's not just the
existence of the people but something
about what's in their head right you
track whether someone's paying attention
to you when you're talking to them what
they might want from you what they might
be thinking about you what they might be
thinking about other people okay so when
we talk about visual intelligence this
is the whole stuff we're talking about
and you can start to see how it turns
into basic questions I think of
what we might call the beginnings of
consciousness or at least our awareness of
ourselves in the world and of ourselves as
a self in the world but also other
aspects of higher-level intelligence and
cognition that are not just about
perception like symbols right to
describe even to ourselves what's around
us and where we are and what we can do
with it
you have to go beyond just what we would
normally call the stuff of perception to
say the thoughts in somebody's head and
your own thoughts about that okay so
what we've been doing in CBMM is trying
to develop an architecture for visual
intelligence and I'm not going to go
into any of the details of how this
works and this is just notional this is
just a picture it's like a just a sketch
from a grant proposal of what we say we
want to do but it's based on a lot of
scientific understanding of how the
brain works there are different parts of
the brain that correspond to these
different modules in our architecture as
well as some kind of emerging
engineering way to try to capture at the
software and maybe even hardware levels
how these modules might work so we talk
about a sort of an early module of a
visual or perceptual stream which is
like bottom-up visual or other
perceptual input that's the kind of
thing that is pretty close to what we
currently have in say deep
convolutional neural networks but then
we talk about some kind of the output of
that isn't just pattern class labels but
what we call the cognitive core core
cognition so we get an understanding of
space and objects their physics
other people their minds that's the real
stuff of cognition that has to be the
output of perception but somehow
this is what
we call the brain OS in this picture we
have to get there by stitching together
the bottom-up inputs from a glimpse here a
glimpse here a little bit here and there
and accessing prior knowledge that comes
from our memory systems to tell us how
to stitch these things together into the
really core cognitive representations of
what's out there in the world and then
if we're going to start to talk about it
in language or to build plans on top of
what we have seen and understood that's
where we talk about symbols coming into
the picture ok the building blocks of
language and plans and so on so now we
might say well ok this is an
architecture that is brain inspired and
cognitively inspired and we're
planning to turn it into real engineering
and you can say well do we need that
maybe you know again I know this is a
question you considered in the first
lecture
maybe the engineering toolkit that's
currently been making a lot of progress
in let's say industry maybe that's good
enough maybe you know let's take deep
learning to stand for a broader set
of modern pattern recognition based and
reinforcement learning based tools and
say ok well maybe that can scale up to
this and you know maybe
that's possible I'm happy in the
question period if people want to debate
this my sense is no I think that it's
not when I say no I don't mean like it
can't happen or it won't happen what I
mean is the highest
expected value route right now is to take this
more science-based reverse engineering
approach and that at least if you
follow the current trajectory that
industry incentives especially optimize
for it's not even really trying to take
us to these things so think about for
example a case study of visual
intelligence that is in some ways as
pattern recognition very much of a
success it's again been mostly driven by
industry it's something that if you read
in the
news or even play around with in certain
publicly available datasets feels
like we've made great progress and this
is an aspect of visual intelligence
which is sometimes called image
captioning it's basically mapping images
to text you know basically there's been
a bunch of systems here's a couple of
press releases I guess this one's about
Google Google's AI can now caption
images almost as well as humans
here's one about Microsoft a couple of
years ago I think there were something
like eight papers all released onto
arXiv around the same time from
basically all the major industry
computer vision groups as well as a
couple of academic partners okay which
were all driven by basically the same data
set produced by some Microsoft
researchers and other collaborators
trained a combination of deep
convolutional neural networks you know
state of the art visual pattern
recognition with recurrent neural
networks which had recently been
developed for you know basically kinds
of neural statistical language modeling
glued them together and produced a
system which made very
impressive results with a big training set
and a held-out test set where the goal
was to take an image and write a
sentence like a short sentence caption
that that would seem like the kind of
way a human would describe that image
and these systems you know surpassed
human level accuracy on the held-out
test set from a big training set but
what you can see when you really dig
into these things is there's often a lot
of what I would call data set
overfitting it's not overfitting to the
training set but it's overfitting to
whatever are the particular
characteristics of this data set you
know wherever it came from a certain set
of photographs and certain ways of
captioning them okay with even a big
data set it's not about quantity it's
more about the quality the nature of
what people are doing all right so one
way to test this system is to apply it
to what seems like basically the same
problem but not within a certain
curated or built data set and there's a
convenient Twitter bot that lets you do
this so there's something called
picdescbot which takes one of the state of
the art industry AI captioning systems a
very good one again this is not meant to
I'm not trying to critique these systems
for what they're trying to do I'm just
trying to point out what they don't
really even try to do so this takes the
Microsoft CaptionBot and just every
couple of hours takes a random image
from the web captions it and uploads
the results to Twitter and a couple of
months ago when I prepared a first
version of this talk I just took a few
days in the life of this Twitter bot I
didn't take every single image but I
took you know most of the images in a
way that was meant to be representative
of the successes and the kinds of
failures that such a system will make so
we can go through this and it's a little
bit entertaining and I think quite
informative so here's just a somewhat
random sample of a few days in the life
of one of these caption bots so here we
have a picture of a person holding
my screen is very small here and
I can't read up there so maybe you'll
have to tell me what's that a person
holding a cell phone I guess I'll just
read along with you so have a person
holding a cell phone well it's not a
person holding a cell phone but it's
kind of close it's a person holding some
kind of machine so I don't even know
what that is but it's some kind of
musical instrument right
so that's a mixed success or failure
here's a pretty good one a group of
people on a field playing football
I would call that you know an A
result maybe even A+ here's a group of
people standing on top of a mountain
so less good there's a mountain but as
far as I can tell there's no people but
these systems like to see people because
of the combination that in the
data set they were trained on there's a
lot of people and people often talk
about people okay I mean and the fact
that you can appreciate both what I said
and why it's funny that's you doing
some kind of cognitive activities that
this system is not even trying to do
okay here we've got a building with the
cake I'll go through these fast building
with the cake a large stone building
with the clock tower I think that's
pretty good I'd give that like a b-plus
there's no clock but it's plausibly
right there might be a clock in there
there's definitely something like that
here's a truck parked on the side of a
building I don't know maybe a b-minus
there there is a car on the side of a
building but it's not a truck and it's
and it doesn't seem like the
main thing in the image okay
here's a necklace made of bananas here's
a large ship in the water this is pretty
good I give this like an a-minus or
b-plus because there is a ship in the
water but it's not very large it's
really more of like a tugboat or
something here's a sign sitting on the
grass you know in some sense that's
great but in another sense
it's really missing what's actually
interesting and important and meaningful
to humans
here's a garden is in the dirt a pizza
sitting on top of the building a small
house with the red brick building that's
pretty good although a kind of weird way
of saying it a vintage photo of a pond
that's good they like vintage photos a
group of people that are standing in the
grass near a bridge again there's two
people and there's some grass and
there's a bridge but it's really not
what's going on a person in the yard
okay kind of a group of people standing
on top of the boat there's a boat
there's a group of people they're
standing but again the
sentence that you see is more based
on a bias of what people have said in
the past about images that are only
vaguely like this a clock tower is
lit up at night that's really I think
pretty impressive a large clock mounted
to the side of the building a little bit
less so a snow-covered field very good a
building with snow on the ground a
little bit less good there's no snow
white some people who I don't know them
but I bet that's probably right because
identifying faces and recognizing
people who are famous because they won
you know medals in the Olympics
probably I would trust current pattern
recognition systems to get that a
painting of a vase in front of a mirror
less good also a famous person there but
we didn't get him a person walking in
the rain again there is sort of a person
and there's some puddles but not you
know a group of stuffed animals a car
parked in a parking lot that's good a
car parked in front of a building less
good a plate with a fork and knife a
clear blue sky okay so you get the idea
again like if you actually go and play
with the system partly because I think
my friends at Microsoft told me
they've improved it some you know this
is partly for entertainment value you
know I chose what also would be the
funnier examples so I want to
be quite honest about it and
I'm not trying to take away from what are
impressive AI technologies but I think
it's clear that there's a sense of
understanding any one of these images
that it's important to see that even
when it seems to be correct right if it
can make the kind of errors that it
makes that even when it seems to be
correct it's probably not doing what
you're doing and it's probably not even
trying to scale towards the dimensions
of intelligence that we think about when
we're talking about human intelligence
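
The "data set overfitting" point above can be caricatured with a deliberately crude baseline: a model that only memorizes the most common label in a curated data set looks fine on a held-out split of that same data set and degrades on differently distributed images. This is not how the captioning systems in the talk work; every label and count below is made up purely for illustration.

```python
from collections import Counter


def fit_majority_label(labels):
    """'Train' by memorizing the most frequent label in the training set."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predicted_label, labels):
    """Fraction of examples whose true label matches the constant prediction."""
    return sum(1 for y in labels if y == predicted_label) / len(labels)


# Hypothetical labels: whether an image's main subject involves people.
curated_train = ["people"] * 80 + ["no_people"] * 20
curated_test = ["people"] * 8 + ["no_people"] * 2   # same source, held out
web_sample = ["people"] * 3 + ["no_people"] * 7     # different source

learned_prior = fit_majority_label(curated_train)
in_distribution = accuracy(learned_prior, curated_test)
out_of_distribution = accuracy(learned_prior, web_sample)

print(learned_prior, in_distribution, out_of_distribution)
```

The held-out score measures fit to the data set's habits rather than understanding of any single image, which is the gap the random web images expose.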
okay another way to put this I'm going
to show you a really insightful blog
post from one of your other speakers so
in a couple of days I'm not sure you're
going to have Andrej
Karpathy who's one of the leading people
in deep learning this is a really great
blog post he wrote a couple of years ago
when he was I think still at Stanford he
got his PhD from Stanford he did he
worked at Google a little bit on some
early big neural net AI projects there
he was at OpenAI he was one of the
founders of OpenAI and recently he
joined Tesla as their director of AI
research but about five years ago he was
looking at the state of computer vision
from a human intelligence point of view
and and lamenting how far away we were
okay so this is the title of his blog
post the state of computer vision
and AI we are really really far away
and he took this image which was a sort
of a famous image in its own right it
was a popular image of Obama back when
he was president kind of playing around
as he liked to do when he was on tour so
if you take a look at this you can see
you probably all can recognize the
previous President of the United States
but you can also get the sense of where
he is and what's going on and you might
see people smiling and you might get the
sense that he's playing a joke on
someone can you see that right so how do
you know that he's playing a joke and
what that joke is well as Andrej goes on
to talk about in his blog post too if
you think about all the things that that
you have to really deploy in your mind
to understand that it's a huge list of
course it starts with seeing people and
objects and maybe doing some face
recognition but you have to do things
like for example notice his foot on the
scale and understand enough about how
scales work that when a foot presses
down it exerts force that the scale is
sensitive to it doesn't just magically measure
people's weight but it does that somehow
through force you have to see who can
see that he's doing that and who can't
who cannot see that he's doing that
right in particular the person on the
scale and why some people can see that
he's doing that and can see that some
other people can't see it why that makes
it funny to them okay and someday we
should have machines that can understand
this but hopefully you can see why
the kind of architecture
that I'm talking about would be the
building blocks or the ingredients to be
able to get them to do that now when I
prepared a version of this talk
a few months ago I wrote to Andrej
and I said I was gonna use this and I
was curious you know if
he had any reflections on this and where
he thought we were relative to five
years ago because certainly
a lot of progress has been made but he
said here's his email I hope he doesn't
mind me sharing it but I mean again he's
a very honest person and that's one of
the many reasons why he's such an
important person right now in AI okay
he's both very technically strong and
honest about what we can do what we
can't do and as he says well what does
he say it's nice to hear from you it's
funny you should bring this up I was
also thinking about writing a return
to this and in short basically I don't
believe we've made very much progress
right he points out that in his long
list of things that you'd need to
understand the image we have made
progress on some the ability to again
detect people and do face recognition
for well-known individuals okay but
that's kind of about it all right
and he wasn't particularly optimistic
that the current route that's being
pursued in industry is anywhere close
to solving or even really trying to
solve these larger questions um if we
give this image to that caption bot you
know what we see again represents the
same point so here's the caption bot it
says I think it's a group of people
standing next to a man in a suit and tie
right so that's right right as far as it
goes it just doesn't go far enough and
the current ideas of build a
data set train a deep learning algorithm
on it and then repeat um aren't really
even I would venture trying to get to
what we're talking about or here's
another I'll just give you one other
example of a couple of photographs from
my recent vacation in a nice warm
tropical locale which I think
illustrates again the gap
where we have machines that can say beat
the world's best at Go but can't even
beat a child at tic-tac-toe
now what do I mean by that well you know
of course we can build we don't even
need reinforcement learning or deep
learning to build a machine that can
win or tie that is play optimally
in tic-tac-toe but think about this this
is a real tic-tac-toe game which I saw
on the grass outside my hotel right what
do you have to do to look at this and
recognize that it's a tic-tac-toe game
you have to see the objects you have to
see what's you know in some sense
there's a three by three grid but
it's only abstract right it's only
delimited by these ropes or strings
okay it's not actually a grid in any
simple geometric sense all right but yet
a child can look at that and indeed
here's an actual child who was looking
at it and recognized oh it's a game of
tic-tac-toe and even know what they need
to do to win
they put the X and completed it and now
they've got three in a row right that's
that's literally child's play okay
you show this sort of thing though to
one of these you know image
understanding caption bots and it says I think
it's a close-up of a sign okay again
saying that this is a
close-up of a sign is not the same
thing I would venture as a
cognitive or computational activity
that's going to give us what we need to
say recognize the objects to recognize
it as a game to understand the goal and
how to plan to achieve those goals
whereas this kind of architecture is
designed to try to do all of these
things ultimately right and I bring in
these examples of games or jokes to
really show where perception goes to
cognition you know and all the way
up to symbols right so to get objects
and forces and mental states that's the
cognitive core but to be able to get
goals and plans and what do I do or how
do I talk about it that's symbols okay
here's another way into this and it's
one that also motivates I think a lot of
really good work on the engineering side
and a lot of our interest in the science
side is think about robotics and think
about what do you have to do you know
what does the brain have to be like to
control the body so again you're gonna
hear from shortly I think maybe it's
next week from Marc Raibert who's one of
the founders of Boston Dynamics which is
one of my favorite companies anywhere
they're without doubt the leading maker
of humanoid robots legged locomoting
robots in industry they have all sorts
of other really cool robots robots like
dogs robots that have all you know I
think you'll even get to see a live
demonstration of one of their robots this is
really awesome impressive stuff okay um
but what about the minds and brains of
these robots well again if you ask Marc
ask him how much of human-like
cognition do they have in their robots
and I think he would say very little in
fact we have asked him that and he would
say very little he has said very little
he's actually one of the advisors of our
Center and I think in many ways we're
very much on the same page we both want
to know how do you build the kind of
intelligence that can control these
bodies like the way a human does alright
um here's another example of an industry
robotics effort this is Google's arm
farm
where you know they've they've got lots
of robot arms and they're trying to
train them to pick up objects using
various kinds of deep learning and
reinforcement learning techniques and I
think it's one approach I just think
it's very very different from the way
humans learn to say control their body
and manipulate objects and you can see
that in terms of things that go back to
what you were saying when you're
introducing me right think about how
quickly we learn things right here you
have the arm farm which is trying to
generate you know effectively maybe
not infinite but hundreds of thousands or
millions of examples of reaches and
pickups of objects even with just a
single gripper and yet a child who in
some ways can't control their body
nearly as well as robots can be
controlled at the low level is able
to do so much more so I'll show you two
of my favorite videos from YouTube here
which motivate some of the research that
we're doing the one on the left is a one
and a half year old and the other ones a
one year old so just watch this one and
a half year old here doing a popular
activity for many kids as a playing hmm
you see video up there I'd okay there we
go okay so he's he's on doing this
stacking Cup activity alright he's
stacking up cups to make a tall tower
he's got a stack of three and what you
can see for the first part of this video
is it looks like he's trying to make a
second stack and that he's trying to
pick up at once basically he's trying to
make a stack of two that'll go on the
stack of three and you know he's trying
to debug his plan because it got a
little bit stuck here but and think
about I mean again if you know anything
about robots manipulating objects even
just what he just did no robot can
decide to do that and actually do it
right at some point he's almost got it
it's a little bit tricky but at some
point he's gonna get that stack of two
he realizes he has to move that object
out of the way look at what he just did
move it out of the way use two hands to
pick it up and now he's got a stack of
two on a stack of three and suddenly you
know subgoal completed he's now got a
stack of five and he gives himself a
hand because he knows he
accomplished a key waypoint along the
way to his final goal that's a kind of
early symbolic cognition right to
understand that I'm trying to build a
tall tower but a tower is made up of
little towers it's you know it can end
and you can take a tower and put it on
top of another tower or stack a stack on
a stack
and you have a bigger stack right so
think about how he goes from bottom up
perception to the objects to the physics
needed to manipulate the objects to the
ability to make even those early kinds
of symbolic plans at some point he keeps
doing this he puts another stack on
there I'll just jump to the end
oops sorry you missed it so he he gets
really excited and he gives himself
another big hand but falls over okay
again Boston Dynamics now has robots
that could pick themselves up after that
that's really impressive again but all
the other stuff to get to that point we
don't really know how to do in a robotic
setting or think about this baby here
this is a younger baby this is one of
the Internet's very most popular videos
because it features a baby and a cat
but the baby's doing something
interesting he's got the same cups but
he's again decided to try a
new thing so think about creativity
he's decided that his goal is to stack
up cups on the back of a cat I guess
he's asking how many cups can I fit on
the back of a cat well three let's see
can I fit more let's try another one
okay well he can't fit more than three
it turns out and then it's
not working so he changes his goal now
his goal appears to be to get the cups
on the other side of the cat now watch
that part when he reaches back behind
him there that's I'll just pause it
there for a moment
um so when he just reached back there
that's a particularly striking moment in
the video it shows a very strong form of
what we call in cognitive science object
permanence okay that's the idea that you
represent objects as these permanent
enduring entities in the world even when
you can't see them in this case he
hadn't seen or touched that object
behind him for like at least a minute
right maybe much longer I don't know and
yet he still knew it was there and he
was able to incorporate it in his plan
right there's a moment before that when
he's about to reach for it but then he
sees this other one right and it's only
when he's now exhausted all the other
objects here that he can see he's like
okay now time to get this object and
bring it into play right so think about
what has to be going on in his brain for
him to be able to do that right that's
like the analog of you understanding
what's behind you okay um it's not that
these things are impossible to capture in
machines far from it it's just that like
training a deep neural network or any
kind of pattern recognition system we
don't think is going to do it but we
think by reverse engineering how it
works in the brain
we might be able to do it I think we can
do it okay it's not just humans that
do this kind of activity here's a couple
of again rather famous videos you can
watch all of these on YouTube
crows are famous object manipulators and
tool users but also orangutans other
primates rodents we can watch if we just
hey let me pause this one for a second
if we watch this orangutan here he's got
a bunch of big Legos and over the course
of this video he's building up a stack of
Legos it's really quite impressive
I'm just jumping to the end there's
actually some controversy out there of
whether this video is a fake but the
controversy isn't about you know it's
not like whether it was I don't know
done with computer animation some people
think the video was actually filmed
backwards that a human built up the
stack and the orangutan just slowly
disassembled it piece by piece and it
turns out it's remarkably hard to tell
whether it's played forward or backwards
in time and people have argued over
little details because you know it would
be quite impressive if an orangutan
actually was able to build up this
really impressive stack of Legos but I
would submit that it would be almost as
impressive if he disassembled it think
about the activity I mean if I wanted to
disassemble that the easiest thing to do
would just be to knock it over
that's really all most robots could do
but to piece by piece disassemble it
even if it's played backwards like this
that's still a really impressive act of
symbolic planning on physical objects or
here you've got this famous mouse
this you can find on the internet under
the mouse versus cracker video and what
you'll see here over the course of this
video is a mouse valiantly and mostly
hopelessly struggling with a cracker
that they're hoping to bring back to
their nest I guess it's a very appealing
big meal and at some point after just
trying to get it over the wall
at some point the mouse just gives up
because it's just never gonna happen and
he just goes away except that because
even mouses can dream or mice can dream
at some point he decides okay I'm just
gonna come out for one more try and he
tries one more time and this time
valiantly gets it over yeah isn't that
very impressive congratulations guys
okay you don't have to clap for him you can
clap for me at the end or clap for
whoever later okay but I want to applaud
the mouse there every time I see that
okay but again think what had to be
going on in his brain to be
able to do that all right it's a crazy
thing and yet he formulated the goal and
was able to achieve it I'll just show
one more video that is really more about
science these other ones are you know
some of them actually were from
scientific experiments but this is one
that motivates a lot of the science that
I do and it's to me it sets up kind of a
grand cognitive science challenge for AI
and robotics it's from an experiment
with humans again eighteen month olds or
one-and-a-half year olds so the kids
in this experiment were the same age as
the first baby I showed you the one who
did the stacking and 18 months is really
a very very good age to study if you're
interested in intelligence for reasons
we can talk about later if you're
interested this is from a very famous
experiment done by two psychologists
Felix Warneken and Michael Tomasello
and it was studying the spontaneous
helping behavior of young children it
also contrasted humans and chimps and
the punchline is that chimps sometimes do
things that are kind of like what this
human did but not nearly as reliably or
as flexibly okay and
I'll show you a particular kind of
unusual situation where human kids had
relatively little trouble figuring out
kind of what to do or even whether they
should do it whereas basically no chimp
did what you're gonna see humans
sometimes doing here so the experimenter
in this movie I'll turn on the sound
here if you can hear it the experimenter
is the tall guy and the participant is
the little kid in the corner there
there's sound but no words right and at
some point he stops and then the kid
just does whatever they want to do so
watch what he does he goes over he opens
the cabinet looks inside then he steps
back and he looks up at felix and then
looks down okay and then the action is
completed now what I want you to do is
watch it one more time and think about
what's gotta be going on inside the kid's
head to understand this
so what it looks like
to us is the kid figured out that this
guy needed help and helped him and the
paper is full of many other situations
like this this is just one OK but the
key idea is that the situation is
somewhat novel people have seen people
holding books and opening cabinets but
probably it's very rare to see this kind
of situation exactly right it's
different in some important details from
what you might have seen before and
there's other ones in there that are
really truly novel because they just
made up a machine right there
okay but somehow he has to understand
causally from the way the guy's banging
the books against the thing that it's
sort of both a symbol but
it's also somehow he's got to understand
what he can do and what he can't do and
then what the kid can do to help and
I'll show this again but really just
watch the main part I want you to see is
I'll just sort of skip ahead so watch
this part here let's see I'll just jump
right to when he watch right now he's about
to look up he looks up and makes eye
contact and then his eyes look down so
again he looks up he looks up and then a
saccade a sudden rapid eye movement down
down to his hands up down okay so that's
again that's this brain OS in action
right he's making one small
glance at the big guy's eyes just to
make eye contact to see to get a signal
did I understand what you wanted and did
you did you register that joint
attention and then he makes a prediction
about what the guy's gonna do so he
looks right down he doesn't just like
look around randomly he looks right down
to the guy's hands to track the action
that he expects to see happening if I
did the right thing to help you then I
expect you're gonna put the books there
okay so you can see these things
happening and we want to know what's
going on inside the mind that guides all
of that all right so that's the sort of
big scientific agenda that we're working
on over the next few years where we
think some kind of understanding
of human intelligence in scientific
terms could lead to all sorts of AI
payoffs in particular suppose we could
build a robot that could do what this
kid and many other kids in these
experiments do just say help you out
around the house without having to be
programmed or even really instructed
just to kind of get a sense oh yeah you
need a hand with that sure let me help
you out okay even 18 month olds will do
that sometimes

Summary

The following is a comprehensive, structured summary of the video transcript above.

***

# Building Human-Like Artificial Intelligence: A Cognitive Science and Probabilistic Approach

### Executive Summary
In this presentation, Professor Josh Tenenbaum of MIT discusses the limitations of modern artificial intelligence (AI), which relies on *deep learning* and pattern recognition, and contrasts it with human cognition, which can learn from very little data. He proposes "reverse engineering" human intelligence through the framework of "Probabilistic Programs" and the concept of a "Game Engine in the Head". The goal is to build machines that not only see but also understand the world, have *common sense*, and can plan actions the way people do.

### Key Takeaways
*   **The Gap Between AI and Humans:** Today's AI excels at pattern recognition (as in AlphaGo) but lacks the flexibility, common-sense understanding, and ability to learn quickly from little data that humans have.
*   **Brain Architecture:** CBMM (Center for Brains, Minds and Machines) is developing a visual architecture modeled on the brain: from a bottom-up visual stream (similar to a CNN) to a cognitive core that understands physics, space, and other people's minds.
*   **Game Engine in the Head:** The human brain operates like a *game physics engine* that simulates the world on the fly to predict outcomes and understand cause and effect.
*   **Learning as Programming:** Children's learning can be likened to *hacking* or programming, in which they rebuild their mental models based on experience.
*   **Probabilistic Programs:** The key technology, combining the strengths of symbolic representation, probabilistic reasoning, and *deep learning* to reach deeper understanding.
*   **Industry and Academic Collaboration:** Reaching AGI (Artificial General Intelligence) requires synergy between the long-term research of academia and the short-term computational power of industry.

---

### Detailed Breakdown

#### 1. The Limits of Current AI and the CBMM Vision
Josh Tenenbaum opens by pointing out that although practical AI has arrived, today's systems are still narrow "AI technologies". They can do one thing very well (for example, play Go or recognize faces) but have no general intelligence or common sense. AlphaGo, for instance, cannot drive a car or explain what the game of Go is. By contrast, a human (such as a 16-year-old) can pick up many skills quickly without any special engineering.

*   **CBMM's Goal:** To build a bridge between the science and the engineering of intelligence, with a focus on *reverse engineering* human intelligence in order to make machines more human-like.
*   **History of AI:** Tenenbaum notes that modern *neural networks* are in fact rooted in cognitive psychology, with the foundational papers published in psychology journals before the field grew within engineering.

#### 2. Visual Intelligence and Cognitive Architecture
The near-term focus of the research is visual intelligence. The human visual system is highly efficient: we feel that we see the whole world, even though our sensory data is limited (only the foveal region is sharp). The brain instantly computes spatial information, physics, and even other people's intentions.

*   **Proposed Architecture:**
    *   **Early Visual Module:** Receives visual input (similar to *Convolutional Neural Networks*).
    *   **Cognitive Core:** Builds an understanding of space, objects, physics, other people, and their minds.
    *   **Brain OS:** Combines bottom-up input with prior knowledge from memory, using symbols for language and planning.
*   **Critique of AI Captioning:** Tenenbaum shows examples of AI failing to describe images (for example, mistaking a car for a truck or failing to grasp emotional context), arguing that current AI only recognizes pixels rather than understanding meaning.

#### 3. Intuitive Physics and Psychology in Infants
Research shows that infants have intuitive "engines" for physics and psychology. In *looking time* experiments, infants show surprise when the laws of physics are violated or when an animated character acts inefficiently.

*   **Naive Utility Calculus:** Infants are sensitive to the "physical work" an agent does to reach a goal. The greater the effort expended, the stronger the desire the infant infers.
*   **Inverse Planning:** Computational models use a *physics engine* (such as MuJoCo) to plan the most efficient movements. Through Bayesian inference, the system can guess a person's goal just by watching their body movements, much as humans read intentions.
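
The inverse-planning idea above can be sketched in a few lines. This is only an illustrative toy, not the model from the talk: it assumes a 2D grid world instead of a physics engine such as MuJoCo, uses Manhattan distance as a stand-in for the cost of an efficient plan, and the names `path_cost`, `step_likelihood`, `infer_goal`, and the rationality parameter `beta` are all hypothetical. The agent is assumed to prefer steps that reduce the remaining cost to its goal, and Bayes' rule turns observed steps into a posterior over candidate goals.

```python
import math

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # right, left, up, down

def path_cost(pos, goal):
    """Cheapest remaining cost on an empty grid (Manhattan distance); an assumed stand-in for a planner."""
    return abs(pos[0] - goal[0]) + abs(pos[1] - goal[1])

def step_likelihood(pos, nxt, goal, beta=2.0):
    """P(step | goal): softmax over the four neighbors, favoring steps that leave less cost to the goal."""
    candidates = [(pos[0] + dx, pos[1] + dy) for dx, dy in MOVES]
    weights = [math.exp(-beta * path_cost(c, goal)) for c in candidates]
    # nxt must be one of the four neighbors of pos, otherwise index() raises ValueError
    return weights[candidates.index(nxt)] / sum(weights)

def infer_goal(trajectory, goals, beta=2.0):
    """Posterior over goals given observed positions, starting from a uniform prior."""
    posterior = {name: 1.0 / len(goals) for name in goals}
    for pos, nxt in zip(trajectory, trajectory[1:]):
        for name, goal in goals.items():
            posterior[name] *= step_likelihood(pos, nxt, goal, beta)
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

# Hypothetical scene: the agent starts at the origin and takes two steps to the right.
goals = {"cup": (5, 0), "door": (0, 5)}
posterior = infer_goal([(0, 0), (1, 0), (2, 0)], goals)
print(posterior)  # most of the probability mass should land on "cup"
```

A larger `beta` models a more strictly efficient agent, so each observed step becomes stronger evidence; a `beta` near zero treats movement as nearly random and leaves the posterior close to the prior.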

#### 4. Learning as Programming
If knowledge is a program (like a *game engine* in the head), then learning is "program learning", that is, learning how to program.

*   **The Child as Hacker:** Tenenbaum describes children as hackers who keep refining the "code" in their brains to make it more accurate, faster, and more efficient.
*   **One-Shot Learning:** Research published in *Science* (2015) presents a system that can learn new visual concepts (such as handwritten characters from an unfamiliar alphabet) from a single example. The system uses *Bayesian Program Learning*, in which seeing is the process of inverting a probabilistic program to find the code that generated the image. As a result, people could not tell machine-generated images from human-made ones (*Visual Turing Test*).

#### 5. The Future of AI, Emotion, and Collaboration
The closing section discusses the future of AI and the engineering challenges.

*   **Program Synthesis:** Combining programming-language tools with machine learning to find the shortest program that captures a dataset.
*   **The Role of Emotion:** Emotions are not just noise but an important part of our mental models of ourselves and our situation. Understanding emotion requires *counterfactual* reasoning (imagining "what if").
*   **Hardware Challenges:** Power consumption is a major obstacle. The human brain is extremely efficient. Researchers such as Joe Bates (Singular Computing) are working on power-efficient *brain-inspired* computing.
*   **Industry vs. Academia:** Industry focuses on short-term value (2-5 years), while academia focuses on long-term science. Each needs the other: academia provides an understanding of cognitive architecture, while industry provides resources and real-world challenges.
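
As a rough illustration of the program-synthesis bullet above ("find the shortest program that captures a dataset"), here is a minimal enumerative sketch. It is not a method described in the talk: the tiny expression language (the variable `x`, the constants 1 to 3, and the operators `+` and `*`) and the function names `expressions` and `synthesize` are assumptions made for this example. Programs are enumerated in order of size, so the first one that reproduces every example is a shortest one in this language.

```python
from itertools import product

def expressions(size):
    """Yield (source, function) pairs for every expression with exactly `size` leaves."""
    if size == 1:
        yield "x", (lambda x: x)
        for c in (1, 2, 3):
            yield str(c), (lambda x, c=c: c)
        return
    for left_size in range(1, size):
        lefts = list(expressions(left_size))
        rights = list(expressions(size - left_size))
        for (ls, lf), (rs, rf) in product(lefts, rights):
            yield f"({ls} + {rs})", (lambda x, lf=lf, rf=rf: lf(x) + rf(x))
            yield f"({ls} * {rs})", (lambda x, lf=lf, rf=rf: lf(x) * rf(x))

def synthesize(examples, max_size=4):
    """Return the source of a smallest expression consistent with all (input, output) examples, or None."""
    for size in range(1, max_size + 1):
        for source, fn in expressions(size):
            if all(fn(x) == y for x, y in examples):
                return source
    return None

# Hypothetical dataset generated by y = 2x + 1.
examples = [(0, 1), (1, 3), (2, 5)]
program = synthesize(examples)
print(program)
```

Brute-force enumeration like this blows up quickly as programs grow, which is one reason to pair it with learned guidance about which programs to try first.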

### Conclusion and Closing Message
Josh Tenenbaum closes by stressing that to reach artificial general intelligence (AGI), we cannot skip the level of cognitive understanding and jump straight to *neural circuits*. We have to understand the "software", the programs the brain runs. By combining insights from developmental psychology, neuroscience, and computer science (through *Probabilistic Programs*), we can build machines that are not only statistically intelligent but also have a deep, human-like understanding of the world. Creative collaboration between academia and industry is the key to realizing this vision.


file updated 2026-02-13 13:22:53 UTC