Pieter Abbeel: Deep Reinforcement Learning | Lex Fridman Podcast #10
l-mYLq6eZPY • 2018-12-16
Transcript preview
The following is a conversation with Pieter Abbeel. He's a professor at UC Berkeley and the director of the Berkeley Robot Learning Lab. He's one of the top researchers in the world working on how we make robots understand and interact with the world around them, especially using imitation learning and deep reinforcement learning. This conversation is part of the MIT course on artificial general intelligence and the Artificial Intelligence podcast. If you enjoy it, please subscribe on YouTube, iTunes, or your podcast provider of choice, or simply connect with me on Twitter at Lex Fridman, spelled F-R-I-D. And now, here's my conversation with Pieter Abbeel.

You've mentioned that if there was one person you could meet, it would be Roger Federer. So let me ask: when do you think we will have a robot that can fully autonomously beat Roger Federer at tennis, a Roger Federer-level player at tennis?

Well, first, if you can make it happen for me to meet Roger, let me know. In terms of getting a robot to beat him at tennis, it's an interesting question, because for a lot of the challenges we think about in AI, the software is really the missing piece, but for something like this, the hardware is nowhere near either. To really have a robot that can physically run around: the Boston Dynamics robots are starting to get there, but they're still not really at a human level of ability to run around and then swing a racket.

So that's a hardware problem.

I don't think it's a hardware problem only; I think it's a hardware and a software problem. I think it's both, and they'll have independent progress. So I'd say the hardware maybe in 10 to 15 years.

On clay, not grass? I guess grass is harder, with the sliding?

I'm not sure what's harder, grass or clay.

The clay involves sliding, which might be harder to master.

Actually, yeah. But you're not limited to bipedal robots; I mean, I'm sure you could build a machine. It's a whole different question, of course, if you can say, okay, this
robot can be on wheels, it can move around on wheels, and it can be designed differently; then I think that can probably be done sooner than a full humanoid type of setup.

What do you think about swinging a racket? You've worked on basic manipulation. How hard do you think the task of swinging a racket would be, being able to hit a nice backhand or forehand? Let's say we just set up a stationary, nice robot arm, a standard industrial arm, and it can watch the ball come in and then swing the racket.

It's a good question. I'm not sure it would be super hard to do. If we do it with reinforcement learning, it would require a lot of trial and error; it's not going to swing it right the first time around. But I don't see why it couldn't swing it the right way eventually. I think it's learnable. If you set up a ball machine, let's say, on one side, and a robot with a tennis racket on the other side, I think it's learnable, maybe with a little bit of pre-training in simulation. I think swinging the racket is feasible. It would be very interesting to see how much precision it can get, because some of the human players can hit it on the lines, which is very high precision.

With spin?

The spin is interesting: whether RL can learn to put a spin on the ball.

Well, you got me interested. Maybe someday we'll set this up. Your answer is basically: for this problem it sounds fascinating, but for the general problem of a tennis player we might be a little bit farther away.

What's the most impressive thing you've seen a robot do in the physical world?

Physically, for me, it's the Boston Dynamics videos. They always just hit home, and I'm just super impressed. Recently, the robot running up the stairs, doing the parkour-type thing. Yes, we don't know what's underneath; they don't really write up a lot of detail. But even if it's hard-coded underneath, which it
might or might not be, just the physical ability to do that parkour is very impressive.

So there's a lot going on right there. Have you met Spot Mini, or any of those robots, in person?

I met Spot Mini last year, in April, at the MARS event that Jeff Bezos organizes. They brought it out there, and it was nicely following Jeff around; when Jeff left the room, they had it follow him along, which was pretty impressive.

So there's some comfort in knowing there's no learning going on in those robots, but there's still the psychology of it. While knowing that, if there's any learning going on, it's very limited, I met Spot Mini earlier this year too, and knowing everything that's going on, having a one-on-one interaction, I got to spend some time alone with it, and there's immediately a deep connection on the psychological level. Even though you know the fundamentals of how it works, there's something magical. So do you think about the psychology of interacting with robots in the physical world? You just showed me the PR2 robot, and there was a little bit of something like a face, something that immediately draws you to it. Do you think about that aspect of the robotics problem?

Well, it's very hard with BRETT here, where we gave him a name, Berkeley Robot for the Elimination of Tedious Tasks, not to think of the robot as a person. It seems like everybody calls him a 'he', for whatever reason, and that also makes it more of a person than if it were an 'it'. It seems pretty natural to think of it that way.

This past weekend it really struck me. I've seen Pepper many times in videos, but then I was at an event, organized by Fidelity, and they had scripted Pepper to help moderate some sessions, and they had scripted Pepper to have the personality of a child, a little bit. It was very hard not to think of it as its own person in some sense, because it would just jump into conversation, making it
very interactive. The moderator would be speaking, and Pepper would just jump in: hold on, how about me, can I participate in this? It was just like a person, and it was one hundred percent scripted. Even then, it was hard not to have the sense that somehow there was something there.

So as we have robots interact in this physical world, is that a signal that can be used in reinforcement learning? You've worked a little bit in this direction, but do you think that psychology can somehow be pulled in?

That's a question a lot of people ask, and I think part of why they ask it is that they're thinking about how unique we really are. People, after they see some results, after they see a computer play Go or do some other task, still ask: okay, but can it really have emotion, can it really interact with us in that way? And then, once you're around robots, you already start feeling it. The way I think of it is this: if you run something like reinforcement learning, it's about optimizing some objective, and there's no reason that objective couldn't be tied to how much a person likes interacting with the system. Why couldn't the reinforcement learning system optimize for the robot being fun to be around? And why wouldn't it then naturally become more and more interactive, and more and more, maybe, like a person or like a pet? I don't know exactly what it would be, but it would have more of those features, and acquire them automatically.

As long as you can formalize an objective of what it means to like something, how you exhibit that. What's the ground truth? How do you get the reward from the human? You have to somehow collect that information from the human. But you're saying that if you can formulate it as an objective, it can be learned; there's no reason it couldn't emerge through learning.

Right, and one way to formulate it as an objective, you wouldn't necessarily have to score it explicitly. So standard
rewards are numbers, and numbers are hard to come by. Is this a 1.5 or a 0.7 on some scale? That's very hard for a person to judge. Much easier is for a person to say: okay, what you did the last five minutes was much nicer than what you did the previous five minutes. That gives a comparison. And in fact there have been some results on that. For example, Paul Christiano and collaborators at OpenAI had the MuJoCo Hopper, the one-legged robot, do backflips purely from feedback of the form 'I like this better than that' or 'these are about equally good'. After a bunch of interactions, it figured out what the person was asking for, namely a backflip. The robot wasn't told to do a backflip; it was just getting comparison scores from the person, who had in their own mind that they wanted a backflip. The robot didn't know what it was supposed to be doing; it just knew that sometimes the person said this is better, this is worse, and then it figured out that what the person was actually after was a backflip. I'd imagine the same would be true for things like more interactive robots: the robot would figure out over time, oh, this kind of thing is apparently appreciated more than this other kind of thing.

So, when I first picked up Richard Sutton's reinforcement learning book, before deep learning, before the re-emergence of neural networks as a powerful mechanism for machine learning, RL seemed to me like magic. It was beautiful, and it seemed like what intelligence is. So how do you think we can possibly learn anything about the world when the reward for the actions is so delayed, so sparse? Why do you think RL works? Why do you think you can learn anything under such sparse rewards, whether it's regular reinforcement learning or deep reinforcement learning? What's your intuition?
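The comparison-based feedback Abbeel describes (as in the Christiano et al. backflip result) can be sketched in a few lines. This is a minimal illustration, not the actual OpenAI system: trajectories are stand-in feature vectors, the "human" is a hidden linear preference, and a reward model is fit to pairwise comparisons with the Bradley-Terry logistic loss, so no numeric scores are ever given.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden "human" preference: trajectories are summarized as feature vectors,
# and the human secretly prefers higher w_true . x (e.g. "more backflip-like").
w_true = np.array([2.0, -1.0, 0.5])

def human_prefers(xa, xb):
    # True if the human says trajectory a is better than trajectory b.
    return xa @ w_true > xb @ w_true

# Collect pairwise comparisons instead of numeric scores.
pairs = []
for _ in range(500):
    xa, xb = rng.normal(size=3), rng.normal(size=3)
    pairs.append((xa, xb) if human_prefers(xa, xb) else (xb, xa))

# Fit a linear reward model with the Bradley-Terry / logistic loss:
# P(a preferred over b) = sigmoid(r(a) - r(b)).
w = np.zeros(3)
lr = 0.1
for _ in range(200):
    grad = np.zeros(3)
    for xa, xb in pairs:        # (preferred, other)
        p = 1.0 / (1.0 + np.exp(-(w @ xa - w @ xb)))
        grad += (1.0 - p) * (xa - xb)   # gradient of the log-likelihood
    w += lr * grad / len(pairs)

# The learned reward should agree with the human on held-out comparisons.
correct = 0
for _ in range(200):
    xa, xb = rng.normal(size=3), rng.normal(size=3)
    correct += (w @ xa > w @ xb) == human_prefers(xa, xb)
print(correct / 200)
```

The learned reward function could then be handed to any RL algorithm in place of a hand-written score, which is the structure of the backflip experiment.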
Part of it is: why does RL need so many samples, so many experiences, to learn from? What's really happening is that when you have a sparse reward, you do something, maybe you take a hundred actions, and then you get a reward. Maybe you get a score of three, and you think: okay, three, not sure what that means. You go again, and now you get a two. Now you know that the second sequence of a hundred actions was somehow worse than the first sequence of a hundred actions, but it's tough to know which of those actions were better or worse; some might have been good and some bad in either one. That's why you need so many experiences. But once you have enough experiences, effectively RL is teasing that apart. It starts to say: okay, what is consistently there when you get a higher reward, and what is consistently there when you get a lower reward? And then the magic of, say, the policy gradient update is to update the neural network to make the actions that were present when things went well more likely, and the actions that were present when things went less well less likely.

So that's the counterpoint. But it seems like you would need to run it a lot more than you do. Even though right now people would say RL is very inefficient, it seems way more efficient than one would imagine on paper. That the simple updates to the policy, the policy gradient, that somehow you can learn, exactly as you said, which common actions seem to produce good results, and that this can learn anything, seems counterintuitive, at least. Is there some intuition behind it?

Yeah, there are a few ways to think about this. The way I tended to think about it originally, when we started working on deep reinforcement learning here at Berkeley, which was maybe 2011, 2012, 2013, around that time, John Schulman was a PhD student initially driving it forward here.
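The policy gradient update described above ("make the actions present when things went well more likely") can be shown on a toy sparse-reward problem. This is a hand-rolled REINFORCE sketch on a five-armed bandit with a running-average baseline; the task and all numbers are invented for illustration, not from the conversation.

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny sparse-reward task: 5 actions, and the agent only ever sees a
# noisy end-of-episode score, never an explanation of it.
true_reward = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # action 2 is best

logits = np.zeros(5)      # softmax policy parameters
baseline = 0.0
lr = 0.5

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(2000):
    probs = softmax(logits)
    a = rng.choice(5, p=probs)
    r = true_reward[a] + rng.normal(scale=0.1)     # opaque score
    # REINFORCE: raise the log-probability of actions that were present
    # when the score beat the baseline, lower it otherwise.
    grad_logp = -probs
    grad_logp[a] += 1.0
    logits += lr * (r - baseline) * grad_logp
    baseline += 0.1 * (r - baseline)               # running-average baseline

print(int(np.argmax(softmax(logits))))
```

After training, the policy has concentrated on the action that was consistently present when scores were high, which is the "teasing apart" Abbeel describes.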
The way we thought about it at the time was: if you think about rectified linear units, ReLU-type neural networks, what do you get? You get something that's piecewise linear feedback control. And if you look at the literature, linear feedback control is extremely successful; it can solve many, many problems surprisingly well. I remember, for example, when we did helicopter flight: if you're in a stationary flight regime, like hover, you can use linear feedback control to stabilize the helicopter, a very complex dynamical system, but the controller is relatively simple. So I think a big part of it is that if you do feedback control, even though the system you control can be very, very complex, often relatively simple control architectures can already do a lot.

But then also, just linear is not good enough. One way you can think of these neural networks is that they tile the space, which people were already trying to do by hand or with finite state machines: this linear controller here, this linear controller there. The neural network learns to tile the space itself: a linear controller here, another linear controller there. But it's more subtle than that. It's benefiting from the linear control aspect, and it's benefiting from the tiling, but it's tiling the space one dimension at a time: if you have, say, a two-layer network, then in the hidden layer a unit makes a transition from active to inactive, or the other way around, and that's essentially one axis, not an axis exactly, but one direction along which things change. So you have a very gradual tiling of the space, with a lot of sharing between the linear controllers that tile it. That was always my intuition for why to expect this might work pretty well: it's essentially leveraging the fact that linear feedback control is so good, but of course not enough on its own, and this is a gradual tiling of the space with linear feedback controllers.
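The "ReLU networks are piecewise linear feedback control" intuition can be checked numerically: inside one activation region a small ReLU network applies one fixed linear map (one "linear controller"), and a different linear map once the active units change. A minimal sketch with made-up random weights:

```python
import numpy as np

rng = np.random.default_rng(2)

# A tiny two-layer ReLU network: y = w2 . relu(W1 x + b1)
W1 = rng.normal(size=(8, 2))
b1 = rng.normal(size=8)
w2 = rng.normal(size=8)

def net(x):
    return w2 @ np.maximum(W1 @ x + b1, 0.0)

def slope(x, eps=1e-6):
    # Numerical gradient: the "local linear controller" at x.
    g = np.zeros(2)
    for i in range(2):
        d = np.zeros(2)
        d[i] = eps
        g[i] = (net(x + d) - net(x - d)) / (2 * eps)
    return g

x = rng.normal(size=2)
# Within one activation region the slope is identical...
same = np.allclose(slope(x), slope(x + 1e-4 * rng.normal(size=2)), atol=1e-4)
# ...but far away the active units change, and so does the linear map.
far = np.allclose(slope(x), slope(x + 10.0), atol=1e-4)
print(same, far)
```

Each sign pattern of the hidden layer defines one tile, and neighboring tiles share all the weights of the units that did not flip, which is the weight-sharing across controllers mentioned above.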
These linear feedback controllers share a lot of expertise across one another.

That's a really nice intuition. Do you think it scales to more and more general problems? When you start going up in the number of dimensions to control, when you start going down in how often you get a clean reward signal, does that intuition carry forward to those crazier, weirder worlds that we think of as the real world?

I think where things get really tricky in the real world, compared to the things we've looked at so far with great success in reinforcement learning, is the time scales, which take this to an extreme. When you think about the real world, well, maybe some student decided to do a PhD here, right? That's a very high-level decision. But if you think about their life, any person's life, it's a sequence of muscle fiber contractions and relaxations. That's how you interact with the world, and it's a very high-frequency control problem. But it's ultimately what you do and how you affect the world, until, I guess, we have brain readings and you can maybe do it slightly differently, but typically that's how you affect the world. And the decision to do a PhD is so abstract relative to what you're actually doing in the world. I think that's where credit assignment becomes completely beyond what any current RL algorithm can do, and we need hierarchical reasoning at a level that is just not available at all yet.

Where do you think we can pick up hierarchical reasoning? By which mechanisms?

Maybe let me first highlight what I think the limitations are of what was already done 20 or 30 years ago. Back then you would in fact find reasoning systems that reason over relatively long horizons, but the problem was that they were not grounded in the real world. People had to hand-design some kind of logical, dynamical description of the world, and that didn't tie into perception, didn't tie into real objects, and so forth. And
so that was a big gap. Now, with deep learning, we're starting to have the ability to really see with sensors, to process that and understand what's in the world, so it's a good time to try to bring these things together. I see a few ways of getting there. One way would be to say deep learning can get bolted on, somehow, to some of these more traditional approaches. Bolted on would probably mean you need to do some kind of end-to-end training, where your deep learning processing leads to a representation that in turn works with some kind of traditional underlying dynamical system that can be used for planning. That's, for example, the direction Aviv Tamar and Thanard Kurutach here have been pushing with Causal InfoGAN, and of course other people too. So that's one way: can we somehow force it into a form factor that is amenable to reasoning?

Another direction we had been thinking about for a long time, without making much progress, was more information-theoretic approaches. The idea there was that what it means to take a high-level action is to choose a latent variable now that already tells you a lot about what's going to be the case in the future, because that's what a high-level action is. Say I decide I'm going to navigate to the gas station, because I need to get gas for my car. It'll take five minutes to get there, but the fact that I'll get there you could already tell from the high-level action I took much earlier. We had a very hard time getting success with that. I'm not saying it's a dead end necessarily, but we had a lot of trouble getting it to work. So then we started revisiting the notion of what we're really trying to achieve. What we're trying to achieve is not necessarily hierarchy per se; you can instead think about what hierarchy would give us. What we hope it would give us is better credit assignment. And what does better credit assignment give us? It gives us faster learning, right? And
so faster learning is maybe ultimately what we're after. That's what we ended up with in the RL² paper on learning to reinforcement learn, which Rocky Duan led at the time. That's exactly the meta-learning approach: okay, we don't know how to design hierarchy, but we know what we want to get from it, so let's just end-to-end optimize for what we want to get from it and see if something like hierarchy emerges. And we saw things emerge. In maze navigation there was consistent motion down hallways, which is what you'd want hierarchical control to deliver: it should say, I want to go down this hallway, and then, when there's an option to take a turn, it can take the turn or not, and repeat. It even had a notion of whether it had been in a place before, so as not to revisit places it had already been. It still didn't scale to the real-world kinds of scenarios I think you had in mind, but it was some sign of life that maybe you can meta-learn these hierarchical concepts.

It seems like these meta-learning concepts get at what I think is one of the hardest and most important problems in AI, which is transfer learning, generalization. How far along the journey toward building general systems are we, in terms of being able to do transfer learning well? There are some signs that you can generalize a little bit, but do you think we're on the right path, or are totally different breakthroughs needed to be able to transfer knowledge between different learned models?

I'm pretty torn on this. There are some very impressive results already, right? I would say, even with the initial big breakthrough in 2012 with AlexNet, the initial result was: great, this does better on ImageNet, on image recognition. But then immediately thereafter came the notion that, wow, what was learned on ImageNet, when you now want to solve a new task, you can fine-tune AlexNet for the new task. And that was often found to be the
even bigger deal: that you had learned something reusable, which was not often the case before. Usually in machine learning, you learned something for one scenario, and that was it. And that's really exciting; it's a huge application, probably the biggest success of transfer learning today in terms of scope and impact. That was a huge breakthrough. And then, recently, something similar has happened by scaling things up: it seems that if you train even bigger networks, they might transfer even better. If you look, for example, at some of the OpenAI results on language models, and some of the recent Google results on language models, they are trained just for prediction, and then they get reused for other tasks. So I think there is something there, where, somehow, if you train a big enough model on enough things, it seems to transfer. Some DeepMind results I thought were very impressive too, the UNREAL results, where it learned to navigate mazes in ways where it wasn't just reinforcement learning; there were other auxiliary objectives it was optimizing for. So I think there are a lot of interesting results already. What's maybe hard is to wrap my head around to what extent, or when, we call something generalization, and what levels of generalization are involved in these different tasks.

So, just to frame things, I've heard you say somewhere that there's a difference between learning to master and learning to generalize. It's a nice line to think about, and I guess you're saying there's a gray area between learning to master and learning to generalize.

I think I might have heard that somewhere else; it might have been one of your interviews, maybe the one with Yoshua Bengio, I'm not a hundred percent sure, but I liked the example. I'm not sure who it was from, but the example was essentially this: if you use current deep learning techniques to
predict, let's say, the relative motion of our planets, they would do pretty well. But if a massive new mass entered our solar system, they would fail to predict what would happen, right? And that's a different kind of generalization: a generalization that relies on the ultimate simplest explanation we have available today for the motion of planets, whereas pure pattern recognition could predict our current solar system's motion pretty well, no problem. So I think that's an example of a kind of generalization that is a little different from what we've achieved so far, and it's not clear whether just regularizing more, forcing it to come up with a simpler explanation, would get us there. But that's what physics researchers do, right? They say: can I make this even simpler? How simple can I get it? What's the simplest equation that can explain everything?

Right, the master equation for the entire dynamics of the universe.

We haven't really pushed that direction as hard in deep learning, I would say. I'm not sure whether it should be pushed, but it seems like a kind of generalization you get from that which you don't get from our current methods.

So, I just talked to Vladimir Vapnik, for example, who comes from statistical learning theory, and he dreams of creating an E = mc² of learning, the general theory of learning. Do you think that's a fruitless pursuit in the near term, within the next several decades?

I think it's a really interesting pursuit, in the following sense: there's a lot of evidence that the brain is pretty modular. So I wouldn't maybe think of it as the theory, the single underlying theory, but more as a principle. There have been findings where people who are blind will use the part of the brain usually used for vision for other functions, and even after other kinds of rewiring, people might be able to reuse parts of their
brain for other functions. What that suggests is some kind of modularity, and I think it's a pretty natural thing to strive for, to see whether we can find that modularity. Of course, not every part of the brain is exactly the same, and not everything can be rewired arbitrarily, but if you think of something like the neocortex, which is a pretty big part of the brain, it seems fairly modular from the findings so far. Can you design something equally modular, such that if you just grow it, it becomes more capable? I think that would be an interesting underlying principle to shoot for, and one that's not unrealistic.

Do you prefer math or empirical trial and error for discovering the essence of what it means to do something intelligent? Reinforcement learning embodies both camps, right? You prove that something converges, prove the bounds, and at the same time a lot of the successes are of the form: let's try this and see if it works. Which do you gravitate toward? How do you think about those two parts of your brain?

Maybe I would prefer that we could make the progress with mathematics, and the reason is that often, if you have something you can mathematically formalize, you can leapfrog a lot of experimentation, and experimentation takes a long time to get through. There's a lot of trial and error in the reinforcement-learning research process; you need a lot of trial and error before you get to a success. So if we can leapfrog that, in my mind, that's what the math is for. Hopefully, once you do a bunch of experiments, you start seeing a pattern and can do some derivations that leapfrog some experiments. But I agree with you: in practice, a lot of the progress has been such that we have not been able to find the math that allows us to leapfrog ahead, and we are making gradual progress, one step at a time, a new experiment here, a new experiment there, that gives us new
insights, gradually building things up, but not yet getting to an equation that explains what would otherwise have been two years of experimentation, an equation that tells us what the result is going to be. Unfortunately, not so much yet.

Not so much yet, but the hope is there. In trying to teach robots or systems to do everyday tasks, or even in simulation, what are you more excited about: imitation learning, or self-play? Letting robots learn from humans, or letting robots try to figure things out in their own way and, through play, eventually interact with humans and solve whatever the problem is? Which is more exciting to you, and more promising, as a research direction?

When we look at self-play, what's so beautiful about it goes back to the challenges of reinforcement learning. The challenge in reinforcement learning is getting signal: if you never succeed, you don't get any signal. In self-play, you're on both sides, so one of you succeeds, and, the beauty of it, one of you also fails. You see the contrast: you see the version of you that did better than the other version. So every time you play yourself, you get signal. Whenever you can turn something into self-play, you're in a beautiful situation where you can naturally learn much more quickly than in most other reinforcement-learning environments. So I think that if we could somehow turn more reinforcement learning problems into self-play formulations, that would go really far. So far, self-play has largely been applied to games, where there are natural opponents. But if we could do self-play for other things, let's say a robot learning to build a house, which is a pretty advanced thing to try to do, or maybe it tries to build a hut or something, if that could be done through self-play, it would learn a lot more quickly, if somebody could figure out how.
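The point that every self-play game yields signal, because one copy of the agent wins and the other loses, can be shown on a toy symmetric game. The game here (guess closest to a hidden target) and all parameters are invented for illustration: one shared policy plays both sides, and after each game the winner's move is reinforced and the loser's discouraged, so learning never waits for an external reward.

```python
import numpy as np

rng = np.random.default_rng(3)

TARGET = 7              # hidden optimum; neither player is told this
logits = np.zeros(10)   # one shared policy plays both sides (self-play)
lr = 0.3

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(3000):
    probs = softmax(logits)
    a, b = rng.choice(10, p=probs), rng.choice(10, p=probs)
    if a == b:
        continue                          # a draw carries no contrast
    winner = a if abs(a - TARGET) < abs(b - TARGET) else b
    loser = b if winner == a else a
    # Every game produces signal: the winner's move goes up, the loser's down.
    for act, sign in ((winner, 1.0), (loser, -1.0)):
        g = -probs.copy()
        g[act] += 1.0
        logits += lr * sign * g

print(int(np.argmax(softmax(logits))))
```

Even with no reward function at all, only win/lose contrast against itself, the policy concentrates on the hidden optimum, which is the mechanism Abbeel is pointing at.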
And I think that would be something that gets closer to the mathematical leapfrogging: somebody figures out a formalism and says, take any RL problem, apply this and this idea, and you can turn it into a self-play problem where you get signal a lot more easily. The reality is that for many problems we don't know how to turn them into self-play. So either we need to provide a detailed reward, one that doesn't just reward achieving the goal but rewards making progress, and that becomes time-consuming. And once you're doing that, say you want a robot to do something and you need to give all this detailed reward, well, why not just give a demonstration? Why not just show the robot? Then the question is how you show the robot. One way is to teleoperate the robot; the robot then really experiences things, which is nice because that's very high signal-to-noise-ratio data. We've done a lot of that, and in just ten minutes you can teach a robot a new basic skill: okay, pick up the bottle, place it somewhere else. That's a skill, no matter where the bottle starts; maybe it always goes onto a target or something. That's fairly easy to teach a robot with teleoperation. What's even more interesting is if you can teach a robot through third-person learning, where the robot watches you do something, doesn't experience it, but just watches and says: okay, if you're showing me that, it means I should be doing this, and I'm not going to use your hand, because I don't get to control your hand, but I'm going to use my hand and do that mapping. And that's where I think one of the big breakthroughs happened this year, led by Chelsea Finn here. It's almost like machine translation for demonstrations: you have a human demonstration, and the robot learns to translate it into what it means for the robot to do the same thing. That was done with meta-learning, learning from one to get the other, and I think it opens up a lot of opportunities to learn a lot more quickly.
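Teaching a skill from teleoperated demonstrations, as described above, is at its core supervised learning on logged (state, action) pairs. A minimal behavioral-cloning sketch, with a made-up proportional controller standing in for the human teleoperator:

```python
import numpy as np

rng = np.random.default_rng(4)

# Pretend teleoperation: the "human" drives toward a goal with a simple
# proportional controller, and we log (state, action) pairs.
goal = np.array([1.0, -0.5])

def expert(state):
    return 0.8 * (goal - state)   # action = move toward the goal

states = rng.normal(size=(200, 2))
actions = np.array([expert(s) for s in states])

# Behavioral cloning = plain supervised learning on the demonstrations.
# Fit action ~ A @ state + b by least squares.
X = np.hstack([states, np.ones((200, 1))])
theta, *_ = np.linalg.lstsq(X, actions, rcond=None)

def clone(state):
    return np.append(state, 1.0) @ theta

test_state = rng.normal(size=2)
print(np.allclose(clone(test_state), expert(test_state), atol=1e-6))
```

This is the first-person, teleoperated case; the third-person work mentioned (Finn et al.) adds the harder step of translating someone else's viewpoint and embodiment into the robot's own actions, which this sketch does not attempt.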
So, my focus is on autonomous vehicles. Do you think autonomous driving is amenable to this kind of third-person-watching approach?

For autonomous driving, I would say third-person is slightly easier, and the reason I say only slightly easier than first-person is that the car dynamics are very well understood. The distinction between third-person and first-person is not a very important one for autonomous driving; they're very similar, because the distinction is really about who turns the steering wheel. Or let me put it differently: how to get from the point where you are now to a point, say, a couple of meters in front of you is a problem that's very well understood, and that's the only place the third- versus first-person distinction shows up. Whereas with robot manipulation, the interaction forces are very complex, and it's still a very different thing. For autonomous driving, I think there's still the question of imitation versus RL. Imitation gives you a lot more signal, but where imitation is lacking, and needs some extra machinery, is that in its normal format it doesn't think about goals or objectives. Of course, there are versions of imitation learning, inverse-reinforcement-learning-type imitation, which do think about goals, and I think then we're getting much closer. But I think it's very hard to imagine a fully reactive car generalizing well if it really doesn't have a notion of objectives. To generalize the way you would want, you'd want more than just the reactivity you get from behavioral cloning, from plain supervised learning.

So a lot of this work, whether it's self-play or imitation learning, would benefit significantly from simulation, from effective simulation, and you're doing a lot of work both in the physical world and in simulation. Do you have hope for greater and greater power of
simulation, of it being boundless eventually, to where most of what we need to operate in the physical world could be simulated to a degree that's directly transferable? Or are we still very far away from that?

I think we could even rephrase that question in some sense.

Please.

There's the power of simulation: simulators get better and better, and as they become stronger we can learn more in simulation. But there's also another version, where you say the simulator doesn't even have to be that precise, as long as it's somewhat representative. Instead of trying to get one simulator that is sufficiently precise to learn in and transfer really well to the real world, I'm going to build many simulators.

An ensemble of simulators.

An ensemble of simulators. No single one of them is sufficiently representative of the real world that training in it alone would work, but if you train in all of them, then there is something that's good in all of them, and the real world will just be, you know, another one: not identical to any one of them, but just another one of them.

Another sample from the distribution of simulators.

Exactly. We do live in a simulation, so this is just one other one. Well, I'm not sure about that, but it's definitely a very advanced simulator if it is.

Yeah, it's a pretty good one. I've talked about this with Stuart Russell, and it's something you think about a little bit too. Of course you're really trying to build these systems, but do you think about the future of AI? A lot of people are concerned about safety. How do you think about AI safety as you build robots that operate in the physical world? How do you approach that problem in an engineering, systematic way?

So when a robot is doing things, you have a few notions of safety to worry about. One is that the robot is physically strong and of course could do a lot of damage, same for cars, which we can think of as robots in some way, and this
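The ensemble-of-simulators idea (often called domain randomization) can be sketched in a toy setting: tune a controller so it works across many simulators with randomized physics, then treat the real world as just another sample from that distribution. Everything here, the 1-D point-mass dynamics, the mass range, and the gain grid, is invented for illustration:

```python
import numpy as np

# Domain-randomization sketch: no single simulator matches the real world,
# so tune across many randomized ones and hope reality looks like
# "just another sample". Toy 1-D point mass with a P-controller.
rng = np.random.default_rng(1)

def rollout_error(gain, mass, steps=200, dt=0.05):
    """Distance from target after running a P-controller (plus fixed damping)."""
    pos, vel, target = 1.0, 0.0, 0.0
    for _ in range(steps):
        force = -gain * (pos - target) - 2.0 * vel
        vel += (force / mass) * dt
        pos += vel * dt
    return abs(pos - target)

# Ensemble of simulators: same physics, randomized mass parameter.
masses = rng.uniform(0.5, 2.0, size=20)

# Pick the gain whose WORST-case error across the ensemble is smallest.
gains = np.linspace(0.5, 5.0, 10)
worst = [max(rollout_error(g, m) for m in masses) for g in gains]
best_gain = gains[int(np.argmin(worst))]

# "Real world" = another draw from the same distribution of simulators.
real_mass = rng.uniform(0.5, 2.0)
print(rollout_error(best_gain, real_mass) < 0.1)
```

Optimizing the worst case over the ensemble is one simple design choice; training a single policy jointly on all the randomized simulators, as in the work discussed here, is the learned analogue.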
could be completely unintentional. So it might not be the kind of long-term AI-safety concern where, okay, AI is smarter than us and now what do we do, but something very practical: okay, if this robot makes a mistake, what is the result going to be? Of course simulation comes in a lot there too, to test in simulation.

It's a difficult question, and I'm always wondering. Let's go back to driving; a lot of people know driving well. What do we do to test somebody for driving, to get a driver's license? What do they really do? You fill out some test and then you drive, and, at least in suburban California, the driving test is just: you drive around the block, pull over, do a stop sign successfully, then you pull over again, and you're pretty much done. And you're like, okay, if a self-driving car did that, would you trust that it can drive? And you'd say no, that's not enough for me to trust it. But somehow for humans we've figured out that somebody being able to do that is representative of them being able to do a lot of other things. So I think for humans we've figured out representative tests: if you can do this, here is what you can really do.

Of course, humans aren't tested at all times, while self-driving cars and robots can be tested much more often; you can have replicas that get tested and are known to be identical because they use the same neural net, and so forth. But still, I feel like we don't have the kind of unit tests, or proper tests, for robots, and I think there's something very interesting to be thought about there, especially as you update things. Your software improves, you have a better self-driving car suite, you update it: how do you know it's indeed more capable on everything than what you had before, and that no bad behavior crept into it? I think that's a very interesting direction of research where there is no real solution yet, except that somehow for
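The "unit tests for robots" worry about software updates can be made concrete as a regression gate: replay a fixed scenario suite and accept a new policy only if it scores at least as well as the old one on every scenario. The policies, scenarios, and score function below are entirely made up for illustration:

```python
# Regression-gate sketch for policy updates: require no per-scenario
# regressions before shipping. All names here are hypothetical.

def no_regressions(old_policy, new_policy, scenarios, score):
    """Accept the update only if new >= old on every recorded scenario."""
    return all(score(new_policy, s) >= score(old_policy, s) for s in scenarios)

# Toy example: a "policy" is a braking distance; score rewards stopping
# before the obstacle that appears at the scenario's distance.
scenarios = [{"obstacle_at": d} for d in (5.0, 10.0, 20.0)]

def score(policy, scenario):
    return 1.0 if policy["brake_threshold"] <= scenario["obstacle_at"] else 0.0

old = {"brake_threshold": 4.0}
good_update = {"brake_threshold": 3.0}   # still stops in time everywhere
bad_update = {"brake_threshold": 12.0}   # regresses on the 5 m scenario

print(no_regressions(old, good_update, scenarios, score))  # True
print(no_regressions(old, bad_update, scenarios, score))   # False
```

The open research question raised here is exactly what the scenario suite and score should be so that passing it is as representative as a human driving test apparently is.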
humans we do it, because we say: okay, you have a driving test, you passed, you can go on the road now, and humans then have an accident every, what, million or ten million miles, something pretty phenomenal compared to that short test.

Yeah. So let me ask: you've mentioned that Andrew Ng, by example, showed you the value of kindness. Do you think the space of policies, good policies for humans and for AI, is populated by policies of kindness, or by ones that are the opposite: exploitation, even evil? If you just look at the sea of policies we operate under as human beings, or that an AI system would have to operate under in the real world, do you think it's really easy to find policies that are full of kindness, like we naturally fall into them, or is it a very hard optimization problem?

I mean, there are kind of two optimizations happening for humans, right? For humans there was the very long-term optimization, which evolution has done for us, and we're predisposed to like certain things. That's sometimes what makes our learning easier: we know things like pain and hunger and thirst, and the fact that we know about those is not something we were taught; that's innate. When we're hungry we're unhappy, when we're thirsty we're unhappy, when we have pain we're unhappy, and ultimately evolution built that into us. So I think there is a notion that humans somehow evolved, in general, to prefer to get along in some ways, but at the same time also to be very territorial and centric to their own tribe. It seems like that's the kind of space we've converged to. I'm not an expert in anthropology, but it seems like we're very good within our own tribe but need to be taught to be nice to other tribes.

Well, Steven Pinker highlights this pretty nicely in The Better Angels of Our Nature, where he talks about violence decreasing over time
consistently. So whatever tensions arise, whatever teams we pick, it seems that the long arc of history goes towards us getting along more and more.

I hope so. So do you think it's possible to teach RL-based robots this kind of kindness, this kind of ability to interact with humans, this kind of policy? Even, let me ask a fun one: do you think it's possible to teach an RL-based robot to love a human being, and to inspire that human to love the robot back? So, an RL-based algorithm that leads to a happy marriage?

That's an interesting question. Maybe I'll answer it with another question, and then come back to it. Another question you can ask is: how much happiness do some people get from interacting with just a really nice dog? I mean, dogs: you come home, and that's what dogs do, they greet you, they're excited, and it makes you happy when you come home to your dog. You're like, okay, this is exciting, they're always happy when I'm here. And if they don't greet you, because maybe your partner took them on a trip or something, you might not be nearly as happy when you get home, right? And so it seems like the level of reasoning a dog has is pretty sophisticated, but still not at the level of human reasoning. So it seems we don't even need to achieve human-level reasoning to get very strong affection with humans. And so my thinking is, why not? Why couldn't we, with an AI, achieve the kind of affection that humans feel among each other, or with friendly animals, and so forth? Whether that's a good thing for us or not is another question, but I don't see why not.

Why not, yeah. So it's almost as if love is the answer; or maybe love is the objective function, and then RL is the answer.

Maybe. Pieter, thank you so much. I don't want to take up more of your time. Thank you so much for talking today.

Well, thanks for
coming by. Great to have you visit.