AlphaZero and Self Play (David Silver, DeepMind) | AI Podcast Clips
e77NkSjnyH4 • 2020-04-04
So the next incredible step, really the profound step, is probably AlphaGo Zero. I mean, it's arguable; I kind of see them all as part of the same path, but perhaps you were already thinking that AlphaGo Zero was the natural step, that it was always going to be the next step. It's removing the reliance on human expert games for pre-training, as you mentioned. So how big of an intellectual leap was it that self-play could achieve superhuman-level performance on its own? And maybe could you also say what self-play is? I kind of mentioned it, but if you could tell us.

So let me start with
self-play. The idea of self-play is really about systems learning for themselves, but in a situation where there's more than one agent. If you're in a game, and the game is played between two players, then self-play is really about understanding that game just by playing games against yourself, rather than against any actual real opponent. So it's a way to discover strategies without having to go out and play against any particular human player, for example.
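The loop Silver describes can be made concrete with a toy sketch. This is my illustration, not DeepMind's code: a single tabular agent plays the game of Nim against itself and learns purely from the win/loss signal at the end of each game, with no expert data.

```python
import random

# Toy illustration of self-play (not DeepMind's code): one tabular agent
# plays Nim against itself. Players alternately take 1-3 stones; whoever
# takes the last stone wins. The only feedback is the final win/loss.

def self_play_train(pile=10, episodes=50000, alpha=0.3, eps=0.3, seed=0):
    rng = random.Random(seed)
    Q = {}  # (stones_left, take) -> estimated outcome for the player to move
    for _ in range(episodes):
        stones, history = pile, []
        while stones > 0:
            moves = [a for a in (1, 2, 3) if a <= stones]
            if rng.random() < eps:  # explore
                a = rng.choice(moves)
            else:                   # exploit the current value table
                a = max(moves, key=lambda m: Q.get((stones, m), 0.0))
            history.append((stones, a))
            stones -= a
        # Credit the final outcome backwards, flipping sign each ply because
        # the two "players" are the same agent seen from alternating sides.
        outcome = 1.0  # the mover who took the last stone won
        for state_action in reversed(history):
            q = Q.get(state_action, 0.0)
            Q[state_action] = q + alpha * (outcome - q)
            outcome = -outcome
    return Q

def best_move(Q, stones):
    moves = [a for a in (1, 2, 3) if a <= stones]
    return max(moves, key=lambda m: Q.get((stones, m), 0.0))
```

With enough episodes the table converges toward optimal Nim play (take `stones % 4` stones when that is a legal move), without the agent ever seeing an expert game.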
The main idea of AlphaZero was really to step back from any of the knowledge that we'd put into the system and ask the question: is it possible to come up with a single elegant principle by which a system can learn for itself all of the knowledge which it requires to play a game such as Go? Importantly, by taking knowledge out, you not only make the system less brittle, in the sense that perhaps the knowledge you were putting in was just getting in the way and maybe stopping the system from learning for itself, but you also make it more general. The more knowledge you put in, the harder it is for the system to be taken out of the setting for which it was designed and placed into some other setting that might need a completely different knowledge base to understand and perform well. So the real goal here is to strip out all of the knowledge that we put in, to the point that we can just plug the system into something totally different. And that, to me, is really the promise of AI: that we can have systems such that, no matter what goal we set, we have an algorithm which can be placed into that world, into that environment, and can succeed in achieving that goal. That, to me, is almost the essence of intelligence, if we can achieve it. AlphaZero is a step towards that, and it's a step that was taken in the context of two-player, perfect-information games like Go and chess; we also applied it to Japanese chess.

So just to clarify, the first step was AlphaGo Zero?

The first step was to try and take all of the knowledge out of AlphaGo in such a way that it could play in a fully self-discovered way, purely from self-play. And to me, the motivation for that was always that we could then plug it into other domains, but we saved that bet until later. Well, in fact, just for fun, I could tell you exactly the moment when the idea for AlphaZero occurred to me, because I think there's maybe a lesson there for researchers who are too deeply embedded in their research, working 24/7 to try and come up with the next idea. It actually occurred to me on honeymoon, in my most fully relaxed state, really enjoying myself, when the algorithm for AlphaZero just appeared, in its full form. This was actually before we played against Lee Sedol, but I think we were so busy trying to make sure we could beat the world champion that it was only later that we had the opportunity to step back and start examining that deeper scientific question of whether this could really work.
So, nevertheless, self-play is probably one of the most profound ideas in artificial intelligence, to me at least. But the fact that you could use that kind of mechanism to beat world-class players is very surprising; intuitively, it feels like you have to train on a large number of expert games. So was it surprising to you? What was the intuition, not necessarily at that time but even now: what's your intuition for why this thing works so well, why it was able to learn from scratch?

Well, let me first say why we tried it. We tried it both because I felt that it was the deeper scientific question to be asking, to make progress towards AI, and also because, in general in my research, I don't like to work on questions for which we already know the likely outcome. I don't see much value in running an experiment where you're 95% confident that you will succeed. So we could have taken AlphaGo and done something which we knew for sure would succeed, but much more interesting to me was to try it on the things which we weren't sure about. And one of the big questions on our minds back then was: could you really do this with self-play alone? How far could that go? Would it be as strong? And honestly, we weren't sure. It was 50/50, I think. If you'd asked me, I wasn't confident that it could reach the same level as those systems, but it felt like the right question to ask, and even if it had not achieved the same level, I felt that it was an important direction to be studying. And then, lo and behold, it actually ended up outperforming the previous version of AlphaGo, and indeed was able to beat it by 100 games to zero.
So what's the intuition as to why?

I think the intuition, to me, is clear: whenever you have errors in a system, as we did in AlphaGo (AlphaGo suffered from these delusions: occasionally it would misunderstand what was going on in a position and misevaluate it), how can you remove all of those errors? Errors arise from many sources. For us, they were arising both from the human data it started from and from the nature of the search and the nature of the algorithm itself. But the only way to address them in any complex system is to give the system the ability to correct its own errors. It must be able to correct them; it must be able to learn for itself when it's doing something wrong, and correct for it. And so it seemed to me that the way to correct delusions was indeed to have more iterations of reinforcement learning: no matter where you start, you should be able to correct those errors, until the system gets to play things out and understand: "Oh, I thought that I was going to win in this situation, but then I ended up losing. That suggests that I was misevaluating something; there's a hole in my knowledge." And now the system can correct itself and understand how to do better. Now, if you take that same idea and trace it back all the way to the beginning, it should be able to take you from no knowledge, from a completely random starting point, all the way to the highest levels of knowledge that you can achieve in a domain. And the principle is the same: if you bestow a system with the ability to correct its own errors, then it can take you from random to something slightly better than random, because it sees the stupid things that the random policy is doing and corrects them. Then it can take you from that slightly better system, understand what that is doing wrong, and take you on to the next level, and the next level, and this progress can go on indefinitely. Indeed, what would have happened if we'd carried on training AlphaGo Zero for longer? We saw no sign of its improvements slowing down; it was certainly continuing to improve, and presumably, if you had the computational resources, this could lead to better and better systems that discover more and more.

So your intuition
is fundamentally that there's not a ceiling to this process. And one of the surprising things, just like you said, is the process of patching errors. It intuitively makes sense that reinforcement learning should be part of that process, but what is surprising is that, in the process of patching your own lack of knowledge, you don't seem to open up other holes: there's a monotonic decrease in your weaknesses.

Well, let me back this up. I think science should always make falsifiable hypotheses, so let me back up this claim with a falsifiable hypothesis, which is that if someone were, in the future, to take AlphaZero as an algorithm and run it with greater computational resources than we have available today, then I predict that they would be able to beat the previous system 100 games to zero; and that if they were then to do the same thing a couple of years later, that system would beat the previous system 100 games to zero; and that that process would continue indefinitely, throughout at least my human lifetime.

Presumably the game of Go would set the ceiling?

I mean, the game of Go would set the ceiling, but the game of Go has 10^170 states in it, so the ceiling is unreachable by any computational device that can be built out of the roughly 10^80 atoms in the universe.
You asked a really good question, which is: do you not open up other errors when you correct your previous ones? And the answer is yes, you do. And so it's a remarkable fact about this class of two-player games, and also true of single-agent games, that if you have sufficient representational resources (imagine you could represent every state of the game in a big table), then this process of self-improvement will lead all the way, in the single-agent case, to the optimal possible behavior, and in the two-player case, to the minimax-optimal behavior: the best way that I can play, knowing that you're playing perfectly against me. So for those cases we know that even if you do open up some new error, in some sense you've made progress; you're progressing towards the best that can be done.
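The convergence target Silver refers to, the minimax-optimal value that a tabular self-improver heads towards, can be computed exactly for a game small enough to hold every state in a table. A sketch (my illustration, using the toy game of Nim rather than Go):

```python
from functools import lru_cache

# Illustration (mine, not from the interview): for a game small enough to
# tabulate, the minimax-optimal value can be computed directly by backward
# induction. Game: Nim, players alternately take 1-3 stones, and whoever
# takes the last stone wins.

@lru_cache(maxsize=None)
def minimax_value(stones):
    """Value for the player to move: +1 = forced win, -1 = forced loss."""
    if stones == 0:
        return -1  # the opponent took the last stone, so the mover has lost
    # The mover picks the move that is worst for the opponent.
    return max(-minimax_value(stones - a) for a in (1, 2, 3) if a <= stones)
```

Backward induction shows that exactly the positions with `stones % 4 == 0` are lost for the player to move. A self-improving agent with a full table can only move towards this fixed point, which is the sense in which opening up new errors still constitutes progress.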
So AlphaGo was initially trained on expert games, with some self-play; AlphaGo Zero removed the need to be trained on experts. And then another incredible step for me, because I just love chess, is to generalize that even further, in AlphaZero: to be able to play the game of Go, beating AlphaGo Zero and AlphaGo, and then also being able to play the game of chess, and others. So what was that step like? What are the interesting aspects there that were required to make that happen?
I think the remarkable observation we made with AlphaZero was that, without modifying the algorithm at all, it was able to play and crack some of AI's greatest previous challenges. In particular, we dropped it into the game of chess, and unlike previous systems like Deep Blue, which had been worked on for years and years, we were able to beat the world's strongest computer chess program convincingly, using a system that was fully discovered on its own, from scratch, with its own principles. And in fact, one of the nice things we found was that we also achieved the same result in Japanese chess, shogi, a variant of chess where you get to capture pieces and then place them back down on your own side as extra pieces, so a much more complicated variant of chess. We also beat the world's strongest programs and reached superhuman performance in that game too, and the very first time that we'd ever run the system on that particular game was the version that we published in the paper on AlphaZero. It just worked out of the box. Literally no touching it; we didn't have to do anything, and there it was: superhuman performance. No tweaking, no twiddling. And so I think there's something beautiful about that principle: that you can take an algorithm and, without twiddling anything, it just works.
Now, to go beyond AlphaZero, what's required? AlphaZero is just a step, and there's a long way to go beyond that to really crack the deep problems of AI. But one of the important steps is to acknowledge that the world is a really messy place. It's this rich, complex, beautiful, but messy environment that we live in, and no one gives us the rules. No one knows the rules of the world. At least, maybe we understand that it operates according to Newtonian or quantum mechanics at the micro level, or according to relativity at the macro level, but that's not a model that's useful for us as people to operate in. Somehow the agent needs to understand the world for itself, in a way where no one tells it the rules of the game, and yet it can still figure out what to do in that world: deal with this stream of observations coming in, rich sensory input coming in, actions going out, in a way that allows it to reason in the way that AlphaGo or AlphaZero can reason, in the way that these Go- and chess-playing programs can reason, but in a way that allows it to take actions in that messy world to achieve its goals.
And so this led us to the most recent step in the story of AlphaGo, which was a system called MuZero. MuZero is a system which learns for itself even when the rules are not given to it; it can actually be dropped into a system with messy perceptual inputs. We tried it on some Atari games, the canonical domains of Atari that have been used for reinforcement learning, and this system learned to build a model of those Atari games that was sufficiently rich and useful for it to be able to plan successfully. And in fact, that system not only went on to beat the state of the art in Atari, but the same system, without modification, was able to reach the same level of superhuman performance in Go, chess, and shogi that we'd seen in AlphaZero. That shows that even without the rules, the system can learn for itself, just by trial and error: you play this game of Go, and no one tells you what the rules are, but you just get to the end and someone says "win" or "loss"; you play this game of chess, and someone says "win" or "loss"; you play a game of Breakout in Atari, and someone just tells you your score at the end. And the system figures out for itself essentially the rules of the system, the dynamics of the world, how the world works. Not in any explicit way, but implicitly: enough understanding for it to be able to plan in that system in order to achieve its goals.

And that's the fundamental process it has to go through when facing any uncertain kind of environment, as you would in the real world: figuring out the basic rules of the game.

That's right.

So that allows it to be applicable to basically any domain that can be digitized in the way that it needs to be, in order for the reinforcement learning framework to be able to sense the environment, to be able to act in it, and so on.

The full reinforcement learning problem needs to deal with worlds that are unknown and complex, and the agent needs to learn for itself how to deal with that. And so MuZero was a further step in that direction.
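The three learned functions that make this possible in MuZero (a representation function that turns observations into a latent state, a dynamics function that imagines transitions, and a prediction function that supplies policy and value) can be sketched schematically. The stand-ins below are mine, not the published architecture; the real system uses deep networks plus Monte Carlo tree search rather than the one-step lookahead shown here.

```python
# Schematic of MuZero's three learned functions (toy stand-ins are mine;
# the real system uses deep networks plus Monte Carlo tree search):
#   h: representation  observation -> latent state
#   g: dynamics        (latent state, action) -> (next latent state, reward)
#   f: prediction      latent state -> (policy, value)

def plan_one_step(h, g, f, observation, actions):
    """Score each action by imagined reward plus the predicted value of the
    imagined next state: planning entirely inside the learned model, with
    no access to the environment's real rules."""
    state = h(observation)
    scores = {}
    for a in actions:
        next_state, reward = g(state, a)
        _policy, value = f(next_state)
        scores[a] = reward + value
    return max(scores, key=scores.get)

# Dummy "networks" for a counting game where larger totals are better.
h = lambda obs: obs                        # identity representation
g = lambda s, a: (s + a, float(a))         # imagined transition and reward
f = lambda s: ([0.5, 0.5], float(s))       # uniform policy, value = total

best = plan_one_step(h, g, f, observation=0, actions=[-1, 1])  # picks 1
```

The point of the interface is the one Silver makes: the agent never sees the real transition function, only its own learned stand-in `g`, yet that is enough to plan towards its goals.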