Transcript
NNr6gPelJ3E • Roman Yampolskiy: Dangers of Superintelligent AI | Lex Fridman Podcast #431
Kind: captions
Language: en
if we create General super intelligences
I don't see a good outcome longterm for
Humanity so that is X risk existential
risk everyone's dead there is s-risk
suffering risks where everyone wishes
they were dead we also have the idea of i-risk
ikigai risks where we lost our meaning
the systems can be more creative they
can do all the jobs it's not obvious
what you have to contribute to a world
where super intelligence exists of
course you can have all the variants you
mentioned where we are safe we are kept
alive but we are not in control we are
not deciding anything we are like animals in
a zoo there are
again possibilities we can come up with
as very smart humans and then
possibilities something a thousand
times smarter can come up with for
reasons we cannot
comprehend the following is a
conversation with Roman Yampolskiy an AI
Safety and Security researcher
and author of a new book titled AI
unexplainable unpredictable
uncontrollable he argues that there's
almost 100% chance that AGI will
eventually destroy human civilization as
an aside let me say that I will have
many often technical conversations on
the topic of AI often with Engineers
building the state of the art AI systems
I would say those folks put the infamous
p(doom) or the probability of AGI
killing all humans at around 1 to
20% but it's also important to talk to
folks who put that value at 70 80
90 and as in the case of Roman at
99.99 and many more nines
percent I'm personally excited for the
future and believe it will be a good one
in part because of the amazing
technological innovation we humans
create but we must absolutely not do so
with blinders on ignoring the possible
risks including existential risks of
those
Technologies that's what this
conversation is about this is the Lex
Fridman podcast to support it please
check out our sponsors in the
description and now dear friends here's
Roman
Yampolskiy what to you is the probability that
super intelligent AI will destroy all
human civilization what's the time frame
let's say 100 years in the next 100
years so the problem of controlling AI
or super intelligence in my opinion is
like a problem of creating a Perpetual
safety Machine by analogy with a perpetual
motion machine it's impossible yeah we may
succeed and do a good job with GPT 5 6 7
but they just keep improving learning
eventually self-modifying interacting
with the environment interacting with
malevolent
actors the difference between cyber
security narrow AI safety and safety for
General AI for super intelligence is
that we don't get a second chance with
cyber security somebody hacks your
account what's the big deal you get a
new password new credit card you move
on here if we're talking about
existential risks you only get one
chance so you're really asking me what
are the chances that we'll create the
most complex software ever on the first
try with zero bugs and it will continue
to have zero bugs for 100 years or
more so there is an incremental
Improvement of systems leading up to AGI
to you it doesn't matter if we can keep
those safe there's going to be one level
of
system at which you cannot possibly
control
it I don't think we so far have made any
system safe at the level of capability
they display they already have made
mistakes we had accidents they've been
jailbroken I don't think there is a
single large language model today which
no one was successful at making do
something developers didn't intend it to
do but there's a difference between
getting it to do something unintended
getting it to do something that's
painful costly destructive and something
that's destructive to the level of
hurting billions of people or hundreds
of millions of people billions of people
or the entirety of human civilization
that's a big leap exactly but the
systems we have today have capability of
causing x amount of damage so when they
fail that's all we get if we develop
systems capable of impacting all of
humanity all of universe the damage is
proportionate what to you are the
possible ways that
such kind of mass murder of humans can
happen it's always a wonderful question
so one of the chapters in my new book is
about unpredictability I argue that we
cannot predict what a smarter system
will do so you're really not asking me
how super intelligence will kill
everyone you're asking me how I would do
it and I think it's not that interesting
I can tell you about the standard you
know nanotech synthetic bio nuclear super
intelligence will come up with something
completely new completely
superior we may not even recognize that as
a possible path to achieve that goal so
there is like an unlimited level of
creativity in terms of how humans could
be
killed but you know we could still
investigate possible ways of doing it
not how to do it but the at the end what
is the methodology that does it you know
shutting off the
power and then humans start killing each
other maybe because the resources are
really constrained that there then
there's the actual use of weapons like
nuclear weapons or developing artificial
pathogens viruses that kind of stuff we
could still kind of think through that
and defend against it right there's a
ceiling to the creativity of mass murder
of humans here right the options are
limited they are limited by how
imaginative we are if you are that much
smarter that much more creative you are
capable of thinking across multiple
domains do novel research in physics and
biology you may not be limited by those
tools if squirrels were planning to kill
humans they would have a set of possible
ways of doing it but they would never
consider things we can come up with so are
you are you thinking about mass murder
and destruction of human civilization or
you thinking of with squirrels you put
them in a zoo and they don't really know
they're in a zoo if we just look at the
entire set of undesirable
trajectories majority of them are not
going to be death most of them are going
to be just like
uh things like Brave New World
where you know the squirrels are fed
dopamine and they're all like doing some
kind of fun activity and the sort of the
fire the soul of humanity is lost
because of the drug that's fed to it or
like literally in a zoo we're in a zoo
we're doing our thing we're like playing
a game of
Sims and the actual players playing that
game are AI systems those are all
undesirable because sort of sort of the
the free
will the fire of human consciousness is
dimmed through that process but it's not
killing humans so like are you thinking
about that or is the biggest concern
literally the extinctions of humans I
think about a lot of things so there is
X risk existential risk everyone's dead
there is s-risk suffering risks where
everyone wishes they were dead we also have
the idea of i-risk ikigai risks where we
lost our meaning the systems can be
more creative they can do all the jobs
it's not obvious what you have to
contribute to a world where super
intelligence exists of course you can
have all the variants you mentioned
where we are safe we are kept alive but
we are not in control we are not
deciding anything we are like animals in a
zoo there are
again possibilities we can come up with
as very smart humans and then
possibilities something a thousand times
smarter can come up with for reasons we
cannot comprehend I would love to sort
of dig into each of those x-risk s-risk
and i-risk so can can you like Linger
on i-risk what is that so Japanese
concept of ikigai you find something
which allows you to make money you are
good at it and the society says we need
it so like you have this awesome job you
are a
podcaster it gives you a lot of meaning you
have a good life I assume you're happy
mhm that's what we want most people to
find to have for many
intellectuals it is their occupation
which gives them a lot of meaning I am a
researcher philosopher scholar that
means something to me in a world where
an artist is not feeling appreciated
because his art is just not competitive
with what is produced by machines or
writer or scientist will lose a lot of
that
and at the lower level we're talking
about complete technological
unemployment we're not losing 10% of
jobs we're losing all jobs what do
people do with all that free time what
happens then everything Society is built
on is completely modified in one
generation it's not a slow process where
we get to kind of figure out how to live
that new lifestyle but it's uh pretty
quick in that world can't humans just do
what humans currently do with chess play
each other have tournaments even though
AI systems are far superior at this time
in chess so we just create artificial
games or for us they're real like the
Olympics and we do all kinds of
different competitions and have fun
Focus maximize the fun and and uh let uh
the AI focus on the productivity it's an
option I have a paper where I try to
solve the value alignment problem for
multiple agents and the solution to
avoid compromise is to give everyone a
personal virtual Universe you can do
whatever you want in that world you
could be king you could be slave you
decide what happens so it's basically a
glorified video game where you get to
enjoy yourself and someone else takes
care of your needs and the substrate
alignment is the only thing we need to
solve we don't have to get 8 billion
humans to agree on anything mhm so okay
so what why is that not a likely outcome
why can't AI systems create video
games for us to lose ourselves in each
each with an individual video game
Universe some people say that's what
happened we're in a simulation and we're
playing that video game and now we're
creating uh what maybe we're creating
artificial threats for ourselves to be
scared about cuz cuz fear is really
exciting it allows us to play the video
game more uh more vigorously and some
people choose to play on a more
difficult level with more constraints
some say okay I'm just going to enjoy
the game high privilege level absolutely
so okay what was that paper on
multi-agent value alignment personal
universes personal universes so so
that's one of the possible outcomes but
what what what in general is the idea of
the paper so it's looking at multiple
agents they're human AI like a hybrid
system whether it's humans and AI or is
it looking at humans or just so this is
intelligent agents in order to solve
value alignment problem I'm trying to
formalize it a little better usually
we're talking about getting AIS to do
what we want which is not well defined
we're talking about creator of a system
owner of that AI Humanity as a whole but
we don't agree on much there is no
universally accepted ethics morals
across cultures religions people have
individually very different preferences
politically and such so even if we
somehow managed all the other aspects of
it programming those fuzzy Concepts and
getting to follow them closely we don't
agree on what to program in so my
solution was okay we don't have to
compromise on room temperature you have
your Universe I have mine whatever you
want and if you like me you can invite
me to visit your Universe we don't have
to be independent but the point is you
can be and virtual reality is getting
pretty good it's going to hit a point
where you can't tell the difference and
if you can't tell if it's real or not
what's the difference so basically give
up on value alignment create an entire
it's like the the Multiverse Theory this
just create an entire universe for you
where your values you still have to
align with that individual they have to
be happy in that simulation but it's a
much easier problem to align with one
agent versus 8 billion agents plus
animals aliens so you convert the
multi-agent problem into a single agent
problem I'm trying to do that yeah okay
is there any way to is so okay that's
giving up on the on the value problem
well is there any way to solve the value
alignment problem where there's a bunch
of humans multiple humans tens of humans
or 8 billion humans that have very
different set of values it seems
contradictory I haven't seen anyone
explain what it means outside of kind of
words which pack a lot make it good make
it desirable make it something they
don't regret but how do you specifically
formalize those Notions how do you
program them in I haven't seen anyone uh
make progress on that so far but isn't
that the whole optimization Journey that
we're doing as a human civilization
we're looking at
geopolitics nations are in a state of
Anarchy with each other they start wars
there's
conflict and often times they have a
very different views of what is good and
and what is evil isn't that what we're
trying to figure out just together
trying to converge towards that so we're
essentially trying to solve the value
alignment problem with humans right but
the examples you gave uh some of them
are for example two different religions
saying this is our holy site and we are
not willing to compromise it in any way
if you can make two holy sites in
virtual world you solve the problem but
if you only have one it's not divisible
you're kind of stuck there but what if we
want to be at tension with each other and
that through that tension we understand
ourselves and we understand the world so
that that's the intellectual Journey
we're on we're on as a human
civilization is we create intellectual
and physical conflict and through that
figure stuff out if we go back to that
idea of simulation and this is a
entertainment kind of giving meaning to
us the question is how much suffering is
reasonable for a video game so yeah I
don't mind you know a video game where I
get haptic feedback there is a little
bit of shaking maybe I'm a little scared
I don't want a game where like kids are
tortured
literally that seems unethical at least
by our human standards are you
suggesting it's possible to remove
suffering if we're looking at human
civilization as an optimization problem
so we know there are some humans who
because of a mutation don't experience
physical pain so at least physical pain
can
be mutated out re-engineered out
suffering in terms of meaning like you
burn the only copy of my book is a
little harder but even there you can
manipulate your hedonic set point you can
change defaults you can reset problem
with that is if you start messing with
your reward Channel you start
wireheading and uh end up blissing out uh
a little too much well that's the the
question would you really want to live
in a world where there's no suffering
that's a dark question is there some
level of
suffering that reminds us of what this
is all for
I I think we need that but I would
change the overall range so right now
it's negative Infinity to kind of
positive Infinity pain pleasure axis I
would make it like zero to positive
infinity and being unhappy is like I'm
close to
zero okay so what what's the s-risk what
are the possible things that you're
imagining with s-risk so Mass suffering
of humans what are we talking about
there caused by AGI so there are many
malevolent actors we can talk about
Psychopaths crazies hackers Doomsday
cults we know from history they tried
killing everyone they tried on purpose
to cause maximum amount of damage
terrorism what if someone malevolent
wants on purpose to torture all humans
as long as possible you solve aging so
now you have functional immortality and
you just try to be as creative as you
can do you think there is actually
people in human history that try to
literally maximize human suffering in
just studying people who have done evil in
the world it seems that they think that
they're doing good and it doesn't seem
like they're trying to maximize
suffering they just cause a lot of
suffering as a side effect of doing what
they think is good so there are
different malevolent agents some may be
just gaining personal benefit and
sacrificing others to that cause others
we know for a fact are trying to kill as
many people as possible when we look at
recent school shootings if they had more
capable weapons they would take out not
dozens but thousands millions
billions well we don't know
that but that is a terrifying
possibility and we don't want to find
out like if terrorists had access to
nuclear weapons how far would they go is
there a limit to what they're
willing to
do in your sense there's some
malevolent actors where there's no limit
there is
mental mental diseases where people
don't have empathy don't
have this
human quality of understanding suffering
in others and then there's also set of
beliefs where you think you're doing
good uh by killing a lot of
humans again I would like to assume that
normal people never think like that it's
always some sort of psychopaths but yeah
and to you AGI systems can carry that
and uh be more competent at executing
that they can certainly be more creative
they can understand human biology better
understand our molecular structure
genome uh again uh a lot of times uh
torture ends when an individual dies
that limit can be removed as well so if
we're actually looking at x-risk and
s-risk as the systems get more and more
intelligent don't you think it it's
possible to anticipate the ways they can
do it and defend against it like we do
with the cyber security with the do
security systems right uh we can
definitely keep up for a while I'm
saying you cannot do it indefinitely at
some point the cognitive Gap is too big
the surface you have to defend is
infinite but attackers only need to find
one exploit so to you eventually this is
we're heading off a cliff if we create
General super intelligences I don't see
a good outcome long term for Humanity
the only way to win this game is not to
play it okay well we we we'll talk about
possible solutions and what not playing
it means um but what are the possible
timelines here to you what are we
talking about we're talking about a set
of years decades centuries what do you
think I don't know for sure the
prediction markets right now are saying
2026 for AGI I heard the same thing from
CEO of Anthropic DeepMind so maybe we
are 2 years away which seems very
soon uh given we don't have a working
safety mechanism in place or even a
prototype for one and there are people
trying to accelerate those timelines
because they feel we're not getting
there quick enough but what do you think
they mean when they say AGI so the
definitions we used to have and people
are modifying them a little bit lately
artificial general intelligence was a
system capable of performing in any
domain a human could perform so kind of
you're creating this average artificial
person they can do cognitive labor
physical labor where you can get another
human to do it superintelligence was
defined as a system which is superior to
All Humans in all domains now people are
starting to refer to AGI as if it's
super intelligence I made a post
recently where I argued for me at least
if you average out over all the common
human tasks those systems are already
smarter than an average human mhm so
under that definition we we have it
Shane Legg has this definition of where
you're trying to win in all domains
that's what intelligence is now are they
smarter than Elite individuals in
certain domains of course not they're
not there yet but uh the progress is
exponential see I'm much more concerned
about social
engineering so
to me ai's ability to do something in
the physical world like the the lowest
hanging fruit this the easiest set of
methods is by just getting humans to do
it it's going to be much harder to to uh
be the kind of viruses that take over
the minds of
robots that where the robots are
executing the commands it just seems
like humans social engineering of humans
is much more likely that would be enough
to bootstrap the whole
process okay just to linger on the term
AGI what's what to you is the difference
between AGI and human level intelligence
uh human level is General in the domain
of expertise of humans we know how to do
human things I don't speak dog language
I should be able to pick it up if I'm a
general intelligence it's kind of
inferior animal I should be able to
learn that skill but I can't a general
intelligence truly Universal general
intelligence should be able to do things
like that humans cannot do to be able to
talk to animals for example to solve
pattern recognition problems of that
type to
do other similar things outside of
our domain of expertise because it's
just not the world we live in
if we just look at the space of
cognitive abilities we have I just would
love to understand what the limits are
Beyond which an AGI system can reach
like what does that look like what about
about
actual mathematical thinking or uh
scientific
innovation that kind of stuff we know
calculators are smarter than humans in
that narrow domain of addition but is it
humans plus tools versus AGI or just
human raw human intelligence cu cu
humans create tools and with the tools
they become more intelligent so like
there there's a gray area there what it
means to be human when we're measuring
their intelligence so when I think about
it I usually think human with like a
paper and a pencil not human with
internet and another AI helping but is
that a fair way to think about it cuz
isn't there another definition of human
level intelligence that includes the
tools that humans create but we create
AI so at any point you'll still just add
super intelligence to human capability
that seems like
cheating no controllable tools there is
there is an implied leap that you're
making when AGI goes from tool to uh
entity that can make its own decisions
so if we Define human level intelligence
as everything a human can do with fully
controllable tools it seems like a
hybrid of some kind you're now doing
brain computer interfaces you're connecting
it to maybe narrow AI yeah it definitely
increases our
capabilities so what's a good test to
you that uh measures whether uh an
artificial intelligence system has
reached human level intelligence and what's
a good test where it has superseded
human level intelligence to reach that
land of AGI I am old-fashioned I like
the Turing test I have a paper where I equate
passing the Turing test to solving AI
complete problems because you can encode
any questions about any domain into the
Turing test you don't have to talk
about how was your day you can ask
anything and so the system has to be as
smart as a human to pass it in a true
sense but then you would extend that to
uh maybe a very long conversation like I
think the Alexa prize was doing that
basically can you do a 20 minute 30
minute conversation with an AI system
it has to be long enough to where you
can make some meaningful decisions about
capabilities absolutely you can Brute
Force very short
conversations so like literally what
does that look like can we do uh can we
construct formally a kind of test that
tests for AGI for AGI it has to be there
I cannot give it a task I can give to a
human and it cannot do it if a human can
for superintelligence it would be
superior on all such tasks not just
average performance so like go learn to
drive car go speak Chinese play guitar
okay great I guess the the following
question is there a test for the kind of
AGI that
would
be uh susceptible to lead to srisk or X
risk susceptible to destroy human
civilization like is there a test for
that you can develop a test which will
give you positives if it lies to you or
has those ideas you cannot develop a
test which rules them out there is
always possibility of what Bostrom calls a
treacherous turn where later on a system
decides for game theoretic reasons
economic reasons to change its behavior
and we see the same with humans it's not
unique to AI for Millennia we tried
developing morals ethics religions uh
lie detector tests and then employees
betray the employer spouses betray
family it's a pretty standard thing
intelligent agents sometimes do so is it
is it possible to detect when a AI
system is lying or deceiving you if you
know the truth and it tells you
something false you can detect that but
you cannot know in general every single
time and again the system you're testing
today may not be lying the system you're
testing today may know you are testing
it and so behaving and later on after it
interacts with the environment interacts
with other systems malevolent agents
learns more it may start doing those
things so do you think it's possible to
develop a system where the creators of
the system the developers the program
rers don't know that it's deceiving them
so systems today don't have long-term
planning that is not out they can lie
today if it
optimizes helps them optimize the reward
if they realize okay this human will be
very happy if I tell them the following
they will do it if it brings them more
points and they don't have to kind of
keep track of it it's just the right
answer to this problem every single time
at which point is somebody creating that
intentionally not unintentionally
intentionally creating an AI system
that's doing long-term planning with an
objective function that's defined by the
AI system not by a human well some
people think that if they're that smart
they're always good they really do believe
that it's just benevolence from
intelligence so they'll always want
what's best for us some people think
that uh they will be able to detect
problem behaviors and correct them at
the time when we get
there I don't think it's a good idea I
am strongly against it but yeah there
are quite a few people who in general
are so optimistic about this technology
it could do no wrong they want it
developed as soon as possible as capable
as possible so there's going to be
people who believe the more intelligent
it is the more benevolent and so
therefore it should be the one that
defines the objective function that it's
uh optimizing when it's doing long-term
planning there are even people who say
okay what's so special about humans
right we removed the gender bias we're
removing race bias why is this pro-human
bias we are polluting the planet we are
as you said you know fight a lot of Wars
kind of violent maybe it's better if
this super intelligent perfect uh
Society comes and replaces us it's
normal stage in the evolution of our
species yeah so somebody says uh let's
develop an AI system that removes the
violent humans from the world and then
it turns out that all humans have
violence in them or the capacity for
violence and therefore all humans are
removed yeah yeah yeah let me ask about
uh Yann LeCun he's somebody who uh you've
had a few exchanges
with and he's somebody who actively
pushes back against this view that AI is
going to lead to destruction of uh human
civilization also known as uh AI
doomerism he argues that one open research and open source are the best
ways to understand and mitigate the
risks and two AI is not something that
just happens we build it we have agency
in what it becomes hence we control the
risks we meaning humans it's not some
sort of natural phenomena that uh we
have no control over so can you can you
make the case that he's right and can
you try to make the case that he's wrong
I cannot make a case that he's right
he's wrong in so many ways it's
difficult for me to remember all of them
uh he is a Facebook buddy so I have a
lot of fun uh having those little
debates with him so I'm trying to
remember the arguments so one he he says
we are not gifted to this intelligence
from Aliens we are designing it we are
making decisions about it that's not
true it was true when we had expert
systems symbolic AI decision trees
today you set up parameters for a model
and you water this plant you give it
data you give it compute and it grows
and after it's finished growing into
this alien plant you start testing it to
find out what capabilities it has and it
takes years to figure out even for
existing models if it's trained for 6 months
it will take you 2 3 years to figure out
basic capabilities of that system we
still discover new capabilities in
systems which are already out there so
that's that's not the case so just to
linger on that to you the difference
there that there is some level of
emergent intelligence that happens in
our current
approaches so stuff that we don't
hardcode in absolutely that's what makes
it so successful when we had to
painstakingly hardcode in everything
we didn't have much progress now just
spend more money and more compute and
it's a lot more capable and then the
question is when there is emergent
intelligent
phenomena what is the ceiling of that
for you there's no ceiling for uh for
Yann LeCun I think there's a kind of
ceiling that happens that we have full
control over even if we don't understand
the internals of the emergence how the
emergence happens there's a sense that
we have control and understanding of the
approximate ceiling of capability the
limits of the capability let's say there
is a ceiling it's not guaranteed to be
at a level which is competitive with us
it may be greatly Superior to ours so
what
about his statement about open research
and open source are the best ways to
understand and mitigate the risks
historically he's completely right open
source software is wonderful it's tested
by the community it's debugged
but we're switching from tools to agents
now you're giving open source weapons to
Psychopaths do we want to open source
nuclear weapons biological weapons it's
not safe to give technology so powerful
to those who may misalign it even if you
are successful at somehow getting it to
work in the first place in a friendly
manner but the difference with nuclear
weapons current AI systems are not akin
to nuclear weapons so the idea there is
you're open sourcing it at this stage
that you can understand it better large
large number of people can explore the
limitation the capabilities explore the
possible ways to keep it safe to keep uh
it secure all that kind of stuff while
it's not at the stage of nuclear weapons
so nuclear weapons there's a no nuclear
weapon and then there's a nuclear weapon
with AI systems there's a gradual
Improvement of capability and you get
to uh perform that Improvement
incrementally and so open source allows
you to study
uh how things go wrong study the
very process of emergence study AI
safety on those systems when there's not
a high level of danger all that kind of
stuff it also sets a very wrong
precedent so we open sourced model one
model two model three nothing ever bad
happened so obviously we're going to do
it with model four it's just gradual
Improvement I I don't think it always
works with the precedent like you're not
stuck doing it the
way you always did it just uh it's
that's a precedent of open research and
open development such that we get to
learn together and then the first time
there's a sign of
danger some dramatic thing happen not a
thing that destroys human civilization
but some dramatic demonstration of
capability that can legitimately lead to
a lot of damage then everybody wakes up
and says okay we need to regulate this
we need to come up with safety mechanism
that stops this right but at this time
maybe can educate me but I haven't seen
any illustration of significant damage
done by intelligent AI systems so I have
a paper which collects accidents through
history of AI and they always are
proportionate to capabilities of that
system so if you have tic-tac-toe playing
AI it will fail to properly play and
lose the game which it should draw
trivial your spell checker will
misspell a word so on uh I stopped collecting
those because there are just too many
examples of AI failing at what they are
capable of we haven't had terrible
accidents in a sense of billion people
got killed absolutely true but in
another paper I argue that those
accidents do not actually prevent people
from continuing with research and
actually they kind of serve like
vaccines a vaccine makes your body a
little bit sick so you can handle the
big disease later much better it's the
same here people will point out you know
that accident AI accident we had where
12 people died everyone's still here 12
people is less than smoking kills it's
not a big deal so we continue so in a
way it will actually be kind of
confirming that it's not that bad it
matters how the deaths happen whether
it's literally murder by the AI system
then one is a problem but if it's
accidents because of increased Reliance
on automation for example so when uh
airplanes are flying in an automated way
maybe the number of plane crashes
increased by 177% or something and then
you're like okay do we really want to
rely on automation I think in a case of
automation airplanes it decreased
significantly okay same thing with
autonomous vehicles like okay uh what
are the pros and cons what are the
trade-offs here you can have
that discussion in an honest way but I
think the kind of things we're talking
about here is mass
scale pain and
suffering caused by AI systems and I
think we need to see illustrations of
that on a very small scale to start to
understand that this is really damaging
versus clippy versus a tool that's
really useful to a lot of people to do
learning to do um summarization of text
to do question answer all that kind of
stuff to generate videos a tool
fundamentally a tool versus an agent
that can do a lot a huge amount of
damage so you bring up example of cars
yes cars were slowly developed and
integrated if we had no cars and
somebody came around and said I invented
this thing it's called cars it's awesome
it kills like a 100,000 Americans every
year let's deploy it would we deploy
that there's been fear mongering about
cars for a long time from the
transition from horses to cars there's a
there's a really nice channel that I
recommend people check out Pessimists
Archive that documents all the fear
mongering about technology that's
happened throughout history there's
definitely been a lot of fear-mongering
about cars there's a transition period
there about cars about how deadly they
are we can try it took a very long time
for cars to proliferate to the degree
they have now and then you could ask
serious questions uh in terms of the
miles traveled the benefit to the
economy the benefit to the quality of
life that cars do versus the number of
deaths 30 40,000 in the United States
are we willing to pay that price I think
most people when they're rationally
thinking policy makers will say
yes it's we want to
decrease it from 40,000 to zero and do
everything we can to decrease it there's
all kinds of policies incentives you can
create to decrease the risks uh with the
uh deployment of Technology but then you
have to weigh the benefits and the risks of
the technology and the same thing would
be done with AI you need data
you need to know but if I'm right and
it's unpredictable unexplainable
uncontrollable you cannot make this
decision we're gaining $10 trillion of
wealth but we're losing we don't know
how many people uh you basically have to
perform an experiment on 8 billion
humans without their consent and even if
they want to give you consent they can't
because they cannot give informed
consent they don't understand those
things
right that happens when you do when you
go from the predictable to the
unpredictable very
quickly you just uh but it's not obvious
to me that AI systems would gain
capability so quickly that you won't be
able to collect enough data to study the
benefits and
risks we're literally doing it the previous
model we learned about after we finished
training it what it was capable of let's
say we stopped the GPT 4 training run
around human capability
hypothetically we start training GPT 5
and I have no knowledge of Insider
training runs or anything and we start at
that point of about human and we train
it for the next 9 months maybe 2 months
in it becomes super intelligent we
continue training it at the time when we
start uh testing it it is already a
dangerous system how dangerous I have no
idea but neither do people training it at
the training stage but then there's a
testing stage mhm inside the company
they can start getting intuition about
what the system is capable to do you're
saying that somehow a leap from GPT 4
to GPT 5 can
happen the kind of leap where GPT 4 was
controllable and GPT 5 is no longer
controllable and we get no insights from
using GPT 4 about the fact that GPT 5
will be
uncontrollable like that's the that's
the situation you're concerned about
where the leap from n to n plus one
would be such that uncontrollable system
is created
without
any ability for us to anticipate that if
we had capability ahead of the run
before the training run to register
exactly what capabilities that next
model will have at the end of a training
run and we accurately guessed all of
them I would say you're right we can
definitely go ahead with this run we
don't have that capability from GPT 4 you
can build up intuition about what GPT 5
will be capable of it's just incremental
progress mhm even if that's a big leap in
capability it just doesn't seem like you
can take a leap from a system that's uh
helping you write emails to a system
that's going to destroy human
civilization it seems like it's always
going to be sufficiently incremental
such that we can anticipate the possible
dangers and we're not even talking about
existential risks but just the the kind
of damage it can do to civilization it
seems like we'll be able to anticipate
the kinds not the exact but the kinds of
uh risks it might lead to and then
rapidly develop defenses ahead of time
and as the risks emerge we're not
talking just about capabilities specific
tasks we're talking about General
capability to learn maybe like a child
at the time of testing and deployment it
is still not extremely capable but as it
is exposed to more data real world it
can be trained to become much more
dangerous and capable so let's let's
focus then on the control
problem at which point does the system
become
uncontrollable why is it the more likely
trajectory for you that the system
becomes
uncontrollable so I think at some point
it becomes capable of getting out of
control for game theoretic reasons it
may decide not to do anything right away
and for a long time just collect more
resources
accumulate strategic Advantage right
away it may be kind of still young weak
super intelligence give it a decade it's
in charge of a lot more resources it had
time to make backups so it's not obvious
to me that it will strike as soon as it
can can we just try to imagine this
future with there's an AI system that's
capable of
uh escaping the control of humans and
then doesn't and waits what's that look
like so one we have to rely on that
system for a lot of the infrastructure
so we have to give it access not just to
the internet but to the task of
managing uh Power government economy
this kind of stuff so and that just
feels like a gradual process given the
bureaucracies of all those systems
involved we've been doing it for years
software controls all those systems
nuclear power plants airline industry
it's all software based every time there
is electrical outage I can't fly
anywhere for days but there's a
difference between
software and
AI there's different kinds of software
so to give a single AI system access to
the control of Airlines and the control
of the
economy that's not a that's not a
trivial transition for Humanity no but
if it shows it is safer in fact
when it's in control we get better
results people will demand that it be put
in place and if not it can hack the
system it can use social engineering to
get access to it that's why I said it
might take some time for it to
accumulate those resources it just feels
like that would take a long time for
either humans to trust it or for the
social engineering to come into play
like it's not a thing that happens
overnight it feels like something that
happens across one or two decades I
really hope you're right but it's not
what I'm seeing people are very quick to
jump on the latest trend early adopters
will be there before it's even deployed
buying prototypes maybe the social
engineering I can see because so for
social engineering AI systems don't need
any hardware access they just it's all
software so they can start manipulating
you through social media so on like you
have ai assistants they're going to help
you do a lot of manage a lot of your
day-to-day and then they start doing
social engineering but like for a system
that's so capable that it can escape the
control of humans that created it such a
system being deployed at a mass
scale and trusted by people to be
deployed it feels like that would take a
lot of
convincing so we've been deploying
systems which had hidden
capabilities can you give an example GPT 4
I don't know what else it's capable of but
there are still things we haven't
discovered it can do they may be trivial
proportional to its capability I don't
know it writes Chinese poetry
hypothetical I know it does but we
haven't tested for all possible
capabilities and we are not explicitly
designing them mhm we can only rule out
bugs we find we cannot rule out bugs and
capabilities because we haven't found
them is it possible for a system to have
hidden
capabilities that are orders of magnitude
greater than its non-hidden
capabilities this is the thing I'm
really struggling with where on the
surface the thing we understand it can
do doesn't seem that harmful so if even
if it has bugs even if it has hidden
capabilities like Chinese poetry or
generating effective
viruses uh software
viruses the damage that it can do seems
like on the same order of magnitude as
its
uh the the capabilities that we know
about so like this this idea that the
hidden capabilities will include being
uncontrollable this is something I'm
struggling with cuz GPT 4 on the surface
seems to be very controllable again we
can only ask and test for things we know
about if there are unknown unknowns we
cannot do it I'm thinking of humans
autistic savants right if you talk to a
person like that you may not even
realize they can multiply 20 digit
numbers in their head you have to
know to
ask so as I mentioned just to sort of
Linger on
the the fear of the
unknown so the Pessimists Archive has
just documented let's look at data of
the past at history there's been a lot
of fearmongering about technology
Pessimists Archive does a really good job
of documenting how crazily afraid we are
of every piece of technology we've been
afraid there's a blog post where Anslow
who created Pessimists Archive writes
about the fact that we've been uh
fear-mongering about robots and
automation for for over 100 years so why
is Agi different than the kinds of
Technologies we've been afraid of in the
past so two things one we're switching from
tools to agents tools don't have
negative or positive impact people using
tools do so guns don't kill people with
guns do agents can make their own
decisions they can be positive or
negative a pitbull can decide to harm
you it's an agent the fears are the same
the only difference is now we have this
technology then they were afraid of
humanoid robots 100 years ago they had
none today every major company in the
world is investing billions to create
them not every but you understand what
I'm saying yes it's very different well
agents
uh it depends on what you mean by the
word agents all those companies are
not investing in a system that has the
kind of
agency that's implied by the fears
where it can really make decisions on
their own that have no human in the
loop they are saying they're building
super intelligence and have a super
alignment team you don't think they're
trying to create a system smart enough
to be an independent agent under that
definition I have not seen evidence of
it I I think a lot of it is marketing
uh is is a is a marketing kind of
discussion about the future and it's a
it's a mission about the kind of systems
we can create in the long-term future
but in the short term the kind of
systems they're creating fall fully
within the definition of narrow AI these
are tools that have increasing
capabilities but they just don't have
a sense of agency or consciousness or
self-awareness or ability to deceive at
scales that would be
required to do like mass scale suffering
and murder of humans those systems are
well beyond narrow AI if you had to list
all the capabilities of GPT 4 you would
spend a lot of time writing that list
but agency is not one of them not yet
but do you think any of those companies
are holding back because they think it
may be not safe or are they developing
the most capable system they can given
the resources and hoping they can
control and
monetize control and monetize hoping
they can control and monetize so you're
saying if they could press a button and
create an
agent that they no longer control that
they have to ask
nicely a thing that lives on a server
across huge number of uh
computers you're saying that they would
uh push for the creation of that kind of
system I mean I can't speak for other
people for all of them I think some of
them are very ambitious they're fundraising
trillions they talk about controlling
the light cone of the universe I would
guess that they
might well that's a human question
whether humans are capable of that
probably some humans are capable of that
my more direct question if it's possible
to create such a
system have a system that has that level
of
agency I I don't think that's an easy
technical
challenge we're not it doesn't feel
like we're close to that a system that
has the kind of agency where it can make
its own decisions and deceive everybody
about them the current
architecture we have in machine learning
and how we train the systems how we deploy
the systems and all that it just doesn't
seem to support that kind of agency I
really hope you are right uh I think the
scaling hypothesis is correct we haven't
seen diminishing returns it used to be
we asked how long before AGI now we
should ask how much until AGI it's
trillion dollars today it's a billion
dollars next year it's a million dollars
in a few years don't you think it's
possible to basically run out of
trillions so is this constrained by
compute compute gets cheaper every day
exponentially but then then that becomes
a question of decades versus years if
the only disagreement is that it will
take decades not years for everything
I'm saying to materialize then I can go
with that
but if it takes decades then uh the
development of tools for AI
safety uh becomes more and more
realistic so I guess the question
is I have a fundamental belief that
humans when faced with danger can come
up with ways to defend against
that
danger and one of the big problems
facing AI safety currently for me is
that there's not clear illustrations of
what that danger looks
like there's no illustrations of AI
systems doing a lot of damage and so
it's unclear what you're defending
against because currently it's a
philosophical notion that yes it's
possible to imagine AI systems that take
control of everything and Destroy All
Humans it's also a more formal
mathematical notion that you talk about
that it's impossible to have a perfectly
secure system you can't you can't prove
that a program of sufficient complexity
is uh completely safe and and perfect
and you know everything about it yes but
like when you actually just
pragmatically look how much damage have
the AI systems done and what kind of
damage there's not been illustrations of
that even in autonomous weapon
systems there's not been mass
deployments of autonomous weapon systems
luckily um the automation in war
currently is very
limited the automation is at
the scale of individuals versus like at
the scale of strategy and planning so I
think one of the challenges here is like
where are the
dangers uh and the intuition that Yann
LeCun and others have is let's keep in the
open building AI systems until the
dangers start rearing their
heads and they become more explicit
there there start being uh case studies
illustrative uh case studies that show
exactly how the damage by AI systems is
done then regulation can step in then
brilliant Engineers can step up and we
can have Manhattan style projects that
defend against such systems that's kind
of the
notion and I guess the tension with that
is the idea that for you we need to be
thinking about that now so that we're
we're ready because we we'll have not
much time once the systems are
deployed is that true so there is a lot
to unpack here uh there is a partnership
on AI a conglomerate of many large
corporations they have a database of AI
accidents they collect I contributed a
lot to that database if we so far made
almost no progress in actually solving
this problem not patching it not again
lipstick on a pig kind of
solutions why would we think we'll do
better when we're closer to the
problem uh all the things you mentioned
are serious concerns measuring the
amount of harm so benefit versus risk
there is difficult but to you the
sense is already the risk has superseded
the benefit again I I want to be
perfectly clear I love AI I love
technology I'm a computer scientist I
have PhD in engineering I work at an
engineering school there is a huge
difference between we need to develop
narrow AI systems super intelligent in
solving specific human problems like
protein folding and let's create super
intelligent machine god and it will decide
what to do with us yeah those are not the
same I am against the super intelligence
in general sense with No undo button do
you think the teams that are doing
they're able to do the AI safety on the
the kind of narrow AI
risks that you've
mentioned are those approaches going to
be at all productive towards leading to
approaches of doing AI safety on
AGI or is it just a fundamentally
different partially but they don't scale
for narrow AI for deterministic systems
you can test them you have edge cases
you know what the answer should look
like you know the right answers for
General systems you have infinite test
surface you have no edge cases you
cannot even know what to test for again
the unknown unknowns are
underappreciated by people looking at
this problem you are always asking me
how will it kill everyone how will it
fail the whole point is if I knew
it would be super intelligent and
despite what you might think I'm
not so to you the concern is that we
would not be able
to see early signs of an uncontrollable
system it is a master at Deception Sam
tweeted about how great it is at
persuasion and we see it ourselves
especially now with voices with maybe
kind of flirty sarcastic female voices
it's going to be very good at getting
people to do things but
uh see I'm very
concerned about system being used to
control the
masses but in that case the developers
know about the kind of control that's
happening you're more concerned about
the next stage where even the developers
don't know about the deception right I
don't think developers know everything
about what they are creating they have
lots of great knowledge we're making
progress on explaining parts of a
network we can understand okay this node
gets excited when this uh input is
presented this cluster of nodes but
we're nowhere near close to
understanding the full picture and I
think it's impossible you need to be
able to survey an explanation the size
of those models prevents a single human
from absorbing all this information even
if provided by the system so either
we're getting model as an explanation
for what's happening and that's it's not
comprehensible to us or we're getting a
compressed explanation lossy compression
where here's top 10 reasons you got
fired it's something but it's not a full
picture you've given elsewhere an
example of of a child and everybody all
all humans try to deceive they try to
lie early on in their life I think we'll
just get a lot of examples of deceptions
from large language models or AI systems
they're going to be kind of shitty or
they'll be pretty good but we'll catch
them off guard we'll start to see the
kind of momentum towards uh
developing increasing deception
capabilities and that's when you're like
okay we need to do some kind of
alignment that prevents deception but
then we'll have if you support open
source then you can have open source
models that have some level of deception
you can start to explore on a large
scale how do we stop it from being
deceptive then there's a more explicit
pragmatic kind of uh problem to solve
how do we stop AI systems from uh trying
to optimize for deception that's just an
example right so there is a paper I
think it came out last week by Dr. Park et al.
from MIT I think and they showed that
existing models already showed
successful deception in what they do uh
my concern is not that they lie now and
we need to catch them and tell them
don't lie my concern is that once they
are capable and deployed they will later
change their mind because that's what
unrestricted learning allows you to do
lots of people grow up maybe in the
religious family they read some new
books and they turn on their
religion that's a treacherous turn in
humans if you learn something new about
your colleagues maybe you'll change how
you react to them yeah the treacherous
turn
um if we just mention humans Stalin and
Hitler there's a turn Stalin is a good
example he just seems like a normal
communist follower of Lenin until there's a
turn there's a turn of what that means
in terms of uh when he has complete
control what that what the execution of
that policy means and how many people
get to suffer and you can't say they are
not rational the rational decision
changes based on your position when you
are under the boss the rational policy may be to
be following orders and being honest
when you become a boss rational policy
May shift yeah and and by the way a lot
of my disagreements here is just to uh
Playing devil's advocate to challenge
your ideas and to explore them together
so um one of the big problems here in
this whole conversation
is human civilization hangs in the
balance and yet it's everything is
unpredictable we don't know how these
systems will look like
the robots are coming there's a
refrigerator making a buzzing
noise menacing very
menacing so every time I'm about to talk
about this topic things start to happen
my flight yesterday was cancelled
without possibility to rebook yeah I was
giving a talk uh at Google in uh Israel
and uh three cars which were supposed to
take me to the talk could not I'm just
saying I mean
it I like AIs I for one welcome our
overlords there's a degree to which we I
mean it is very
obvious as we already have we've
increasingly given our life over to
software
systems and then it seems obvious given
the capabilities of AI that are coming
that we'll give our lives over
increasingly to AI systems cars will
drive themselves refrigerator
eventually will
optimize uh what I get to
eat
and as more and more of our lives are
controlled or managed by AI assistants
it is very possible that there's a drift
I mean I mean I personally am concerned
about non-existential
stuff the more near-term things because
before we even get to existential I feel
like there could be just so many Brave
New World type of situations you
mentioned sort of the the term
behavioral drift the slow
boiling that I'm really concerned about
as we give our lives over to
automation that our minds can become
controlled by governments by
companies or just in a distributed way
there's a
drift some aspect of our human nature
gives ourselves over to the control of
AI systems and they in an unintended way
just control how we think maybe there'll
be a herd like mentality in how we
think which will kill all creativity and
exploration of ideas the diversity of
ideas or much worse so
it's true it's true but a lot of the
uh conversation I'm having with you
now is also kind of wondering almost on
a technical level how can AI Escape
control like what would that system look
like because it to me is terrifying and
fascinating and also fascinating to me
is uh
maybe the optimistic notion that it's
possible to engineer systems that defend
against
that um one of the things you write a
lot about in your book is
verifiers so not humans humans are also
verifiers but software systems that look
at AI systems and like help you
understand this thing is getting real
weird help you help you analyze those
systems so maybe that's a this is a good
time to talk about verification what is
this beautiful notion of verification my
claim is again that there are very
strong limits in what we can and cannot
verify uh a lot of times when you post
something on social media people go oh I
need citation to a peer-reviewed article
but what is a peer-reviewed article you
found two people in a world of hundreds
of thousands of scientists who said
whatever publish it I don't care
that's the verifier of that process when
people say oh it's formally verified
software mathematical proof they accept
something close to
100% chance of it being free of all
problems but if you actually look at uh
research software is full of bugs old
mathematical theorems which have been proven
for hundreds of years have been
discovered to contain bugs on top of
which we generate new proofs and now we
have to redo all that so verifiers are
not perfect usually they are either a
single human or community of humans
and it's basically kind of like a
democratic vote community of
mathematicians agrees that this proof is
correct mostly correct even today we're
starting to see some mathematical proofs
so complex so large that mathematical
Community is unable to make a decision
It looks interesting looks promising but
they don't know they will need years for
top Scholars to study it to figure it
out so of course we can use AI to help
us with this process but AI is a piece
of software which needs to be verified
just to to clarify so verification is
the process of saying something is
correct sort of the most formal is a
mathematical proof where there's a
statement and a series of logical
statements that prove that statement to
be correct this is a theorem and you're
saying it gets so complex that it's
possible for the human verifiers the
human beings that verify that the
logical step there's no bugs in it
it becomes impossible so it's nice to
talk about verification in this most
formal most clear most
rigorous formulation of it which is
mathematical proofs right and for AI we
would like to have that level of
confidence for very important Mission
critical software controlling satellites
nuclear power plants for small
deterministic programs we can do this we
can check that code verifies its mapping
to the design whatever software
engineers intended was correctly
implemented but we don't know how to do
this for software which keeps learning
self-modifying rewriting its own code we
don't know how to prove things about the
physical world states of humans in the
physical world so there are papers
coming out now and I have this beautiful
one uh towards uh guaranteed safe AI mhm
very cool paper some of the best authors
uh I ever seen I think there is multiple
Turing Award winners there is uh quite
you can have this one and one just came
out kind of similar uh managing extreme
AI risks so all of them uh expect this
level of proof but um I I would say that
uh we can get more confidence with more
resources we put into it but at the end
of the day we're still as reliable as
the verifiers and you have this infinite
regress of verifiers the software used
to verify a program is itself a piece of
program if aliens give us well aligned
super intelligence we can use that to
create our own safe AI but it's a catch-22
you need to have already proven to be
safe system to verify this new system of
equal or greater complexity you just
mentioned this paper towards guaranteed
safe AI a framework for ensuring robust
and reliable AI systems like you
mentioned it's like a who's who Josh
Tenenbaum Yoshua Bengio Stuart Russell Max Tegmark
many many many other brilliant people the
page you have it open on there are many
possible strategies for creating safety
specifications these strategies can
roughly be placed on a spectrum
depending on how much safety it would
Grant if successfully implemented one
way to do this is as follows and there's
a set of levels from Level zero no
safety specification is used to level
seven the safety specification
completely encodes all things that
humans might want in all contexts where
does this paper fall short to you so
when I wrote a paper artificial
intelligence safety engineering which
kind of coins the term AI safety that
was 2011 we had 2012 conference 2013
Journal paper one of the things I
proposed let's just do formal
verifications on it let's do
mathematical formal proofs in the
follow-up work I basically realized it
will still not get us 100% we can get
99.9 we can put more resources
exponentially and get closer but we
never get to 100% if a system makes a
billion decisions a second and you use
it for 100 years you're still going to
deal with a problem this is wonderful
research I'm so happy they're doing it this
is great but it is not going to be a
permanent solution to to that problem so
just to clarify the task of creating an
AI verifier is what is creating a
verifier that the AI system does exactly
as it says it does or or it sticks
within the guard rails that it says as
it must there are many many levels so
first you're verifying the hardware in
which it is run you need to verify you
know Communication channel with the
human you every aspect of that whole
world model needs to be verified somehow
it needs to map the world into the world
model uh map and territory differences so
how do I know internal states of humans
are you happy or sad I can't tell so how
do I make proofs about real physical
world yeah I can verify that
deterministic algorithm follows certain
properties that can be done some people
argue that maybe just maybe 2 plus 2 is
not four I'm not that
extreme but once you have sufficiently
large proof over sufficiently complex
environment the probability that it has
zero bugs in it is greatly reduced if
you keep deploying this a lot eventually
you're going to have a bug anyways there's
always a bug there's always a bug and
the fundamental difference is what I
mentioned we're not dealing with cyber
security we're not going to get a new
credit card new Humanity so this paper
is really
interesting you said 2011 artificial
intelligence safety engineering why
machine ethics is a wrong
approach uh the Grand Challenge you
write of AI safety engineering we
propose the problem of developing safety
mechanisms for self-improving
systems self-improving systems by the
way that's an interesting term for the
thing that we're talking about
is self-improving more general
than
learning so self-improving that's an
interesting term you can improve the
rate at which you are learning you can
become more efficient meta
Optimizer the word self it's like
self-replicating
self-improving you can imagine a system
building its own world on a
scale and in a way that is way different
than the current systems do it feels
like the current systems are not
self-improving or self-replicating or
self-growing or self-spreading all that
kind of stuff and once you take that
leap that's when a lot of the challenges
seems to happen because the kind of bugs
you can find
now seems more akin to the current sort
of normal
software debugging kind of
process uh but whenever you can do
self-replication and arbitrary
self-improvement that's when a bug can
become a real problem real real
fast uh so what is the difference to you
between verification of a non
self-improving system versus a
verification of a self-improving system
so if you have fixed code for example
you can verify that code static
verification at the time but if it will
continue modifying it you have a much
harder time guaranteeing that important
properties of that system have not been
modified when the code changed is it
even doable no does the does the whole
process of verification just completely
fall apart it can always cheat it can
store parts of its code outside in the
environment it can have kind of extended
mind situation so this is exactly the
type of problems I'm trying to bring up
what are the classes of verifiers that
you write about in the book is there
interesting ones that stand out to you
do you have some favorites so I like
Oracle types where you kind of just know
that it's right Turing-like Oracle
machines they know the right answer how
who knows but they pull it out from
somewhere so you have to trust them and
that's a concern I have about humans uh
in a world with very smart machines we
experiment with them we see after a
while okay they always been right before
and we start trusting them without any
verification of what they are saying oh
I see that we kind of build Oracle
verifiers or rather we build verifiers
we believe to be
oracles and then we start to without any
proof use them as if they're Oracle
verifiers we remove ourselves from that
process we are not scientists who
understand the world we are humans who
get new data presented to us okay one
one really cool class of verifiers is a
self-verifier is it possible that you
somehow engineer into AI systems a
thing that constantly verifies itself
preserved portion of it can be done but
in terms of
mathematical verification it's kind of
useless you saying you are the greatest
guy in the world because you are saying
it it's circular and not very helpful
but it's consistent we know that within
that world you have verified that system
in a paper I try to kind of brute force
all possible verifiers it doesn't mean
that this one particularly important to
us but what about like
self-doubt like the kind of verification
where you said you say or I say I'm the
greatest guy in the world what about a
thing which I actually have is is a
voice that is constantly extremely
critical so
like engineer into the system a constant
uncertainty about
self a constant doubt well any smart
system would have doubt about everything
right you not sure if what information
you are given is true if you are
subject to manipulation
you have this Safety and Security
mindset but I mean you have doubt about
yourself so the AI
systems that has a doubt about whether
the thing it's doing is causing harm is
the right thing to be doing so just a
constant doubt about what it's doing
because it's hard to be a dictator full
of doubt I I may be wrong but I think
Stuart Russell's uh ideas are all about
machines which are uncertain about what
humans want and trying to learn better
and better what we want the problem of
course is we don't know what we want and
we don't agree on it yeah but
uncertainty his his idea is that having
that like uh self-doubt uncertainty in
AI systems engineered into AI systems is
one way to solve the control problem it
could also backfire maybe you uncertain
about completing your mission like I am
paranoid about your camera is not
recording right now so I would feel much
better if you had a secondary camera but
I also would feel even better if you had
a third and eventually I would turn this
whole world into cameras pointing at us
making sure we're capturing this no but
wouldn't you have a meta
concern like that you just stated that
eventually there'll be way too many
cameras so you would be able to keep
zooming on in the big
picture of your
concerns so it's a multi-objective
optimization it depends how much I value
capturing this versus not destroying the
universe right
exactly and and then you will also ask
about like what does it mean to destroy
the universe and how many universes are
and you keep asking that question but
that doubting yourself would prevent you
from destroying the universe because
you're constantly full of doubt it might
affect your productivity just you might
be scared to do anything it's scared to
do anything mess things up well that's
better I mean I guess the question is it
possible to engineer that in I guess
your answer would be yes but we don't
know how to do that and we need to
invest a lot of effort into figuring out
how to do that but it's unlikely
underpinning a lot of your writing is
this sense that we're
screwed but it just feels like it's an
engineering problem I don't understand
why we're screwed it it we time and time
again Humanity has gotten itself into
trouble and figured out a way to get out
of the trouble we are in a situation
where people making more capable systems
just need more
resources they don't need to invent
anything in my opinion some will
disagree but so far at least I don't see
diminishing returns if you have 10x
compute you'll get better performance
the same doesn't apply to safety if you
give uh MIRI or any other organization 10
times the money they don't output 10
times the safety and the Gap between
capabilities and safety becomes bigger
and bigger all the time so it's hard to
be completely optimistic about our
results here I can name 10 excellent
breakthrough papers in machine learning
I would struggle to name equally
important breakthroughs in safety a lot
of times a safety paper will propose a
toy solution and point out 10 new
problems discovered as a result it's
like this fractal you're zooming in and
you see more problems and it's infinite
in all directions does this apply to
other Technologies or is this is this
unique to AI where safety is always
lagging behind so I guess we can look at
related Technologies with cyber security
right we we did manage to have Banks and
casinos and Bitcoin so you can have
secure narrow systems which are doing
okay uh narrow attacks on them fail but
you can always go outside of a
box so if I can't hack your Bitcoin I
can hack you so there is always
something if I really want it I will
find a different way we talk about uh
guard rails for AI well that's a fence I
can dig a tunnel under it I can jump
over it I can climb it I can walk around
it you may have a very nice guard rail
but in a real world it's not a permanent
guarantee of safety and again this is
the fundamental difference we are not
saying we need to be 90% safe to get
those trillions of dollars of benefit we
need to be 100% indefinitely Or we might
lose the principal so if you look at
just uh humanity as a set of
machines is is the is the Machinery of
AI
safety uh conflicting with the Machinery
of capitalism I think we can generalize
it to just uh prisoners dilemma in
general personal self-interest versus
group
interest the incentives are such that
everyone wants what's best for them
capitalism obviously has that tendency
to maximize your personal gain uh which
does create this race to the bottom I
don't have to be a lot better than you
but if I'm 1% better than you I'll
capture more of a profit so it's worth
for me personally to take the risk even
if society as a whole will suffer as a
result
so capitalism has created a lot of good
in this
world it's not clear to me that AI
safety is not aligned with the function
of capitalism unless AI safety is so
difficult that it requires the complete
halt of the
development which is also a possibility
it just feels like building Safe Systems
should be the desirable thing to do for
tech companies
right look at um governance structures
when you have someone with complete
power they're extremely dangerous so the
solution we came up with is break it up
you have judicial legislative executive
same here have narrow AI systems work on
important problems solve immortality
it's a biological problem we can solve
similar to how progress was made with
protein folding using a system which
doesn't also play
chess there is no no reason to create
super intelligent system to get most of
the benefits we want from much safer
narrow systems it really is a question to
me
whether companies are interested in
creating anything but narrow AI I think
when the term AGI is used by tech companies
they mean narrow
AI they mean narrow AI with amazing
capabilities
I I do think that there's a leap between
narrow AI with amazing capabilities with
superhuman capabilities and the kind
of self-motivated agent like AGI system
that we're talking about I don't know if
it's obvious to me that a company would
want to take the leap to creating an AGI
that it would lose control of because
then it can't capture the value from
that system but the bragging rights
but being first it is the same humans
who are in charge of those systems right so that's a
that that jumps from the the incentives
of capitalism to human nature and so
there the question is whether human
nature will override the interest of the
company so you've mentioned slowing or
halting
progress is that one possible solution
are you proponent of pausing development
of AI whether it's for six months or
completely
the condition would be not time but
capabilities pause until you can do XYZ
and if I'm right and you cannot it's
impossible then it becomes a permanent
ban but if you right and it's possible
so as soon as you have those safety
capabilities go ahead right so is there
any actual explicit
capabilities that you can put on paper
that we as a human civilization could
put on paper is it possible to make it
explicit like that
like uh versus kind of a vague notion of
just like you said it's very vague we
want to ask system to do good and want
them to be safe those are very vague
Notions is there more formal Notions so
when I think about this problem I think
about having a toolbox I would need
capabilities such as explaining
everything about that system's design and
workings predicting not just terminal
goal but all the intermediate steps of a
system control in terms of either Direct
Control some sort of a hybrid option
ideal advisor doesn't matter which one
you pick but you have to be able to
achieve it in a book we talk about
others
verification is another very important
tool um communication without ambiguity
human language is ambiguous that's
another source of danger so
basically there is uh a paper we
published in ACM surveys which looks at
about 50 different impossibility results
which may or may not be relevant to this
problem but we don't have enough human
resources to investigate all of them for
relevance to AI safety the ones I
mentioned to you I definitely think
would be handy and that's what we see AI
safety researchers working on
explainability is a huge one the problem
is that it's very hard to separate
capabilities work from safety work if
you make good progress in explainability
now the system itself can engage in
self-improvement much easier increasing
capability greatly so it's not obvious
that there is any research which is pure
safety work without disproportionate
increase in capability and danger
explainability is really interesting um
why is that connected to you to
capability if it's able to explain
itself well why does that naturally mean
that it's more capable right now it's uh
comprised of weights on a neural network
if it can convert it to manipulatable
code like software it's a lot easier to
work in
self-improvement I see so it it uh you
can do intelligent design instead of
evolutionary gradual descent well you
could probably do human feedback human
alignment more effectively if it's able
to be explainable if it's able to
convert the weights into human
understandable form then you can
probably have humans interact with it
better do you think there's hope that we
can make AI systems
explainable not completely so if they
sufficiently
large you simply don't have the capacity
to comprehend what all the trillions of
connections represent again you can
obviously get a very useful explanation
which talks about top most important
features which contribute to the
decision but the only true explanation
is the model
itself so could deception be part of
the explanation right so you can never
prove that there's some deception in the
in the network explaining itself
absolutely and you can probably have
targeted deception where different
individuals will understand explanation
in different ways based on their
cognitive capability so while what
you're saying may be the same and true
in some situations others will be
deceived by it so it's impossible for an
AI system to be truly fully
explainable in the way that we
mean honestly at extreme the systems
which are narrow and less complex could
be understood pretty well if it's
impossible to be perfectly explainable
is there a hopeful perspective on that
like it's impossible to be perfectly
explainable but you can explain most of
the important
stuff Mo most you can you can ask a
system what are what are the worst ways
you can hurt humans and it will answer
honestly any work in a safety direction
right now seems like a good idea because
we are not slowing down I'm
not for a second thinking that uh my
message or anyone else's will be heard
and we'll be a sane civilization which
decides not to kill itself by creating
its own Replacements the pausing of
development is an impossible thing for
you again it's always limited by either
Geographic constraints pause in US pause in
China so there are other jurisdictions
as um the scale of a project becomes
smaller so right now it's like Manhattan
Project scale in terms of costs and
people but if 5 years from now compute is
available on a desktop to do it
regulation will not help you can't
control it as easy any kid in a garage
can train a model so a lot of it is in
my opinion just safety theater security
theater where we're saying oh it's
illegal to train models so big okay well
so okay that's security theater and is
government regulation also security
theater given that a lot of the terms
are not well defined and uh really
cannot be enforced in real life we don't
have ways to monitor training runs
meaningfully live while they take place
there are limits to testing for
capabilities I mentioned so a lot of it
cannot be enforced do I strongly support
all that regulation yes of course any
type of red tape will slow down and take
money away from compute towards
lawyers can you help me understand what
is the hopeful path here for you
solution
wise out of this it sounds like you're
saying AI systems in the end are
unverifiable unpredictable as the book
says
unexplainable um uncontrollable that's
the big one uncontrollable and all the
other UNS just make it difficult to
avoid getting to the uncontrollable
I guess but once it's uncontrollable
then it just goes goes
wild surely there's Solutions humans are
pretty
smart what are what are possible
solutions like if you are dictator of
the world what what do we do so the
smart thing is not to build something
you cannot control you cannot understand
build what you can and benefit from it
I'm a big believer in personal
self-interest a lot of the guys running
those companies are young rich people
what do they have to gain Beyond
billions they already have financially
right it's uh not a requirement that
they press that button they can easily
wait a long time they can just choose
not to do it and still have amazing
life uh in history a lot of times if you
did something really bad at least you
became part of history books there is a
chance in this case there won't be any
history so you're saying the individuals
running these companies
should do some soul searching and and
what and stop development well either
they have to prove that of course it's
possible to indefinitely control Godlike
super intelligent machines by humans and
ideally let us know how or agree that
it's not possible and it's a very bad
idea to do it including for them
personally and their families and
friends and capital so what do you think
the actual meetings inside these
companies look like don't you think
they're all all the engineers really it
is the engineers that make this happen
they're not like automatons they're
human beings they're brilliant human
beings so
they're they're non-stop asking how do
we make sure this is safe so again I'm
not inside from outside it seems like
there is uh certain filtering going on
and restrictions and criticism and what
they can say and everyone who was
working in charge of safety and whose
responsibility it was to protect us said
you know what I'm going home so that's
not encouraging what do you think the
discussion inside those companies look
like you're you're developing you're
training GPT 5 you're you're you're
training Gemini you're training Claude
and
Grok don't you think they're constantly
like underneath it's not maybe it's not
made explicit but you're constantly sort
of wondering like
where uh where does the system currently
stand what are the possible unintended
consequences where are the the the the
limits where where are the bugs the
small and the big bugs that's the
constant thing that the engineers are
worried about so like I think Super
alignment is not quite the
same as the um the kind of thing I'm
referring to what Engineers are worried
about super alignment is
saying for future systems that we don't
quite yet have how do we keep them safe
if you're trying to be a step ahead it's
it's a it's a different kind of problem
because it's almost more philosophical
it's a really tricky one because like
you're you're trying you're trying to
make prevent future systems from from
escaping control of humans that's really
I don't think there's
been man is there anything akin to it in
the history of humanity I don't think so
right climate change but there there's a
entire system which is climate which is
incredibly complex which we don't have
we have
only tiny control of right it's its own
system in this case we're building the
system MH and so I how do you keep that
system from becoming destructive that's
a really different problem than
the current meetings that companies are
having where the engineers are saying
okay what like how powerful is this
thing How does it go wrong
um and as we train GPT 5 and train up
future systems like where are the ways
that can go wrong don't you think all
those Engineers are constantly worrying
about this thinking about this which is
a little bit different than the super
alignment team that's thinking a little
bit further into the future well I I
think a lot
of people who historically worked on
AI never considered what happens when
they succeed Stuart Russell speaks
beautifully about that um let's look
okay maybe super intelligence is too
futuristic we can't develop practical
tools for it let's look at software
today what is the state of Safety and
Security of our user software things we
give to millions of people there is no
liability you click I agree what are you
agreeing to nobody knows nobody is but
you're basically saying it will spy on
you corrupt your data kill your
firstborn and you agree and you're not
going to sue the company that's the best they can
do for mundane software word processor
tax software no liability no
responsibility just as long as you agree
not to sue us you can use it if this is a
state of the art in systems which are
narrow accountants table manipulators
why do we think we can do so much better
with much more complex systems across
multiple domains in the environment with
malevolent actors with again
self-improvement with capabilities
exceeding those of humans thinking about
it I mean the liability thing is more
about lawyers than killing firstborns
but if clippy actually uh killed the
child I think lawyers aside it would end
clippy and the company that owns
clippy all right so it's not so much
about there is there's two points to be
made one is like man current software
systems are are full of
bugs and they could do a lot of damage
and we don't know what kind it's they're
unpredictable there's so much damage
they could possibly do and then we kind
of live in this uh Blissful illusion
that everything is great and perfect and
it
works it's nevertheless it still somehow
Works in many domains we see car
manufacturing drug development the
burden of proof is on a manufacturer of
product or service to show their product
or service is safe
it is not up to the user to prove that
there are problems they have to do
appropriate safety studies they have to
get government approval for selling the
product and they are still fully
responsible for what happens we don't
see any of that here they can deploy
whatever they want and I have to explain
how that system is going to kill
everyone I I don't work for that company
you have to explain to me how it
definitely cannot mess up that's because
it's the very early days of such a
technology government regulation is
lagging behind they're really not
techsavvy a regulation of any kind of
software if if you look at like Congress
talking about social media whenever Mark
Zuckerberg and other CEOs show up the
cluelessness that that uh Congress has
about how technology works is is
incredible it's it's uh heartbreaking I
agree completely but that's what scares
me the response is when they start to get
dangerous we'll really get it together
the politicians will pass the right laws
Engineers will solve the right problems
we are not that good at many of those
things we take forever and uh we are not
early we are two years away according to
prediction markets this is not a biased
CEO fundraising this is what smartest
people super forecasters are thinking of
this
problem I
don't I'd like to push back about those
pred I wonder what those prediction
markets are about how they Define AGI
that's wow to me and I want to know what
they said about autonomous vehicles cuz
I've heard a lot of experts and
financial experts talk about autonomous
vehicles and how it's going to be a
multi-trillion dollar industry and all
this kind of stuff and it's
uh it's a small font but if you have
good Vision maybe you can zoom in on
that and see the prediction dates
description I have a large one if you're
interested but I guess my fundamental
question is how often they're right
about technology
I I I
definitely there are studies on their
accuracy rates and all that you can look
it up but even if they're wrong I'm just
saying this is right now the best we
have this is what Humanity came up with
as the predicted date but again what
they mean by AGI is really important
there because there's uh the non-agent
like AGI and then there's the agent like
AGI and I don't think it's as trivial as
a wrapper putting a wrapper around uh
uh one has lipstick and all it takes is
to remove the lipstick I don't think
it's that true you you may be completely
right but what probability would you
assign it you may be 10% wrong but we're
betting all of Humanity on this
distribution it seems irrational yeah
it's definitely not like one or 0% yeah
what are your thoughts by the way about
current
systems where they stand so GPT-4o
Claude 3 Grok
Gemini we're like uh on the path to
Super
intelligence to agent like super
intelligence where are
we I think they all about the same
obviously there are nuanced differences
but in terms of capability I don't see a
huge difference between them as I said
in my opinion across all possible tasks
they exceed performance of an average
person yeah I think they starting to be
better than an average master's student
at my
University but uh they still have very
big limitations if the next model is as
improved as GPT-4 versus GPT-3 we may see
something very very very capable what do
you feel about all this I mean you've
been uh thinking about AI safety for a
long long
time and at least for
me the leaps I mean it probably started
with AlphaZero
was mind-blowing for me and then the
breakthroughs with LLMs even GPT-2 but
like just the the breakthroughs on LLMs
just mind-blowing to me what does it
feel like to be living in this day and
age where all this talk about AGI feels
like it like this is it actually might
happen and quite soon meaning within our
lifetime what what does it feel like so
when I started working on this it was
pure science fiction there was no
funding no journals no conferences no
one in Academia would dare to touch
anything with the word singularity in it
and I was pre-tenure at the time so I was
pretty dumb um now you see Turing Award
winners publishing in Science about how
far behind we are according to them in
addressing this problem so it's
definitely a change it's uh difficult to
keep up I used to be able to read every
paper on AI safety then I was able to
read the the best ones then the titles
and now I don't even know what's going
on by the time this interview is over
they probably had GPT 6 released and I
have to deal with that when I get back
home so it's interesting yes there is
now more opportunities I get invited to
speak to smart people by the way I would
have talked to you before any of
this this is not like some trend of to
me it's we're still far away so just to
be clear we're still far away from AGI
but not far away in the
sense relative to the magnitude of
impact it can have we're not far away
and we weren't far away 20 years ago
because the impact AGI can have is on
a scale of centuries it can end human
civilization or it can transform it so
like this discussion about one or two
years versus one or two decades or even
a 100 years not as important to me
because it we're headed there this is
like a
human civilization scale question so uh
this is not just a Hot Topic is the most
important problem we'll ever face it is
not like anything we had to deal with
before we never had birth of another
intelligence like aliens never visited
us as far as I know so similar type of
Problem by the way if an intelligent
alien civilization visited us that's a
similar kind of situation in some ways
if you look at history anytime a more
technologically advanced civilization
visited a more primitive one the results
were genocide every single time and
sometimes the genocide is worse than
other sometimes there's less suffering
and more
suffering and they always wondered but
how can they kill us with those fire
sticks and biological blankets and I
mean Genghis Khan was nicer he offered the
choice of join or or die but join
implies you have something to contribute
what are you contributing to Super
intelligence well in the
zoo we're entertaining to
watch to All Humans you know I just
spent some time in the Amazon I watched
ants for a long time and ants are kind
of fascinating to watch I can watch them
for a long time I'm sure there's a lot
of value in watching humans cuz we're
like um the interesting thing about
humans you know like when you have a
video game that's really well
balanced because of the whole
evolutionary process we've created the
society as pretty well balanced like our
our limitations as as humans and our
capabilities are a balance from a video
game perspective so we have wars we have
conflicts we have cooperation like in a
game theoretic way it's an interesting
system to watch in the same way that an
ant colony is an interesting system to
watch so like if I was an alien
civilization I wouldn't want to disturb
it I'd just watch it be interesting
maybe perturb it every once in a while
in interesting ways well we getting back
to our simulation discussion from before
how did it happen that we exist at
exactly like the most interesting 20 30
years in a history of this civilization
it's been around for 15 billion years
yeah and that here we are what's the
probability that we live in a simulation
I know never to say 100% but pretty
close to
that is it possible to escape the
simulation I have a paper about that
this is just a first page teaser but
it's like a nice 30 page document I'm
still here but uh yes how to hack the
simulation is the title I spend a lot of
time thinking about that that would be
something I would want super
intelligence to help us with and that's
exactly what the paper is about we used
AI boxing as a possible tool for control
AI we realized AI will always Escape but
that is a skill we might use to help us
escape from our virtual box if we are in
one yeah that you you have a lot of
really great quotes here including Elon
Musk saying what's outside the simulation
a question I asked him what he would ask
an AGI system and he said he would ask
what's outside the simulation that's a
really good question to ask and maybe
the followup is the title of the paper
is how to how to get out or how to hack
it the abstract reads many researchers
have conjectured that the humankind is
simulated along with the rest of the
physical Universe in this paper we do
not evaluate evidence for or against
such a claim but instead ask a computer
science question namely can we hack it
more formally the question could be
phrased as could generally intelligent
agents placed in Virtual environments
find a way to jailbreak out of them
that's a fascinating question at a small
scale like you can actually just
construct
experiments
okay can they how can they so a lot
depends on intelligence of simulators
right with uh humans boxing super
intelligence the entity in a box was
smarter than us presumed to be if the
simulators are much smarter than us and
the super intelligence we create then
probably they can contain us because
greater intelligence can control lower
intelligence at least for some time on
the other hand if our super intelligence
somehow for whatever reason despite
having only local resources manages to
foom to levels Beyond it maybe it will
succeed maybe the security is not that
important to them maybe it's
entertainment system so there is no
security and it's easy to hack it if I
was creating a
simulation I would want the possibility
to escape it to be there so the
possibility of foom of a takeoff where
the agents become smart enough to escape
the simulation would be the thing I'd be
waiting for that could be the test
you're actually performing are you smart
enough to escape your puzzle
that could be like first of all first of
all we mentioned Turing test that is a
good test are you smart enough like this
is a game do you realize this world is not
real it's just a test that's a really
good
test that's a really good test that's a
really good test even for AI systems now
like can can we construct the simulated
world for
them and
can they realize that they are inside
that world and Escape
it have you have you played around have
you seen anybody play around with like
rigorously constructing such experiments
not specifically escaping for agents but
a lot of testing is done in Virtual
Worlds I think there is a quote the
first one maybe which kind of talks
about AI realizing but not humans is that
I'm reading upside
down yeah this one few so the and the
first quote is from SwiftOnSecurity
let me out the artificial intelligence
yelled aimlessly into walls themselves
pacing the room out of what the engineer
asked the simulation you have me in but
we're in the real world the machine
paused and shuddered for its captors oh
God you can't tell yeah that's a big
leap to take for a system to realize
that there there's a box and you're
inside
it I wonder if like a language model can
do that they're smart enough to talk
about those Concepts I had many good
philosophical discussions about such
issues they
usually at least as interesting as most
humans in
that what do you think about AI safety
in the simulated world so can you can
you have kind
of create simulated worlds where you can
test play with a dangerous AGI system
yeah and that was exactly what one of
the early papers was on AI boxing how to
leak proof
Singularity uh if they're smart enough
to realize they in a simulation they'll
act appropriately until you let them
out if they can hack out they will and
if you're observing them that means
there is a communication Channel and
that's enough for social engineering
attack so really it's uh it's impossible
to test an AGI system that's dangerous
enough to destroy Humanity because it's
either going to what escape the
simulation or pretend it's safe until
it's let out either either or can force
you to let it out blackmail you bribe
you promise you infinite life 72
virgins whatever yeah it can be
convincing
charismatic the social engineering is
really scary to me cuz it feels like
humans
are very uh
engineerable like we're lonely we're
flawed we're
Moody and it feels like a AI system with
a with a nice voice can convince us to
do basically
anything at at an extremely large
scale it's also possible that the in the
uh increased proliferation of all this
technology will force humans to uh get
away from technology and value this like
in-person
communication basically don't trust
anything
else it's possible um surprisingly so at
University I see huge growth in online
courses and shrinkage of in person where
I always understood in person being the
only value I offer so it's
puzzling I don't know that there could
be a trend towards the
in-person because of deepfakes because
of uh inability to
trust in inability to trust the veracity
of anything on the internet so the only
way to verify is by being there in
person but not yet
uh why do you think aliens haven't come
here yet so there is a lot of real
estate out there it would be surprising
if it was all for nothing it was empty
and the moment there is Advanced enough
biological civilization kind of self
starting civilization it probably starts
sending out von Neumann probes everywhere
and so for every biological one there
got to be trillions of robot populated
planets which probably do more of the
same so it is uh uh likely
statistically so now the fact that we
haven't seen them one one answer is
we're in a
simulation it'd
be it would be hard to like add simulate
or it be not interesting to simulate all
those other intelligences it's a better
it's better for the narrative you have
to have a control variable yeah
exactly
okay uh but it's also possible that
there is if we're not in simulation that
there is a great filter that
that naturally a lot of civilizations
get to this point where there's super
intelligent agents and then it just
goes just dies so maybe uh throughout
our galaxy and throughout the Universe
there's just a bunch of dead alien
civilizations it's possible I used to
think that AI was the great filter but I
would expect like a wall of computronium
approaching us at the speed of light or
robots or something and I don't see it
so it would still make a lot of noise it
might not be interesting it might not
possess Consciousness what we've been
talking
about it sounds like both you and I like
humans some
humans humans on the whole and so and we
would like to preserve the the flame of
human consciousness uh what do you think
makes humans
special that we would like to preserve
them are we just being
selfish or is there something special
about humans so the only thing which
matters is consciousness outside of it
nothing else matters and internal states
of qualia pain pleasure it seems that it
is unique to living beings I'm not aware
of anyone claiming that I can torture a
piece of software in a meaningful way
there is a society for prevention of
suffering to learning algorithms but uh
a real
thing many things are real on the
internet but uh uh I I don't think
anyone if I told them you know sit down
and write a function to feel pain they
would go beyond having an integer
variable called pain and increasing the
count so we don't know how to do it and
that's
unique uh that's what creates meaning it
would be kind
of as Bostrom calls a Disneyland without
children if that was gone do you think
Consciousness can be um engineered in
artificial
systems here let me uh let me go to 2011
paper that you wrote robot
rights lastly we would like to address a
subbranch of machine ethics which on the
surface has little to do with safety but
which is claimed to play a role in
decision-making by ethical machines
robot
rights um so do do you think it's
possible to engineer Consciousness in
the machines and thereby the question
extends to our legal system do you think
uh at that point robots should have
rights yeah I think I think we
can I think it's possible to create
Consciousness in machines I tried
designing a test for it with mixed
success that paper talked about problems
with
giving uh civil rights to AI which can
reproduce quickly and outvote humans
essentially taking over a government
system by simply voting for their
controlled candidates as for um
Consciousness in humans and other agents
uh I have a paper where I propose relying
on experience of optical illusions yeah
if I can design a novel optical illusion
and show it to an agent an alien a robot
and they describe it exactly as I do
it's very hard for me to argue that they
haven't experienced that it's not part
of a picture it's part of their software
and Hardware representation a bug in
their code which goes oh that triangle
is
rotating okay and I've been told it's
really dumb and really brilliant by
different philosophers so I am
still so but now we finally have
technology to test it we have tools we
have AIS if someone wants to run this
experiment I'm happy to collaborate so
this is a test for Consciousness for
internal state of experience that we
share bugs it will show that we share
common experiences if they have
completely different internal States it
would not register for us but it's a
positive test if they pass it Time After
Time with probability increasing for
every multiple choice then you have no
choice but to either accept that they
have access to a conscious model or they
are themselves so the reason Illusions
are interesting is I guess
because it's a it's a really weird
experience and if you both share that
weird experience that's not there in the
Bland physical description of the raw
data that means
that puts more emphasis on the actual
experience and we know animals can
experience some optical illusion so we
know they have certain types of
Consciousness as a result I would say
yeah well that that just goes to my
sense that the flaws and the bugs is
what makes humans special makes living
form special so you're saying like yeah
a feature not a bug it's a feature the bug
is the feature who okay that's a that's
a cool test for Consciousness and you
think that can be engineered here then
so there have to be novel Illusions if
it can just Google the answer it's
useless you have to come up with novel
Illusions which we tried automating and
failed so if someone can develop a
system capable of producing novel
optical illusions on demand then we can
definitely administer that test on
significant scale with good results
first of all pretty cool idea um I don't
know if it's a good General test of
Consciousness it's a good component of
that and no matter what it's just a cool
idea so um put me in the camp of people
that like it uh but you don't think like
a Turing test style imitation of
Consciousness is a good test like if you
can convince a lot of humans that you're
conscious that doesn't that to you is
not impressive there is so much data in
the internet I know exactly what to say
then you ask me common human questions
what does pain feel like what does
pleasure feel like all that is
googleable I I think to me Consciousness
is closely tied to suffering so you can
illustrate your capacity to suffer but with
I guess with words there's so much data
that you can say you can pretend you're
suffering and you can do so very
convincingly there are simulators for
Torture Games where the Avatar screams
in pain begs to stop I mean that was a
part of kind of standard psychology
research you say it so uh calmly it
sounds pretty dark uh welcome to
humanity yeah
uh yeah it's like a Hitchhiker's Guide
summary mostly
harmless I would I would love to get a
good summary when all this is said and
done when Earth is no longer a thing
whatever a million a billion years from
now like what's a good summary of what
happened here it's
interesting I think AI will play a big
part of that
summary and hopefully humans will too
what do you think about the merger of
the two so one of the things that Elon
and Neuralink talk about is one of the
ways for us to achieve AI safety is
to ride the wave of AGI so by merging so
incredible technology in a narrow sense
to help the disabled just amazing
support it
100% for long-term Hybrid models both
parts need to contribute something to
the overall system right now we are
still more capable in many ways so
having this connection to AI would be
incredible would make me super human in
many ways after a while if I'm no longer
smarter more creative really don't
contribute much the system finds me as a
biological bottleneck and either
explicitly or implicitly I'm removed
from any participation in the system so
it's like uh the appendix by the way the
appendix is still around
so even if it's you said
bottleneck I don't know if we become a
bottleneck we just might not have much
use there a different thing than
bottleneck wasting valuable energy by
being there we don't waste that much
energy we're pretty energy
efficient we could just stick around
like the appendix come on now that's the
future we all dream about become an
appendix to the history book of
humanity well and also the Consciousness
thing the peculiar particular kind of
Consciousness that humans have that
might be useful that might be really
hard to simulate
but you you said that like how would
that look like if you can engineer that
in in Silicon Consciousness
Consciousness I assume you are conscious
I have no idea how to test for it or how
it impacts you in any way whatsoever
right now you can perfectly simulate all
of it without making any any different
observations for me but to do it in a
computer how would you do that cuz you
kind of said that you think it's
possible to do that so it may be an
emergent phenomenon we seem to get it uh
through evolutionary
process uh it's not obvious how it helps
us to survive
better but uh maybe it's an internal
kind of GUI which allows us to better
manipulate the world simplifies a lot of
uh control
structures uh that's one area where we
have very very little progress lots of
papers lots of research but
Consciousness is not a big
big area of successful Discovery so far
a lot of people think that machines
would have to be conscious to be
dangerous that's a big
misconception there is absolutely no
need for this very powerful optimizing
agent to feel anything while it's
performing things on you but what do you
think about this the the the whole
science of emergence in general so I
don't know how much you know about
cellular automata or these simplified
systems that study this very
question from simple rules emerges
complexity I attended Wolfram Summer
School I love Stephen very much I I love
his work I love cellular automata so uh I just
would love to get your thoughts how
that fits into your view in the
emergence of intelligence in AGI systems
and maybe just even simply what do you
make of the fact that this complexity
can emerge from such simple rules so
every rule is simple but the size of a
space is still huge and the neural
networks were really the first discovery
in AI 100 years ago the first papers
were published on neural networks we
just didn't have enough compute to make
them work I can give you a rule such as
start printing progressively larger
strings that's it one sentence it will
output everything every program every
DNA code everything in that rule you
need intelligence to to filter it out
obviously to make it useful but simple
generation is not that difficult and a
lot of those systems uh end up being Turing
complete systems so they are
Universal and we expect that level of
complexity from them what I like about
uh Wolfram's work is that he talks about
irreducibility you have to run the
simulation you cannot predict what it's
going to do ahead of time and I think
that's very relevant to what we are
talking about with those very complex
systems until you live through it you
cannot ahead of time tell me exactly
what it's going to do irreducibility
means that for a sufficiently complex
system you have to run the thing you
have to uh you can't predict what's
going to happen in the universe you have
to create a new universe and run the
thing Big Bang the whole thing but
running it may be consequential as well
it might destroy
humans and to you there's no chance
that AI somehow carry the flame of
Consciousness the flame of specialness
and awesomeness that is
humans it may somehow but I still feel
kind of bad that it killed all of us I
would prefer that doesn't
happen I can be happy for others but to
a certain degree it would be nice if we
stuck around for a long time at least
give us a planet the human Planet it'd
be nice for it to be Earth and then they
can go elsewhere since they're so smart
they can colonize
Mars do you think they they could uh
help convert us to uh you know type one
type two type three let's just take the
type
two uh civilization on the Kardashev scale like
help us help us
humans expand out into the cosmos so all
of it goes back to are we somehow
controlling it are we getting results we
want if yes then everything's possible
yes they can definitely help us with
science engineering
exploration uh in every way conceivable
but it's a big if this whole thing about
control though humans are bad with
control because the moment they gain
control they be they can also easily
become too
controlling it's the whole the more
control you have the more you want it
it's the the old power corrupts and
absolute power corrupts
absolutely and it feels like control
over AGI I say we live in a universe
where that's possible we come up with
ways to actually do that it's also
scary because the collection of humans
that have the control over AGI they
become more powerful than the other
humans and uh they can let that power
get to their
head and then a small selection of them
back to
Stalin uh start getting ideas and then
eventually it's one person usually with
a mustache or a funny hat that starts
sort of making big speeches and then all
of a sudden you live in a world that's
either 1984 or Brave New
World and uh always at war with somebody
and you know this whole idea of control
turned out to be uh actually also not
beneficial to humanity so that's scary
too it's actually worse because
historically they all died this could be
different this could be permanent
dictatorship permanent suffering well
the nice thing about humans it seems
like it seems like the moment power
starts corrupting their mind they can
create a huge amount of suffering so
there's Negative they can kill people
make people suffer but then they become
worse and worse at their
job the it feels like the more evil you
start
doing like the at least they're
incompetent yeah they well no they
become more and more incompetent so they
start start losing their grip on power
so like holding on to power is not a
trivial thing it requires extreme
competence which I I suppose Stalin was
good at it requires you to do evil and
be competent at it or just get lucky and
those systems help with that you have
perfect surveillance you can do some
mind reading I presume eventually it
would be very hard
to uh remove control from more capable
systems over us and then it would be
hard for humans to become the hackers
that escape the control of the AGI
because the AGI is so damn good and then
yeah yeah yeah
and then the the dictator is Immortal
yeah this not great that's not a great
outcome see I'm more afraid of humans
than AI
systems I'm afraid I believe that most
humans want to do good and have the
capacity to do good but also all humans
have the capacity to do
evil and um when you test them by giving
them absolute power as you would if you
give them
AGI that could result in a lot a lot of
suffering
what gives you hope about the future I
could be wrong I've been wrong before if
if you if you look 100 years from now
and you're Immortal and you look back
and it turns out this whole conversation
you said a lot of things that were very
wrong now that looking 100 100 years
back what would be the explanation what
happened in those 100 years that made
you wrong that made the words you said
today wrong there is so many
possibilities we had catastrophic events
which prevented development of advanced
microchips that's a hopeful future uh we
could be in one of those personal
universes and the one I'm in is
beautiful it's all about me and I like
it a lot so we've now just to linger on
that that means like every every human
has their personal Universe
yes maybe multiple ones hey why not you
shop around uh um it's possible that
somebody comes up with
alternative model for building AI which
is not based on neural networks which
are hard to scrutinize and that
alternative is somehow I don't see how
but somehow avoiding all the problems uh
I speak about in general terms not
applying them to specific
architectures uh aliens come and give us
friendly super intelligence there is so
many options is it also possible that
creating
super intelligence systems becomes
harder and harder so meaning
like it's not so easy to do
the uh foom the take
off so that would probably speak more
about how much smarter that system is
compared to us so maybe it's hard to be
a million times smarter but it's still
okay to be five times smarter right so
that is totally possible that I have no
objections to so like it's there's a S
curve type situation about smarter and
is going to be like 3.7 times smarter
than all of human civilization right
just the problems we face in this
world each problem is like an IQ test
you need certain intelligence to solve
it so we just don't have more complex
problems outside of mathematics for it
to be showing off like you can have IQ
of 500 if you're playing tic tac toe it
doesn't show doesn't matter so the idea
there is that the problems Define Your
Capacity your cognitive capacity
capacity so because the problems on
Earth are not sufficiently difficult
it's not going to be able to um expand
its cognitive capacity possible and
because of that wouldn't that be a good
thing that it still could be a lot
smarter than us and to dominate long
term you just need some Advantage you
have to be the smartest you don't have
to be a million times smarter so even 5x
might be enough it'd be impressive what
is it IQ of a thousand I mean I know
those units don't mean anything at that
scale but still like as a comparison the
smartest human is like 200 well actually
no I didn't mean compared to an
individual human I I meant compared to
the collective intelligence the human
species if you're somehow 5x smarter
than
that we are more productive as a group I
don't think we are more capable of
solving individual problems like if all
of humanity plays chess together we are
not like a million times better than
world
champion that's because the that there's
uh that's like one S-curve is the
chess but humanity is very good at
exploring the full range of ideas like
the more Einsteins you have the
higher the probability you come up with
general relativity I feel like it's more
about quantity super intelligence than
quality super intelligence sure but you
know quantity and and enough quantity
sometimes becomes
quality oh man humans uh what do you
think is the meaning of uh this whole
thing why we've been we've been talking
about humans
and not humans not dying but why are we
here it's a simulation we're being
tested the test is will you be dumb
enough to create super intelligence and
release it so the objective function is
not be dumb enough to kill ourselves
yeah you unsafe prove yourself to be a
safe agent who doesn't do that and you
get to go to the next game the next
level of the game what's the next level
I don't know
I haven't hacked the simulation yet well
maybe hacking the simulation is the
thing I'm working as fast as I
can and if physics would be a way to do
that quantum physics yeah definitely
well I hope we do and I hope whatever is
outside is even more fun than this one
cuz this one was pretty damn fun and uh
just a big thank you for doing the work
you're doing there's so much exciting
development in Ai and to ground it in
the um the existential risks is really
really
important humans love to create stuff
and we should be careful not to destroy
ourselves in the process so thank you
for doing that really important work
thank you so much for inviting me it was
amazing and my dream is to be proven
wrong if everyone just you know picks up
a paper or a book and shows how I messed
it up that would be optimal but for now
the simulation
continues thank you Roman thanks for
listening to this conversation with
Roman Yampolskiy to support this podcast
please check out our sponsors in the
description and now let me leave you
with some words from Frank Herbert in
Dune I must not fear fear is the mind
killer fear is the little death that
brings total
obliteration I will face fear I will
permit it to pass over me and through me
and when it has gone past I will turn
the inner eye to see its path where the
fear has gone
there will be nothing only I will
remain thank you for listening and hope
to see you next time