File TXT tidak ditemukan.
Roman Yampolskiy: Dangers of Superintelligent AI | Lex Fridman Podcast #431
NNr6gPelJ3E • 2024-06-02
Transcript preview
Open
Kind: captions
Language: en
if we create General super intelligences
I don't see a good outcome longterm for
Humanity so that is X risk existential
risk everyone's dead there is srisk
suffering risks where everyone wishes
they were dead we have also idea for IR
risk iyy risks where we lost our meaning
the systems can be more creative they
can do all the jobs it's not obvious
what you have to contribute to a world
where super intelligence exists of
course you can have all the variants you
mentioned where we are safe we are kept
alive but we are not in control we are
not deciding anything we like animals in
a zo there is
again possibilities we can come up with
as very smart humans and then
possibility is something a thousand
times smarter can come up with for
reasons we cannot
comprehend the following is a
conversation with Roman yski an AI
Safety and Security research
and author of a new book titled AI
unexplainable unpredictable
uncontrollable he argues that there's
almost 100% chance that AGI will
eventually destroy human civilization as
an aside let me say that I will have
many often technical conversations on
the topic of AI often with Engineers
building the state of the art AI systems
I would say those folks put the infamous
P Doom or the probability of a GI
killing all humans at around 1 to
20% but it's also important to talk to
folks who put that value at 70 80
90 and is in the case of Roman at
99.99 and many more 9es
per. I'm personally excited for the
future and believe it will be a good one
in part because of the amazing
technological innovation we humans
create but we must absolutely not do so
with blinders on ignoring the possible
risks including existential risks of
those
Technologies that's what this
conversation is about this is the Lex
Freedman podcast to support it please
check out our sponsors in the
description and now dear friends here's
Roman
yski what to you is the probability that
super intelligent AI will destroy all
human civilization what's the time frame
let's say 100 years in the next 100
years so the problem of controlling AI
or super intelligence in my opinion is
like a problem of creating a Perpetual
safety Machine by analogy with perpetual
motion machine is impossible yeah we may
succeed and do a good job with GPT 5 6 7
but they just keep improving learning
eventually self-modifying interacting
with the environment interacting with
malevolent
actors the difference between cyber
security narrow AI safety and safety for
General AI for super intelligence is
that we don't get a second chance with
cyber security somebody hacks your
account what's the big deal you get a
new password new credit card you move
on here if we're talking about
existential risks you only get one
chance so you're really asking me what
are the chances that will create the
most complex software ever on the first
try with zero bugs and it will continue
have zero bugs for 100 years or
more so there is an incremental
Improvement of systems leading up to AGI
to you it doesn't matter if we can keep
those safe there's going to be one level
of
system at which you cannot possibly
control
it I don't think we so far have made any
system safe at the level of capability
they display they already have made
mistakes we had accidents they've been
jailbroken I don't think there is a
single large language model today which
no one was successful at making do
something developers didn't intend it to
do but there's a difference between
getting it to do something unintended
getting it to do something that's
painful costly destructive and something
that's destructive to the level of
hurting billions of people or hundreds
of millions of people billions of people
or the entirety of human civilization
that's a big leap exactly but the
systems we have today have capability of
causing x amount of damage so then they
fail that's all we get if we develop
systems capable of impacting all of
humanity all of universe the damage is
proportionate what to you are the
possible ways that
such kind of mass murder of humans can
happen it's always a wonderful question
so one of the chapters in my new book is
about unpredictability I argue that we
cannot predict what a smarter system
will do so you're really not asking me
how super intelligence will kill
everyone you're asking me how I would do
it and I think it's not that interesting
I can tell you about the standard you
know nanotag synthetic bionuclear super
intelligence will come up with something
completely new completely
super we may not even recognize that as
a possible path to achieve that goal so
there is like a unlimited level of
creativity in terms of how humans could
be
killed but you know we could still
investigate possible ways of doing it
not how to do it but the at the end what
is the methodology that does it you know
shutting off the
power and then humans start killing each
other maybe because the resource is are
really constrained that there then
there's the actual use of weapons like
nuclear weapons or developing artificial
pathogens viruses that kind of stuff we
could still kind of think through that
and defend against it right there's a
ceiling to the creativity of mass murder
of humans here right the options are
limited they are limited by how
imaginative we are if you are that much
smarter that much more creative you are
capable of thinking across multiple
domains do no research in physics and
biology you may not be limited by those
tools if squirrels were planning to kill
humans they would have a set of possible
ways of doing it but they would never
consider things we can come up so are
you are you thinking about mass murder
and destruction of human civilization or
you thinking of with squirrels you put
them in a zoo and they don't really know
they're in a zoo if we just look at the
entire set of undesirable
trajectories majority of them are not
going to be death most of them are going
to be just like
uh things like Brave New World
where you know the squirrels are fed
dopamine and they're all like doing some
kind of fun activity and the sort of the
fire the soul of humanity is lost
because of the drug that's fed to it or
like literally in a zoo we're in a zoo
we're doing our thing we're like playing
a game of
Sims and the actual players playing that
game are AI systems those are all
undesirable because sort of sort of the
the free
will the fire of human consciousness is
dimmed through that process but it's not
killing humans so like are you thinking
about that or is the biggest concern
literally the extinctions of humans I
think about a lot of things so there is
X risk existential risk everyone's dead
there is srisk suffering risks where
everyone wishes they were that we have
also idea for IR risk iy risks where
lost their meaning the systems can be
more creative they can do all the jobs
it's not obvious what you have to
contribute to a world where super
intelligence exists of course you can
have all the variants you mentioned
where we are safe we are kept alive but
we are not in control we are not
deciding anything we like animals in a
zo there is
again possibilities we can come up with
as very smart humans and then
possibility is something a thousand
smarter can come up with for reasons we
cannot comprehend I would love to sort
of dig into each of those X risk srisk
and IR risk so can can you like Linger
on irisk what is that so Japanese
concept of viky guy you find something
which allows you to make money you are
good at it and the society says we need
it so like you have this awesome job you
are
podcaster gives you a lot of meaning you
have a good life I assume you happy
mhm that's what we want most people to
find to have for many
intellectuals it is their occupation
which gives them a lot of meaning I am a
researcher philosopher scholar that
means something to me in a world where
an artist is not feeling appreciated
because his art is just not competitive
with what is produced by machines or
writer or scientist will lose a lot of
that
and at the lower level we're talking
about complete technological
unemployment we're not losing 10% of
jobs we're losing all jobs what do
people do with all that free time what
happens then everything Society is built
on is completely modified in one
generation it's not a slow process where
we get to kind of figure out how to live
that new lifestyle but it's uh pretty
quick in that world can't humans just do
what humans currently do with chess play
each other have tournaments even though
AI systems are far superior at this time
in chess so we just create artificial
games or for us they're real like the
Olympics and we do all kinds of
different competitions and have fun
Focus maximize the fun and and uh let uh
the AI focus on the productivity it's an
option I have a paper where I try to
solve the value alignment problem for
multiple agents and the solution to
avoid compromise is to give everyone a
personal virtual Universe you can do
whatever you want in that world you
could be king you could be slave you
decide what happens so it's basically a
glorified video game where you get to
enjoy yourself and someone else takes
care of your needs and the substrate
alignment is the only thing we need to
solve we don't have to get 8 billion
humans to agree on anything mhm so okay
so what why is that not a likely outcome
why can't they systems create video
games for us to lose ourselves in each
each with an individual video game
Universe some people say that's what
happened we're in a simulation and we're
playing that video game and now we're
creating uh what maybe we're creating
artificial threats for ourselves to be
scared about cuz cuz fear is really
exciting it allows us to play the video
game more uh more vigorously and some
people choose to play on a more
difficult level with more con strange
some say okay I'm just going to enjoy
the game high privilege level absolutely
so okay what was that paper on
multi-agent value alignment personal
universes personal universes so so
that's one of the possible outcomes but
what what what in general is the idea of
the paper so it's looking at multiple
agents they're human AI like a hybrid
system whether it's humans and AI or is
it looking at humans or just so this is
intelligent agents in order to solve
value alignment problem I'm trying to
formalize it a little better usually
we're talking about getting AIS to do
what we want which is not well defined
we're talking about creator of a system
owner of that AI Humanity as a whole but
we don't agree on much there is no
universally accepted ethics morals
across cultures religions people have
individually very different preferences
politically and such so even if we
somehow managed all the other aspects of
it programming those fuzzy Concepts and
getting to follow them closely we don't
agree on what to program in so my
solution was okay we don't have to
compromise on room temperature you have
your Universe I have mine whatever you
want and if you like me you can invite
me to visit your Universe we don't have
to be independent but the point is you
can be and virtual reality is getting
pretty good it's going to hit a point
where you can't tell the difference and
if you can't tell if it's real or not
what's the difference so basically give
up on value alignment create and entire
it's like the the Multiverse Theory this
just create an entire universe for you
where your values you still have to
align with that individual they have to
be happy in that simulation but it's a
much easier problem to align with one
agent versus 8 billion agents plus
animals aliens so you convert the
multi-agent problem into a single agent
problem I'm trying to do that yeah okay
is there any way to is so okay that's
giving up on the on the value problem
well is there any way to solve the value
alignment problem where there's a bunch
of humans multiple humans tens of humans
or 8 billion humans that have very
different set of values it seems
contradictory I haven't seen anyone
explain what it means outside of kind of
words which pack a lot make it good make
it desirable make it something they
don't regret but how do you specifically
formalize those Notions how do you
program women I haven't seen anyone uh
make progress on that so far but isn't
that the whole optimization Journey that
we're doing as a human civilization
we're looking at
geopolitics nations are in a state of
Anarchy with each other they start wars
there's
conflict and often times they have a
very different views of what is good and
and what is evil isn't that what we're
trying to figure out just together
trying to converge towards that so we're
essentially trying to solve the value
alignment problem with humans right but
the examples you gave uh some of them
are for example two different religions
saying this is our holy sight and we are
not willing to compromise it in any way
if you can make two holy s sites in
virtual world you solve the problem but
if you only have one it's not divisible
you kind of stuck there but what if we
want to be a tension with each other and
that through that tension we understand
ourselves and we understand the world so
that that's the intellectual Journey
we're on we're on as a human
civilization is we create intellectual
and physical conflict and through that
figure stuff out if we go back to that
idea of simulation and this is a
entertainment kind of giving meaning to
us the question is how much suffering is
reasonable for a video game so yeah I
don't mind you know a video game where I
get heptic feedback there is a little
bit of shaking maybe I'm a little scared
I don't want a game where like kids are
tortured
literally that seems unethical at least
by our human standards are you
suggesting it's possible to remove
suffering if we're looking at human
civilization as an optimization problem
so we know there are some humans who
because of a mutation don't experience
physical pain so at least physical pain
can
be mutated out re-engineered out
suffering in terms of meaning like you
burn the only copy of my book is a
little harder but even there you can
manipulate your honic set point you can
change defaults you can reset problem
with that is if you start messing with
your reward Channel you start
wireheading and uh end up bissing out uh
a little too much well that's the the
question would you really want to live
in a world where there's no suffering
that's a dark question is there some
level of
suffering that reminds us of what this
is all for
I I think we need that but I would
change the overall range so right now
it's negative Infinity to kind of
positive Infinity pain pleasure AIS I
would make it like zero to positive
infinity and being unhappy is like I'm
close to
zero okay so what what's the srisk what
are the possible things that you're
imagining with srisk so Mass suffering
of humans what are we talking about
there caused by AGI so there are many
malevolent actors we can talk about
Psychopaths crazies hackers Doomsday
cults we know from history they tried
killing everyone they tried on purpose
to cause maximum amount of damage
terrorism what if someone malevolent
wants on purpose to torture all humans
as long as possible you solve aging so
now you have functional immortality and
you just try to be as creative as you
can do you think there is actually
people in human history that try to
literally maximize human suffering in
just studying people have done evil in
the world it seems that they think that
they're doing good and it doesn't seem
like they're trying to maximize
suffering they just cause a lot of
suffering as a side effect of doing what
they think is good so there are
different malevolent agents some may be
just gaining personal benefit and
sacrificing others to that cause others
will know for a fact trying to kill as
many people as possible when we look at
recent school shootings if they had more
capable weapons they would take out not
dozens but thousands millions
billions well we don't know
that but that is a terrifying
possibility and we don't want to find
out like if terrorists had access to
nuclear weapons how far would they go is
there a limit to what they're
willing to
do in your senses there's some
malevolent actors where there's no limit
there is
mental mental diseases where people
don't have empathy don't
have this
human quality of understanding suffering
in others and then there's also set of
beliefs where you think you're doing
good uh by killing a lot of
humans again I would like to assume that
normal people never think like that it's
always some sort of psychopaths but yeah
and to you AGI systems can carry that
and uh be more competent at executing
that they can certainly be more creative
they can understand human biology better
understand our molecular structure
genome uh again uh a lot of times uh
torture ends then in individual dies
that limit can be removed as well so if
we're actually looking at x risk and
srisk as the systems get more and more
intelligent don't you think it it's
possible to anticipate the ways they can
do it and defend against it like we do
with the cyber security with the do
security systems right uh we can
definitely keep up for a while I'm
saying you cannot do it indefinitely at
some point the cognitive Gap is too big
the surface you have to defend is
infinite but attackers only need to find
one exploit so to you eventually this is
we're heading off a cliff if we create
General super intelligences I don't see
a good outcome long term for Humanity
the only way to win this game is not to
play it okay well we we we'll talk about
possible solutions and what not playing
it means um but what are the possible
timelines here to you what are we
talking about we're talking about a set
of years decades centuries what do you
think I don't know for sure the
prediction markets right now are saying
2026 for AGI I heard the same thing from
CEO of anthropic dip mine so maybe we
are 2 years away which seems very
soon uh given we don't have a working
safety mechanism in place or even a
prototype for one and there are people
trying to accelerate those timelines
because they feel we're not getting
there quick enough but what do you think
they mean when they say AGI so the
definitions we used to have when people
are modifying a little bit lately
artificial general intelligence was a
system capable of performing in any
domain a human could perform so kind of
you creating this average artificial
person they can do cognitive labor
physical labor where you can get another
human to do it superintelligence was
defined as a system which is superior to
All Humans in all domains now people are
starting to refer to AGI as if it's
super intelligence I made a post
recently where I argued for me at least
if you average out over all the common
human tasks those systems are already
smarter than an average human mhm so
under that definition we we have it
Shane L has this definition of where
you're trying to win in all domains
that's what intelligence is now are they
smarter than Elite individuals in
certain domains of course not they're
not there yet but uh the progress is
exponential see I'm much more concerned
about social
engineering so
to me ai's ability to do something in
the physical world like the the lowest
hanging fruit this the easiest set of
methods is by just getting humans to do
it it's going to be much harder to to uh
be the kind of viruses that take over
the minds of
robots that where the robots are
executing the commands it just seems
like humans social engineering of humans
is much more likely that would be enough
to bootst the whole
process okay just to linger on the term
AGI what's what to you is the difference
between AGI and human level intelligence
uh human level is General in the domain
of expertise of humans we know how to do
human things I don't speak dog language
I should be able to pick it up if I'm a
general intelligence it's kind of
inferior animal I should be able to
learn that skill but I can't at general
intelligence truly Universal general
intelligence should be able to do things
like that humans cannot do to be able to
talk to animals for example to solve
pattern recognition problems of that
type to
do of similar things outside of
our domain of expertise because it's
just not the world
will if we just look at the space of
cognitive abilities we have I just would
love to understand what the limits are
Beyond which an AGI system can reach
like what does that look like what about
about
actual mathematical thinking or uh
scientific
innovation that kind of stuff we know
calculators are smarter than humans in
that narrow domain of addition but is it
humans plus tools versus AGI or just
human raw human intelligence cu cu
humans create tools and with the tools
they become more intelligent so like
there there's a gray area there what it
means to be human when we're measuring
their intelligence so when I think about
it I usually think human with like a
paper and a pencil not human with
internet and anava AI helping but is
that a fair way to think about it cuz
isn't there another definition of human
level intelligence that includes the
tools that humans create but we create
AI so at any point you'll still just add
super intelligence to human capability
that seems like
cheating no controllable tools there is
there is an implied leap that you're
making when AGI goes from tool to uh
entity that can make its own decisions
so if we Define human level intelligence
as everything a human can do with fully
controllable tools it seems like a
hybrid of some kind you're now doing
brain computer interfaces you connecting
it to maybe narrow AI yeah it definitely
increases our
capabilities so what's a good test to
you that uh measures whether uh an
artificial intelligence system has
reached human level intelligence and was
a good test where it has superseded
human level intelligence to reach that
land of AGI I am oldfashioned I like
tting test I have a paper where I equate
passing touring test to solving AI
complete problems because you can encode
any questions about any domain into the
touring test you don't have to talk
about how was your day you can ask
anything and so the system has to be as
smart as a human to pass it in a true
sense but then you would extend that to
U maybe a very long conversation like I
think the Alexa prize was doing that
basically can you do a 20 minute 30
minute conversation with an ass system
it has to be long enough to where you
can make some meaningful decisions about
capabilities absolutely you can Brute
Force very short
conversations so like literally what
does that look like can we do uh can we
construct formally a kind of test that
tests for AGI for AGI it has to be there
I cannot give it a task I can give to a
human and it cannot do it if a human can
for super intelligent it would be
superior on all such tasks not just
average performance so like go learn to
drive car go speak Chinese play guitar
okay great I guess the the following
question is there a test for the kind of
AGI that
would
be uh susceptible to lead to srisk or X
risk susceptible to destroy human
civilization like is there a test for
that you can develop a test which will
give you positives if it lies to you or
has those ideas you cannot develop a
test which rules them out there is
always possibility of what bom calls a
treacherous turn where later on a system
decides for game theoretic reasons
economic reasons to change its behavior
and we see the same with humans it's not
unique to AI for Millennia we tried
developing morals ethics religions uh
light detector tests and then employees
betray the employer spouses betray
family it's a pretty standard thing
intelligent agents sometimes do so is it
is it possible to detect when a AI
system is lying or deceiving you if you
know the truth and it tells you
something false you can detect that but
you cannot know in general every single
time and again the system you're testing
today may not be lying the system you're
testing today may know you are testing
it and so behaving and later on after it
interacts with the environment interacts
with other systems malevolent agents
learns more it may start doing those
things so do you think it's possible to
develop a system where the creators of
the system the developers the program
rers don't know that it's deceiving them
so systems today don't have long-term
planning that is not out they can lie
today if it
optimizes helps them optimize the reward
if they realize okay this human will be
very happy if I tell them the following
they will do it if it brings them more
points and they don't have to kind of
keep track of it it's just the right
answer to this problem every single time
at which point is somebody creating that
intentionally not unintentionally
intentionally creating an AI system
that's doing long-term planning with an
objective function that's defined by the
AI system not by a human well some
people think that if they're that smart
they always good they really do believe
that it's just benevolence from
intelligence so they'll always want
what's best for us some people think
that uh they will be able to detect
problem behaviors and correct them at
the time when we get
there I don't think it's a good idea I
am strongly against it but yeah there
are quite a few people who in general
are so optimistic about this technology
it could do no wrong they want it
developed as soon as possible as capable
as possible so there's going to be
people who believe the more intelligent
it is the more benevolent and so
therefore it should be the one that
defines the objective function that it's
U optimizing when it's doing long-term
planning there are even people who say
okay what's so special about humans
right we removed the gender bias we're
removing race bias why is this pro-human
bias we are polluting the planet we are
as you said you know fight a lot of Wars
kind of violent maybe it's better if
this super intelligent perfect uh
Society comes and replaces us it's
normal stage in the evolution of our
species yeah so somebody says uh let's
develop an AI system that removes the
violent humans from the world and then
it turns out that all humans have
violence in them or the capacity for
violence and therefore all humans are
removed yeah yeah yeah let me ask about
uh Yan laon he's somebody who uh you've
had a few exchanges
with and he's somebody who actively
pushes back against this view that AI is
going to lead to destruction of uh human
civilization also known as uh Ai
dorismar and open source are the best
ways to understand and mitigate the
risks and two AI is not something that
just happens we build it we have agency
in what it becomes hence we control the
risks we meaning humans it's not some
sort of natural phenomena that uh we
have no control over so can you can you
make the case that he's right and can
you try to make the case that he's wrong
I cannot make a case that he's right
he's wrong in so many ways it's
difficult for me to remember all of them
uh he is a Facebook buddy so I have a
lot of fun uh having those little
debates with him so I'm trying to
remember the arguments so one he he says
we are not gifted to this intelligence
from Aliens we are designing it we are
making decisions about it that's not
true it was true then we had expert
systems symbolic AI decision threes
today you set up parameters for a model
and you water this plant you give it
data you give it compute and it grows
and after it's finished growing into
this alien plant you start testing it to
find out what capabilities it has and it
takes years to figure out even for
existing models if it's Str for 6 months
it will take you 2 3 years to figure out
basic capabilities of that system we
still discover new capabilities in
systems which are already out there so
that's that's not the case so just to
linger on that to you the difference
there that there is some level of
emergent intelligence that happens in
our current
approaches so stuff that we don't
hardcode in absolutely that's what makes
it so successful then we had to
painstakingly hardcode in everything
we didn't have much progress now just
spend more money and more compute and
it's a lot more capable and then the
question is when there is emergent
intelligent
phenomena what is the ceiling of that
for you there's no ceiling for uh for
Yan laon I think there's a kind of
ceiling that happens that we have full
control over even if we don't understand
the internals of the emergence how the
emergence happens there's a sense that
we have control and understanding of the
approximate ceiling of capability the
limits of the capability let's say there
is a ceiling it's not guaranteed to be
at a level which is competitive with us
it may be greatly Superior to ours so
what
about his statement about open research
and open source are the best ways to
understand and mitigate the risks
historically he's completely right open
source software is wonderful it's tested
by the community it's de
but we're switching from tools to agents
now you're giving open source weapons to
Psychopaths do we want to open source
nuclear weapons biological weapons it's
not safe to give technology so powerful
to those who may misalign it even if you
are successful at somehow getting it to
work in the first place in a friendly
manner but the difference with nuclear
weapons current AI systems are not akin
to nuclear weapons so the idea there is
you're open sourcing it at this stage
that you can understand it better large
large number of people can explore the
limitation the capabilities explore the
possible ways to keep it safe to keep uh
it secure all that kind of stuff while
it's not at the stage of nuclear weapons
so nuclear weapons there's a no nuclear
weapon and then there's a nuclear weapon
with AI systems there's a gradual
Improvement of capability and you get
to uh perform that Improvement
incrementally and so open source allows
you to study
uh how things go wrong I study the the
very process of emergence study AI
safety on those systems when there's not
a high level of danger all that kind of
stuff it also sets a very wrong
precedence so we open sourced model one
model two model three nothing ever bad
happened so obviously we're going to do
it with model four it's just gradual
Improvement I I don't think it always
works with the precedent like you're not
stuck doing it the
way you always did it just uh it's
that's a precedent of open research and
open development such that we get to
learn together and then the first time
there's a sign of
danger some dramatic thing happen not a
thing that destroys human civilization
but some dramatic demonstration of
capability that can legitimately lead to
a lot of damage then everybody wakes up
and says okay we need to regulate this
we need to come up with safety mechanism
that stops this right but at this time
maybe can educate me but I haven't seen
any illustration of significant damage
done by intelligent AI systems so I have
a paper which collects accidents through
history of AI and they always are
proportionate to capabilities of that
system so if you have Tic Tac to playing
AI it will fail to properly play and
lose the game which it should draw
trivial your spell checker will be
spellward so on uh I stopped collecting
those because there are just too many
examples of AI failing at what they are
capable of we haven't had terrible
accidents in a sense of billion people
got killed absolutely true but in
another paper I argue that those
accidents do not actually prevent people
from continuing with research and
actually they kind of serve like
vaccines a vaccine makes your body a
little bit sick so you can handle the
big disease later much better it's the
same here people will point out you know
that accident AI accident we had where
12 people died everyone's still here 12
people is less than smoking kills it's
not a big deal so we continue so in a
way it will actually be kind of
confirming that it's not that bad it
matters how the deaths happen whether
it's literally Murder By thei system
then one is a problem but if it's
accidents because of increased Reliance
on automation for example so when uh
airplanes are flying in an automated way
maybe the number of plane crashes
increased by 177% or something and then
you're like okay do we really want to
rely on automation I think in a case of
automation airplanes it decrease
significantly okay same thing with
autonomous vehicles like okay uh what
are the pros and cons what are the W
with the trade-offs here you can have
that discussion in an honest way but I
think the kind of things we're talking
about here is mass
scale pain and
suffering caused by AI systems and I
think we need to see illustrations of
that on a very small scale to start to
understand that this is really damaging
versus clippy versus a tool that's
really useful to a lot of people to do
learning to do um summarization of text
to do question answer all that kind of
stuff to generate videos a tool
fundamentally a tool versus an agent
that can do a lot a huge amount of
damage so you bring up example of cars
yes cars were slowly developed and
integrated if we had no cars and
somebody came around and said I invented
this thing it's called cars it's awesome
it kills like a 100,000 Americans every
year let's deploy it m would we deploy
that there's been fear mongering about
cars for a long time from the the the
transition from horses cars there's a
there's a really nice channnel that I
recommend people check out pessimist
archive that documents all the fear
mongering about technology that's
happened throughout history there's
definitely been a lot of fear-mongering
about cars there's a transition period
there about cars about how deadly they
are we can try it took a very long time
for cars to proliferate to the degree
they have now and then you could ask
serious questions uh in terms of the
miles traveled the benefit to the
economy the benefit to the quality of
life that cars do versus the number of
deaths 30 40,000 in the United States
are we willing to pay that price I think
most people when they're rationally
thinking policy makers will say
yes it's we want to
decrease it from 40,000 to zero and do
everything we can to decrease it there's
all kinds of policies incentives you can
create to decrease the risks uh with the
uh deployment of Technology but then you
have to weigh the benefits and the risk
the technology and the same thing would
be done with with with AI you need data
you need to know but if I'm right and
it's unpredictable unexplainable
uncontrollable you cannot make this
decision we're gaining $10 trillion of
wealth but we're losing we don't know
how many people uh you basically have to
perform an experiment on 8 billion
humans without their consent and even if
they want to give you consent they can't
because they cannot give informed
consent they don't understand those
things
right that happens when you do when you
go from the predictable to the
unpredictable very
quickly you just uh but it's not obvious
to me that AI systems would gain
capability so quickly that you won't be
able to collect enough data to study the
sa the benefits and
risks we literally doing it the previous
model we learned about after we finish
training it what it was capable of let's
say we stopped GPT for training run
around human cap capability
hypothetically we start training GPT 5
and I have no knowledge of Insider
training runs or anything and we started
that point of about human and we train
it for the next 9 months maybe 2 months
in it becomes super intelligent we
continue training it at the time when we
start uh testing it it is already a
dangerous system how dangerous I have no
idea but neither people training it at
the training stage but then there's a
testing stage mhm inside the company
they can start getting intuition about
what the system is capable to do you're
saying that somehow from leap from GPT 4
to GPT 5 can
happen the kind of leap where GPT 4 was
controllable in GPT 5 is no longer
controllable and we get no insights from
using GPT 4 about the fact that GPT 5
will be
uncontrollable like that's the that's
the situation you're concerned about
where there leap from n to n plus one
would be such that uncontrollable system
is created
without
any ability for us to anticipate that if
we had capability of ahead of the run
before the training run to register
exactly what capabilities that next
model will have at the end of a training
run and we accurately guessed all of
them I would say you're right we can
definitely go ahead with this run we
don't have that capability from gp4 you
can build up intuition about what GPT 5
will be capable of it's just incremental
progress MH even if that's a big leap in
capability it just doesn't seem like you
can take a leap from a system that's uh
helping you write emails to a system
that's going to destroy human
civilization it seems like it's always
going to be sufficiently incremental
such that we can anticipate the possible
dangers and we're not even talking about
existential risks but just the the kind
of damage can do to civilization it
seems like we'll be able to anticipate
the kinds not the exact but the kinds of
uh risks it might lead to and then
rapidly develop defenses ahead of time
and as the risks emerge we're not
talking just about capabilities specific
tasks we're talking about General
capability to learn maybe like a child
at the time of testing and deployment it
is still not extremely capable but as it
is exposed to more data real world it
can be trained to become much more
dangerous and capable so let's let's
focus then on the control
problem at which point does the system
become
uncontrollable why is it the more likely
trajectory for you that the system
becomes
uncontrollable so I think at some point
it becomes capable of getting out of
control for game theoretic reasons it
may decide not to do anything right away
and for a long time just collect more
resources
accumulate strategic Advantage right
away it may be kind of still young weak
super intelligence give it a decade it's
in charge of a lot more resources it had
time to make backups so it's not obvious
to me that it will strike as soon as it
can can we just try to imagine this
future with there's an AI system that's
capable of
uh escaping in control of humans and
then doesn't and waits what's that look
like so one we have to rely on that
system for a lot of the infrastructure
so we have to give it access not just to
the internet but to the task of
managing uh Power government economy
this kind of stuff so and that just
feels like a gradual process given the
bureaucracies of all those systems
involved we've been doing it for years
software controls all those systems
nuclear power plants airline industry
it's all software based every time there
is electrical outage I can't fly
anywhere for days but there's a
difference between
software and
AI there's different kinds of software
so to give a single AI system access to
the control of Airlines and the control
of the
economy that's not a that's not a
trivial transition for Humanity no but
if it shows it is safer in fact fact
then it's in control we get better
results people will demand that it put
in place and if not it can hack the
system it can use social engineering to
get access to it that's why I said it
might take some time for it to
accumulate those resources it just feels
like that would take a long time for
either humans to trust it or for the
social engineering to come into play
like it's not a thing that happens
overnight it feels like something that
happens across one or two decades I
really hope you're right but it's not
what I'm seeing people are very quick to
jump on a latest Trend early adopters
will be there before it's even deployed
buying prototypes maybe the social
engineering I can see because so for
social engineering AI systems don't need
any hardware access they just it's all
software so they can start manipulating
you through social media so on like you
have ai assistants they're going to help
you do a lot of manage a lot of your
day-to-day and then they start doing
social engineering but like for a system
that's so capable that is can escape the
control of humans that created it such a
system being deployed at a mass
scale and trusted by people to be
deployed it feels like that would take a
lot of
convincing so we've been deploying
systems which had hidden
capabilities can you give an example gp4
I don't know what else is capable of but
there are still things we haven't
discovered can do there may be trial
proportional to his capability I don't
know it writes Chinese poetry
hypothetical I know it does but we
haven't tested for all possible
capabilities and we are not explicitly
designing them MH we can only rule out
bugs we find we cannot rule out bugs and
capabilities because we haven't found
them is it possible for a system to have
hidden
capabilities that are orders a magnitude
greater than its non-hidden
capabilities this is the thing I'm
really struggling with where on the
surface the thing we understand it can
do doesn't seem that harmful so if even
if it has bugs even if it has hidden
capabilities like Chinese poetry or
generating effective
viruses uh software
viruses the damage that can do seems
like on the same order of magnitude as
it's
uh the the capabilities that we know
about so like this this idea that the
hidden capabilities will include being
uncontrollable this is something I'm
struggling with cuz GPT 4 on the surface
seems to be very controllable again we
can only ask and test for things we know
about if there are unknown unknowns we
cannot do it I'm thinking of human
statistics of an right if you talk to a
person like that you may not even
realize they can multiply 20 digit
number numbers in their head you have to
know to
ask so as I mentioned just to sort of
Linger on
the the fear of the
unknown so the pessimist archive has
just documented let's look at data of
the past at history there's been a lot
of fearmongering about technology
pessimist archive does a really good job
of documenting how crazily afraid we are
of every piece of technology we've been
afraid there's a blog post where anlo
who created pessimus archive writes
about the fact that we've been uh
fear-mongering about robots and
automation for for over 100 years so why
is Agi different than the kinds of
Technologies we've been afraid of in the
past so two things one we switching from
tools to agents tools don't have
negative or positive impact people using
tools do so guns don't kill people with
guns do agents can make their own
decisions they can be positive or
negative a pitbull can decide to harm
you it's an agent the fears are the same
the only difference is now we have this
technology then they were afraid of
humano robots 100 years ago they had
none today every major company in the
world is investing billions to create
them not every but you understand what
I'm saying yes it's very different well
agents
uh it depends on what you mean by the
word agents the all those companies are
not investing in a system that has the
kind of
agency that's implied by in the fears
where it can really make decisions on
their own that have no human in the
loop they are saying they're building
super intelligence and have a super
alignment team you don't think they're
trying to create a system smart enough
to be an independent agent under that
definition I have not seen evidence of
it I I think a lot of it is marketing
uh is is a is a marketing kind of
discussion about the future and it's a
it's a mission about the kind of systems
we can create in the long-term future
but in the short term the kind of
systems they're creating Falls fully
within the definition of narrow AI these
are tools that have increasing
capabilities but they're just don't have
a sense of agency or Consciousness or
self-awareness or ability to deceive at
Scales that would require would be
required to do like Mass scale suffering
and murder of humans those systems are
well beyond Naro AI if you had to list
all the capabilities of GPT 4 you would
spend a lot of time writing that list
but agency is not one of them not yet
but do you think any of those companies
are holding back because they think it
may be not safe or are they developing
the most capable system they can give
the resources and hoping they can
control and
monetize control and monetize hoping
they can control and monetize so you're
saying if they could press a button and
create an
agent that they no longer control that
they can have to ask
nicely a thing that's lives on a server
across huge number of uh
computers you're saying that they would
uh push for the creation of that kind of
system I mean I can't speak for other
people for all of them I think some of
them are very ambitious they fundraising
trillions they talk about controlling
the light corn of the Universe I would
guess that they
might well that's a human question
whether humans are capable of that
probably some humans are capable of that
my more direct question if it's possible
to create such a
system have a system that has that level
of
agency I I don't think that's an easy
technical
challenge we're not it doesn't I feel
like we're close to that A system that
has the kind of agency where it can make
its own decisions and deceive everybody
about them the current
architecture we have in machine learning
and how we train the systems how deploy
the systems and all that it just doesn't
seem to support that kind of agency I
really hope you are right uh I think the
scaling hypothesis is correct we haven't
seen diminishing returns it used to be
we asked how long before AGI now we
should ask how much until AGI it's
trillion dollars today it's a billion
dollars next year it's a million dollar
in a few years don't you think it's
possible basically run out of
trillions so is this constrained by
compute compute gets cheaper every day
exponentially but then then that becomes
a question of decades versus years if
the only disagreement is that it will
take decades not years for everything
I'm saying to materialize then I can go
with that
but if it takes decades then uh the
development of tools for AI
safety uh becomes more and more
realistic so I guess the question
is I have a fundamental belief that
humans when faced with danger can come
up with ways to defend defend against
that
danger and one of the big problems
facing AI safety currently for me is
that there's not clear illustrations of
what that danger looks
like there's no illustrations of AI
systems doing a lot of damage and so
it's unclear what you're defending
against because currently it's a
philosophical Notions that yes it's
possible to imagine AI systems that take
control of everything and Destroy All
Humans it's also a more formal
mathematical notion that you talk about
that it's impossible to have a perfectly
secure system you can't you can't prove
that a program of sufficient complexity
is uh completely safe and and perfect
and you know everything about it yes but
like when you actually just
pragmatically look how much damage have
the AI systems done and what kind of
damage there's not been illustrations of
that even in autonomous weapon
systems there's not been mass
deployments of autonomous weapon systems
luckily um the Automation in war
currently is very
limited the that the automation is at
the scale of individuals versus like at
the scale of strategy and planning so I
think one of the challenges here is like
where is the
dangers uh and the intuition that yam
Lun and others have is let's keep in the
open building AI systems until the
dangers start rearing their
heads and they become more explicit
there there start being uh case studies
illustrative uh case studies that show
exactly how the damage by as systems is
done then regulation can step in then
brilliant Engineers can step up and we
can have Manhattan style projects that
defend against such systems that's kind
of the no the
notion and I guess attention with that
is the idea that for you we need to be
thinking about that now so that we're
we're ready because we we'll have not
much time once the systems are
deployed is that true so there is a lot
to unpack here uh there is a partnership
on AI a conglomerate of many large
corporations they have a database of AI
accidents they collect I contributed a
lot to that database if we so far made
almost no progress in actually solving
this problem not patching it not again
lipstick and a p kind of
solutions why would we think we'll do
better than we closer to the
problem uh all the things you mentioned
are serious concerns measuring the
amount of harm so benefit versus risk
there is is difficult but to you the
sense is already the risk has superseded
the benefit again I I want to be
perfectly clear I love AI I love
technology I'm a computer scientist I
have PhD in engineering I work at an
engineering school there is a huge
difference between we need to develop
narrow AI systems super intelligent in
solving specific human problems like
protein folding and let's create super
intelligent machine G and will decide
what to do with us yeah those not the
same I am against the super intelligence
in general sense with No undo button do
you think the teams that are doing
they're able to do the AI safety on the
the kind of narrow AI
risks that you've
mentioned are those approaches going to
be at all productive towards leading to
approaches of doing AI safety on
AGI or is it just a fundamentally
different partially but they don't scale
for narrow AI for deterministic systems
you can test them you have edge cases
you know what the answer should look
like you know the right answers for
General systems you have infinite test
surface you have no edge cases you
cannot even know what to test for again
the unknown unknowns a
Resume
Read
file updated 2026-02-14 19:01:38 UTC
Categories
Manage