Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity | Lex Fridman Podcast #452
ugvHCXCOmm4 • 2024-11-11
if you extrapolate the curves that we've
had so far right if if you say well I
don't know we're starting to get to like
PhD level and and last year we were at
undergraduate level and the year before
we were at like the level of a high
school student again you can you can
quibble with at what tasks and for what
we're still missing modalities but those
are being added like computer use was
added like image generation has been
added if you just kind of like eyeball
the rate at which these capabilities are
increasing it does make you think that
we'll get there by 2026 or 2027 I think
there are still worlds where it doesn't happen in 100 years, those worlds, the
number of those worlds is rapidly
decreasing we are rapidly running out of
truly convincing blockers truly
compelling reasons why this will not
happen in the next few years the scale
up is very quick like we we do this
today we make a model and then we deploy
thousands maybe tens of thousands of
instances of it I think by the time you
know certainly within two to three years
whether we have these super powerful AIs or not, clusters are going to get to the size
where you'll be able to deploy millions
of these I am optimistic about meaning I
worry about economics and the
concentration of power that's actually
what I worry about more the abuse of
power and AI increases the amount of
power in the world and if you
concentrate that power and abuse that
power it can do immeasurable damage yes
it's very frightening it's very it's
very
frightening. The following is a conversation with Dario Amodei, CEO of Anthropic, the company that created Claude, that is currently and often at the top of most LLM benchmark leaderboards. On top of that, Dario and the Anthropic team have been outspoken
advocates for taking the topic of AI
safety very seriously and they have
continued to publish a lot of
fascinating AI research on this and
other topics I'm also joined afterwards
by two other brilliant people from
Anthropic. First, Amanda Askell, who is a
researcher working on alignment and
fine-tuning of Claude including the
design of claude's character and
personality a few folks told me she has
probably talked with Claude more than
any human at anthropic so she was
definitely a fascinating person to talk
to about prompt engineering and
practical advice on how to get the best
out of Claude. After that, Chris Olah stopped by for a chat. He's one of the pioneers of
the field of mechanistic
interpretability which is an exciting
set of efforts that aims to reverse
engineer neural networks to figure out
what's going on inside inferring
behaviors from neural activation
patterns inside the network this is a
very promising approach for keeping
future super intelligent AI systems safe
for example by detecting from the
activations when the model is trying to
deceive the human it is talking
to. This is the Lex Fridman podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Dario Amodei. Let's start with a big idea of
scaling laws and the scaling hypothesis
what is it what is its history and where
do we stand today so I can only describe
it as it you know as it relates to kind
of my own experience but I've been in
the AI field for about uh 10 years and
it was something I noticed very early on
so I first joined the AI world when I
was working at Baidu with Andrew Ng in late 2014 which is almost exactly 10
years ago now and the first thing we
worked on was speech recognition systems
and in those days I think deep learning
was a new thing it had made lots of
progress but everyone was always saying
we don't have the algorithms we need to
succeed you know we we we we're we're
not we're only matching a tiny tiny
fraction there's so much we need to kind
of discover algorithmically we haven't
found the picture of how to match the
human brain uh and when you know in some
ways was fortunate I was kind of you
know you can have almost beginner's luck
right I was like a a newcomer to the
field and you know I looked at the
neural net that we were using for speech
the recurrent neural networks and I said
I don't know what if you make them
bigger and give them more layers and
what if you scale up the data along with
this right I just saw these as as like
independent dials that you could turn
and I noticed that the model started to
do better and better as you gave them
more data as you as you made the models
larger as you trained them for longer um
and I I didn't measure things precisely
in those days but but along with with
colleagues we very much got the informal
sense that the more data and the more
compute and the more training you put
into these models the better they
perform and so initially my thinking was
hey maybe that is just true for speech
recognition systems right maybe maybe
that's just one particular quirk one
particular area I think it wasn't until
2017 when I first saw the results from
GPT-1 that it clicked for me that
language is probably the area in which
we can do this we can get trillions of
words of language data we can train on
them and the models we were training in
those days were tiny you could train
them on one to eight gpus whereas you
know now we train jobs on tens of
thousands soon going to hundreds of
thousands of gpus and so when I when I
saw those two things together um and you
know there were a few people like Ilya Sutskever, who you've interviewed, who had somewhat similar views, right, he might have been the first one, although I think a few people came to similar views around the same time, right, there was, you know, Rich Sutton's bitter lesson, there was Gwern who wrote about the scaling
hypothesis but I think somewhere between
2014 and 2017 was when it really clicked
for me when I really got conviction that
hey we're going to be able to do these
incredibly wide cognitive tasks if we
just if we just scale up the models and
at at every stage of scaling there are
always arguments and you know when I
first heard them honestly I thought
probably I'm the one who's wrong and you
know all these all these experts in the
field are right they know the situation
better better than I do right there's
you know the Chomsky argument about like
you can get syntactics but you can't get
semantics there's this idea oh you can
make a sentence make sense but you can't
make a paragraph makes sense the latest
one we have today is uh you know we're
going to run out of data or the data
isn't high quality enough or models
can't reason and and each time every
time we manage to we manage to either
find a way around or scaling just is the
way around um sometimes it's one
sometimes it's the other uh and and so
I'm now at this point I I I still think
you know it's it's it's always quite
uncertain we have nothing but inductive
inference to tell us that the next few years are going to be like the last 10 years, but I've
seen the movie enough times I've seen
the story happen for for enough times to
to really believe that probably the
scaling is going to continue and that
there's some magic to it that we haven't
really explained on a theoretical basis
yet and of course the scaling here is
bigger networks bigger data bigger
compute yes all in in particular linear
scaling up of bigger networks bigger
training times and uh more and and more
data uh so all of these things almost
like a chemical reaction you know you
have three ingredients in the chemical
reaction and you need to linearly scale
up the three ingredients if you scale up
one not the others you run out of the
other reagents and and the reaction
stops but if you scale up everything
everything in series then then the
reaction can proceed and of course now
that you have this kind of empirical science/art you can apply it to
other uh more nuanced things like
scaling laws applied to interpretability
or scaling laws applied to posttraining
or just seeing how does this thing scale
but the big scaling law I guess the
underlying scaling hypothesis has to do
with big networks Big Data leads to
intelligence yeah we've we've documented
scaling laws in lots of domains other
than language right so uh initially the
the paper we did that first showed it
was in early 2020 where we first showed
it for language there was then some work
late in 2020 where we showed the same
thing for other modalities like images
video
text to image image to text math they
all had the same pattern and and you're
right now there are other stages like
posttraining or there are new types of
reasoning models, and in all of those cases that we've measured, we see similar types of scaling laws.
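As a rough sketch of the power-law form those scaling-law papers report (the exact constants and exponents depend on the fit; the original language-model fits had exponents roughly on the order of 0.05 to 0.1), the loss L falls off smoothly as a power law in parameter count N and dataset size D:

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}

which is why scaling everything up together keeps improving the models, though with diminishing returns at any fixed budget.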
A bit of a philosophical question, but
what's your intuition about why bigger
is better in terms of network size and
data size why does it lead to more
intelligent models so in my previous
career as a as a biophysicist so I did
physics undergrad and then biophysics in
in in in grad school so I think back to
what I know as a physicist which is
actually much less than what some of my
colleagues at anthropic have in terms of
in terms of expertise in physics uh
there's this there's this concept called
the 1/f noise and 1/x distributions, um, where often
you know just just like if you add up a
bunch of natural processes you get
gaussian if you add up a bunch of kind
of differently distributed natural
processes if you like if you like take a
take a um probe and and hook it up to a
resistor the distribution of the thermal
noise in the resistor goes as one over
the frequency um it's some kind of
natural convergent distribution uh and
and I I I I and and I think what it
amounts to is that if you look at a lot
of things that are that are produced by
some natural process that has a lot of
different scales right not a gaussian
which is kind of narrowly distributed
but you know if I look at kind of like
large and small fluctuations that lead
to lead to electrical noise um they have
this decaying 1/x distribution and
so now I think of like patterns in the
physical world right if I if or or in
language if I think about the patterns
in language there are some really simple
patterns some words are much more common
than others, like 'the', then there's basic
noun verb structure then there's the
fact that you know you know nouns and
verbs have to agree they have to
coordinate and there's the higher level
sentence structure then there's the
Thematic structure of paragraphs and so
the fact that there's this regressing
structure you can imagine that as you
make the networks larger first they
capture the really simple correlations
the really simple patterns and there's
this long tail of other patterns and if that long tail of other patterns is
really smooth like it is with the one
over F noise in you know physical
processes like like like resistors then
you could imagine as you make the
network larger it's kind of capturing
more and more of that distribution and
so that smoothness gets reflected in how
well the models are at predicting and
how well they perform language is an
evolved process right we've we've
developed language we have common words
and less common words we have common
expressions and less common Expressions
we have ideas cliches that are expressed
frequently and we have novel ideas and
that process has has developed has
evolved with humans over millions of
years and so the the the guess and this
is pure speculation would be would be
that there is there's some kind of
longtail distribution of of of the
distribution of these ideas so there's
the long tail but also there's the
height of the hierarchy of Concepts that
you're building up so the bigger the
network presumably you have a higher
capacity to exactly if you have a small
Network you only get the common stuff
right if if I take a tiny neural network
it's very good at understanding that you
know a sentence has to have you know
verb adjective noun right but it's it's
terrible at deciding what those verb
adjective and noun should be and whether
they should make sense if I make it just
a little bigger it gets good at that
then suddenly it's good at the sentences
but it's not good at the paragraphs and
so these rare and more complex patterns get picked up as I add more capacity to the network.
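A toy numerical illustration of that long-tail picture (purely hypothetical numbers, just to show the shape of the argument): if pattern frequencies decay like 1/x, a model that only captures the most common patterns already covers a lot of the distribution, and each big jump in capacity keeps buying a meaningful extra slice of the tail.

import numpy as np

# Toy 1/x-style ("Zipf-like") distribution over a million hypothetical patterns.
ranks = np.arange(1, 1_000_001)
freq = 1.0 / ranks
freq /= freq.sum()

# How much of the distribution is covered if a model only captures the top-k patterns?
for k in (100, 10_000, 1_000_000):
    print(f"top {k:>9,} patterns cover {freq[:k].sum():.0%} of total frequency")
# Roughly 36%, 68%, 100%: the tail is smooth, so more capacity keeps helping.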
Well, the natural question then is, what's the
ceiling of this like how complicated and
complex is the real world how much of
stuff is there to learn I don't think
any of us knows the answer to that
question um I my strong Instinct would
be that there's no ceiling below the level
of humans right we humans are able to
understand these various patterns and so
that that makes me think that if we
continue to you know scale up these
these these models to kind of develop
new methods for training them and
scaling them up uh that will at least
get to the level that we've gotten to
with humans there's then a question of
you know how much more is it possible to
understand than humans do how much how
much is it possible to be smarter and
more perceptive than humans I I would
guess the answer has has got to be
domain dependent if I look at an area
like biology and you know I wrote this
essay Machines of Loving Grace it seems
to me that humans are struggling to
understand the complexity of biology
right if you go to Stanford or to
Harvard or to Berkeley you have whole
Departments of you know folks trying to
study you know like the immune system or
metabolic pathways and and each person
understands only a tiny part of it,
specializes and they're struggling to
combine their knowledge with that of
with that of other humans and so I have
an instinct that there's there's a lot
of room at the top for AIS to get
smarter if I think of something like
materials in the in the physical world
or you know um like addressing you know
conflicts between humans or something
like that I mean you know it it may be
there's only some of these problems are
not intractable but much harder and and
it it may be that there's only there's
only so well you can do with some of
these things right just like with speech
recognition there's only so clear I can
hear your speech so I think in some
areas there may be ceilings in in in you
know that are very close to what humans
have done in other areas those ceilings
may be very far away and I think we'll
only find out when we build these
systems uh there's it's very hard to
know in advance we can speculate but we
can't be sure and in some domains the
ceiling might have to do with human
bureaucracies and things like this as
you're right about yes so humans
fundamentally have to be part of the
loop that's the cause of the ceiling not
maybe the limits of the intelligence
yeah I think in many cases um you know
in theory technology could change very
fast for example all the things that we
might invent with respect to biology um
but remember there's there's a you know
there's a clinical trial system that we
have to go through to actually
administer these things to humans I
think that's a mixture of things that
are unnecessary and bureaucratic and
things that kind of protect the
Integrity of society and the whole
challenge is that it's hard to tell it's
hard to tell what's going on uh it's
hard to tell which is which right my my
view is definitely I think in terms of
drug development we my view is that
we're too slow and we're too
conservative but certainly if you get
these things wrong you know it's it's
possible to to to risk people's lives by
by being by being by being too Reckless
and so at least at least some of these
human institutions are in fact
protecting people so it's it's all about
finding the balance I strongly suspect
that balance is kind of more on the side
of pushing to make things happen faster
but there is a balance if we do hit a
limit if we do hit a Slowdown in the
scaling laws what do you think would be
the reason is it compute limited data
limited uh is it something else idea
limited so a few things now we're
talking about hitting the limit before
we get to the level of of humans and the
skill of humans um so so I think one
that's you know one that's popular today
and I think you know could be a limit
that we run into I like most of the
limits I would bet against it but it's
definitely possible is we simply run out
of data there's only so much data on the
internet and there's issues with the
quality of the data right you can get
hundreds of trillions of words on the
internet but a lot of it is is
repetitive or it's search engine you
know search engine optimization drivel
maybe in the future it'll even be text
generated by AIS itself uh and and so I
think there are limits to what to to
what can be produced in this way that
said we and I would guess other
companies are working on ways to make
data synthetic uh where you can you know
you can use the model to generate more
data of the type that you have that you
have already or even generate data from
scratch if you think about uh what was
done with DeepMind's AlphaGo Zero,
they managed to get a bot all the way
from you know no ability to play Go
whatsoever to above human level just by
playing against itself there was no
example data from humans required in the AlphaGo Zero version of it. The other
direction of course is these reasoning
models that do Chain of Thought and stop
to think um and and reflect on their own
thinking in a way that's another kind of
synthetic data coupled with
reinforcement learning so my my guess is
with one of those methods we'll get
around the data limitation or there may
be other sources of data that are that
are available um we could just observe
that even if there's no problem with
data as we start to scale models up they
just stop getting better it's it seemed
to be a a reliable observation that
they've gotten better that could just
stop at some point for a reason we don't
understand um the answer could be that
we need to uh you know we need to invent
some new architecture um it's been there
have been problems in the past with with
say numerical stability of models where
it looked like things were were leveling
off but but actually you know know when
we when we when we found the right
Unblocker they didn't end up doing so so
perhaps there's new some new
optimization method or some new uh
Technique we need to to unblock things
I've seen no evidence of that so far but
if things were to to slow down that
perhaps could be one reason what about
the limits of compute meaning uh the
expensive uh nature of building bigger
and bigger data centers so right now I
think uh you know most of the Frontier
Model companies I would guess are are
operating you know roughly you know $1
billion scale plus or minus a factor of
three right those are the models that
exist now or are being trained now uh I
think next year we're going to go to a
few billion and then uh 2026 we may go
to uh uh you know above 10 10 10 billion
and probably by 2027 there are ambitions to build hundred-billion-dollar clusters, and I
think all of that actually will happen
there's a lot of determination to build
the compute to do it within this country
uh and I would guess that it actually
does happen now if we get to 100 billion
that's still not enough compute that's
still not enough scale then either we
need even more scale or we need to
develop some way of doing it more
efficiently of Shifting The Curve um I
think be between all of these one of the
reasons I'm bullish about powerful AI
happening so fast is just that if you
extrapolate the next few points on the
curve we're very quickly getting towards
human level ability right some of the
new models that that we developed some
some reasoning models that have come
from other companies they're starting to
get to what I would call the PHD or
professional level right if you look at
their their coding ability um the latest
model we released, Sonnet 3.5, the new or updated version, it gets something like 50% on SWE-bench, and SWE-bench is an example
of a bunch of professional real world
software engineering tasks at the
beginning of the year I think the
state-of-the-art was three or 4% so in
10 months we've gone from 3% to 50% on
this task and I think in another year
we'll probably be at 90% I mean I don't
know but might might even be might even
be less than that uh we've seen similar
things in graduate level math physics
and biology from models like OpenAI o1,
uh so uh if we if we just continue to
extrapolate this right in terms of skill
skill that we have I think if we
extrapolate the straight curve Within a
few years we will get to these models
being you know above the the highest
professional level in terms of humans
now will that curve continue you've
pointed to and I've pointed to a lot of
reasons why you know possible reasons
why that might not happen but if the if
the extrapolation curve continues that
is the trajectory we're on so anthropic
has several competitors it'd be
interesting to get your sort of view of
it all open aai Google xai meta what
does it take to win in the broad sense
of win in the space yeah so I want to
separate out a couple things right so
you know Anthropic's mission is
to kind of try to make this all go well
right and and you know we have a theory
of change called race to the top right
race to the top is about trying to push
the other players to do the right thing
by setting an example it's not about
being the good guy it's about setting
things up so that all of us can be the
good guy I'll give a few examples of
this early in the history of anthropic
one of our co-founders Chris Olah, who I
believe you're you're interviewing soon
you know he's the co-founder of the
field of mechanistic interpretability
which is an attempt to understand what's
going on inside AI models uh so we had
him and one of our early teams focus on
this area of interpretability which we
think is good for making models safe and
transparent for three or four years that
had no commercial application whatsoever
it still doesn't today we're doing some
early betas with it and probably it will
eventually but uh you know this is a
very very long research bed in one in
which we've we've built in public and
shared our results publicly and and we
did this because you know we think it's
a way to make models safer an
interesting thing is that as we've done
this other companies have started doing
it as well in some cases because they've
been inspired by it in some cases
because they're worried that uh you know
if if other companies are doing this
that look more responsible they want to
look more responsible too no one wants
to look like the irresponsible
actor and and so they adopt this they
adopt this as well when folks come to
anthropic interpretability is often a
draw and I tell them the other places
you didn't go tell them why you came
here um and and then you see soon that
there that there's interpretability
teams else elsewhere as well and in a
way that takes away our competitive
Advantage because it's like oh they now
others are doing it as well but it's
good it's good for the broader system
and so we have to invent some new thing
that we're doing others aren't doing as
well and the hope is to basically bid up
bid up the importance of of of doing the
right thing and it's not it's not about
us in particular right it's not about
having one particular good guy other
companies can do this as well if they if
they if they join the race to do this
that's that's you know that's the best
news ever right um uh it's it's just
it's about kind of shaping the
incentives to point upward instead of
shaping the incentives to point to point
downward and we should say this example
the field of uh mechanistic
interpretability is just a a rigorous
non handwavy way of doing AI safety yes
or it's tending that way trying to I
mean I I think we're still early um in
terms of our ability to see things but
I've been surprised at how much we've
been able to look inside these systems
and understand what we see right unlike
with the scaling laws where it feels
like there's some you know law that's
driving these models to perform better
on on the inside the models aren't you
know there's no reason why they should
be designed for us to understand them
right they're designed to operate
they're designed to work just like the
human brain or human biochemistry
they're not designed for a human to open
up the hatch look inside and understand
them but we have found and you know you
can talk in much more detail about this
to Chris that when we open them up when
we do look inside them we we find things
that are surprisingly interesting and as
a side effect you also get to see the
beauty of these models you get to
explore the sort of beautiful nature of large neural networks through the mech interp kind of methodology. I'm amazed at how
clean it's been I I'm amazed at things
like induction heads I'm amazed at
things like uh you know that that we can
you know use sparse autoencoders to find
these directions within the networks uh
and that the directions correspond to
these very clear Concepts we
demonstrated this a bit with Golden Gate Claude, so this was an
experiment where we found a direction
inside one of the the neural network
layers that corresponded to the Golden
Gate Bridge and we just turned that way
up and so we we released this model as a
demo it was kind of half a joke uh for a
couple days uh but it was it was
illustrative of of the method we
developed and uh you could you could
take the Golden Gate you could take the
model you could ask it about anything
you know you know it would be like how
you could say how was your day and
anything you asked because this feature
was activated would connect to the
Golden Gate Bridge so it would say you
know I'm I'm I'm feeling relaxed and
expansive much like the the arches of
the Golden Gate Bridge or you know it
would masterfully change topic to the
Golden Gate Bridge and it integrated
there was also a sadness to it, to the focus it had on the Golden Gate Bridge. I
think people quickly fell in love with
it I think so people already miss it
because it was taken down I think after
a day somehow these interventions on the
model um where where where where you
kind of adjust Its Behavior somehow
emotionally made it seem more human than
any other version of the model. Strong personality, strong identity. Strong personality, it has these kind of like obsessive
interests you know we can all think of
someone who's like obsessed with
something, so it does make it feel somehow a bit more human.
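Mechanically, the kind of intervention being described can be sketched like this, a minimal illustration rather than Anthropic's actual code: it assumes you already have a feature direction (for example, a sparse-autoencoder decoder vector) for some layer, and it simply adds a scaled copy of that direction to the layer's output on every forward pass. The model and layer names in the usage comment are placeholders.

import torch

def add_steering_hook(model_layer, feature_direction, strength=10.0):
    # Add a scaled, unit-norm feature direction to one layer's hidden states.
    # Assumes the layer's output is the hidden states (or a tuple starting with
    # them), shaped [batch, seq, d_model], and feature_direction has size d_model.
    direction = feature_direction / feature_direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * direction.to(hidden.dtype).to(hidden.device)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return model_layer.register_forward_hook(hook)

# Usage sketch (placeholder names, not a real API):
# handle = add_steering_hook(model.layers[20], golden_gate_direction, strength=10.0)
# ... generate text; the feature stays "turned up" until handle.remove() ...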
Let's talk about the present, let's talk about Claude. So this year a lot has happened: in March, Claude 3 Opus, Sonnet, and Haiku were released; then Claude 3.5 Sonnet in July, with an updated version just now released; and then also Claude 3.5 Haiku was released.
okay can you explain the difference
between Opus Sonet and Haiku and how we
should think about the different
versions yeah so let's go back to March
when we first released uh these three
models so you know our thinking was you
different companies produce kind of
large and small models better and worse
models we felt that there was demand
both for a really powerful model um you
know and you that might be a little bit
slower that you'd have to pay more for
and also for fast cheap models that are
as smart as they can be for how fast and
cheap right whenever you want to do some
kind of like you know difficult analysis
like if I you know I want to write code
for instance or you know I want to I
want to brainstorm ideas or I want to do
creative writing I want the really
powerful model but then there's a lot of
practical applications in a business
sense where it's like I'm interacting
with a website I you know like I'm like
doing my taxes or I'm you know talking
to uh you know to like a legal adviser
and I want to analyze a contract or you
know we have plenty of companies that
are just like you know you know I want
to do autocomplete on my on my IDE or
something uh and and for all of those
things you want to act fast and you want
to use the model very broadly so we
wanted to serve that whole spectrum of
needs um so we ended up with this uh you
know this kind of poetry theme and so
what's a really short poem? It's a haiku, and so Haiku is the small fast cheap
model that is you know was at the time
was released surprisingly surprisingly
uh intelligent for how fast and cheap it
was. A sonnet is a medium-sized poem, right, a couple paragraphs, and so Sonnet was the middle model, it is smarter
but also a little bit slower a little
bit more expensive and and Opus like a
magnum opus is a large work uh Opus was
the the largest smartest model at the
time um so that that was the original
kind of thinking behind it um and our
our thinking then was well each new
generation of models should shift that
tradeoff curve uh so when we release
Sonnet 3.5, it has roughly the same, you know, cost and speed as the Sonnet 3 model, but it increased
its intelligence to the point where it
was smarter than the original Opus 3
Model uh especially for code but but
also just in general and so now you know
we've shown results for Haiku 3.5, and I believe Haiku 3.5, the smallest new model, is about as good as Opus 3, the largest
old model so basically the aim here is
to shift the curve and then at some
point there's going to be an opus 3.5 um
now every new generation of models has
its own thing they use new data their
personality changes in ways that we kind
of you know try to steer but are not
fully able to steer and and so uh
there's never quite that exact
equivalence the only thing you're
changing is intelligence um we always
try and improve other things and some
things change without us without us
knowing or measuring so it's it's very
much an inexact science in many ways the
manner and personality of these models
is more an art than it is a science so
what is sort of the reason for uh the
span of time between say Claude Opus 3
and 35 what is it what takes that time
if you can speak to yeah so there's
there's different there's different uh
processes um uh there's pre-training
which is you know just kind of the
normal language model training and that
takes a very long time um that uses you
know these days you know tens you know
tens of thousands sometimes many tens of
thousands of GPUs or TPUs or Trainium
or you know what we use different
platforms but you know accelerator chips
um often often training for months uh
there's then a kind of posttraining
phase where we do reinforcement learning
from Human feedback as well as other
kinds of reinforcement learning that
that phase is getting uh larger and
larger now and you know you know often
that's less of an exact science it often
takes effort to get it right um models
are then tested with some of our early
Partners to see how good they are and
they're then tested both internally and
externally for their safety particularly
for catastrophic and autonomy risks, uh
so uh we do internal testing according
to our responsible scaling policy which
I you know could talk more about that in
detail and then we have an agreement
with the US and the UK AI safety
Institute as well as other third-party
testers in specific domains to test the
models for what are called CBRN risks: chemical, biological, radiological, and
nuclear which are you know we don't
think that models pose these risks
seriously yet but but every new model we
want to evaluate to see if we're
starting to get close to some of these
these these more dangerous um uh these
more dangerous capabilities so those are
the phases and then uh you know then
then it just takes some time to get the
model working in terms of inference and
launching it in the API so there's just
just a lot of steps to uh to actually to
actually making a model work and of
course you know we're always trying to
make the processes as streamlined as
possible right we want our safety
testing to be rigorous but we want it to
be rigorous and to be, you know, to be
automatic to happen as fast as it can
without compromising on rigor same with
our pre-training process and our
posttraining process so you know it's
just like building anything else it's
just like building airplanes you want to
make them you know you want to make them
safe but you want to make the process
streamlined and I think the creative
tension between those is is you know is
an important thing and making the models
work yeah uh rumor on the street I
forget who was saying that Anthropic has really good tooling, so probably
a lot of the challenge here is on the
software engineering side is to build
the tooling to to have a like a
efficient low friction interaction with
the infrastructure you would be
surprised how much of the challenges of
uh you know building these models comes
down to you know software engineering
performance engineering you know you you
know from the outside you might think oh
man we had this Eureka breakthrough
right you know this movie with the
science we discovered it we figured it
out but but but I think I think all
things even even even you know
incredible discoveries like they they
they they they almost always come down
to the details um and and often super
super boring details I can't speak to
whether we have better tooling than than
other companies I mean you know I
haven't been at those other companies at
least at least not recently um but it's
certainly something we give a lot of
attention to I don't know if you can say
but from Claude 3 to Claude 3.5, is there any extra pre-training going on, or is it mostly focused on the
post-training there's been leaps in
performance yeah I think I think at any
given stage we're focused on improving
everything at once um just just
naturally like there are different teams
each team makes progress in a particular
area in in in making a particular you
know their particular segment of the
relay race better and it's just natural
that when we make a new model we put we
put all of these things in at once so
the data you have like the preference
data you get from RLHF, is that applicable
is there ways to apply it to newer
models as it get trained up yeah
preference data from old models
sometimes gets used for new models
although of course uh it it performs
somewhat better when it's you know
trained on it's trained on the new
models note that we have this you know
constitutional AI method such that we
don't only use preference data we kind
of, there's also a post-training process where we train the model against
itself and there's you know new types of
post training the model against itself
that are used every day, so it's not just RLHF, it's a bunch of other methods as well
um post training I think you know it's
becoming more and more sophisticated
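Schematically, the "train the model against itself" step being referred to, constitutional AI as described in Anthropic's published work, looks roughly like a critique-and-revise loop; the function calls below are placeholders for illustration, not a real API.

# Schematic sketch of a constitutional-AI-style critique-and-revise loop
# (placeholder model.generate interface; see Anthropic's Constitutional AI paper).
def constitutional_revision(model, prompt, principles):
    response = model.generate(prompt)
    for principle in principles:
        critique = model.generate(
            f"Critique the response below according to this principle: {principle}\n\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        response = model.generate(
            f"Revise the response to address the critique.\n\n"
            f"Critique: {critique}\nOriginal response: {response}"
        )
    return response  # revised responses become training data for the next round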
Well, what explains the big leap in performance for the new Sonnet 3.5? I mean
at least in the programming side and
maybe this is a good place to talk about
benchmarks what does it mean to get
better just the number went up but you
know I I I program but I also love
programming, and Claude 3.5 through Cursor is what I use to assist me in
programming and there was at least
experientially anecdotally it's gotten
smarter at programming so what like what
what does it take to get it uh to get it
smarter we observe that as well by the
way there were a couple uh very strong
Engineers here at anthropic um who all
previous code models both produced by us
and produced by all the other companies
hadn't really been useful to to hadn't
really been useful to them you know they
said you know maybe maybe this is useful
to a beginner, it's not useful to me, but Sonnet 3.5, the original one, for the first
time they said oh my God this helped me
with something that you know that it
would have taken me hours to do this is
the first model that has actually saved
me time so again the water line is
rising and and then I think you know the
new Sonet has been has been even better
in terms of what it what it takes I mean
I'll just say it's been across the board
it's in the pre-training it's in the
posttraining it's in various evaluations
that we do we've observed this as well
and if we go into the details of the
benchmark, so SWE-bench is basically, you know, since you're a programmer, you'll be familiar with pull requests, and pull requests are, you know, a sort of atomic unit of work, you could say I'm implementing one thing, and so SWE-bench actually gives you kind of a
real world situation where the codebase
is in a current state and I'm trying to
implement something that's you know
that's described in described in
language we have internal benchmarks
where we where we measure the same thing
and you say just give the model free
reign to like you know do anything run
run run anything edit anything um how
how well is it able to complete these
tasks and it's that Benchmark that's
gone from it can do it 3% of the time to
it can do it about 50% of the time um so
I actually do believe that if we get you
can gain benchmarks but I think if we
get to 100% on that Benchmark in a way
that isn't kind of like overtrained or
or or game for that particular Benchmark
probably represents a real and serious
increase in kind of programming ability, and I would suspect that if we can get to, you know, 90, 95%, it will represent the ability to autonomously do a significant fraction of software engineering tasks.
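The number being quoted is essentially a pass rate over real repository tasks; schematically (placeholder functions, not the actual SWE-bench harness), the evaluation looks like this:

# Schematic of a SWE-bench-style evaluation (placeholders, not the real harness):
# each task is a repo snapshot plus an issue description; the agent may edit and
# run anything, and the task counts as solved only if held-out tests then pass.
def pass_rate(agent, tasks):
    solved = 0
    for task in tasks:
        repo = task.checkout()                   # repo at the broken commit
        agent.run(repo, task.issue_description)  # model edits and runs freely
        if repo.run_tests(task.hidden_tests):    # did the change actually work?
            solved += 1
    return solved / len(tasks)
# Going from roughly 3% to roughly 50% on this kind of metric is the jump described above.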
Well, ridiculous timeline question: when is Claude Opus 3.5 coming out?
not giving you an exact date uh but you
know there there uh you know as far as
we know the plan is still to have a
Claude 3.5 Opus. Are we gonna get it before GTA 6 or no? Like Duke Nukem
Forever was that game that there was
some game that was delayed 15 years was
that Duke Nukem Forever yeah and I think
GTA is now just releasing trailers it
you know it's only been three months
since we released the first Sonnet. Yeah, it's the incredible pace of release, it just tells you about the pace, the expectations for when things are going to come out. So what about 4.0? So how do you think about, sort of, as
these models get bigger and bigger about
versioning and also just versioning in
general, why Sonnet 3.5 updated with the date, why not Sonnet 3.6? Naming is actually an
interesting challenge here right because
I think a year ago most of the model was
pre-training and so you could start from
the beginning and just say okay we're
going to have models of different sizes
we're going to train them all together
and you know we'll have a a family of
naming schemes and then we'll put some
new magic into them and then you know
we'll have the next the next Generation
Um, the trouble starts already when some of them take a lot longer than others to train, right, that already messes up your timing a little bit
but as you make big improvements in as
you make big improvements in
pre-training uh then you suddenly notice
oh I can make better pre-train model and
that doesn't take very long to do and
but you know clearly it has the same you
know size and shape of previous models
uh uh so I think those two together as
well as the timing timing issues any
kind of scheme you come up with uh you
know the reality tends to kind of
frustrate that scheme right T tends to
kind of break out of the break out of
the scheme it's not like software where
you can say oh this is like you know 3.7
this is 3.8 no you have models with
different different tradeoffs you can
change some things in your models you
can train you can change other things
some are faster and slower at inference
some have to be more expensive some have
to be less expensive and so I think all
the companies have struggled with this
um, I think, you know, I think we were in a good position in terms of naming when we had Haiku, Sonnet, and Opus, and we're trying to maintain it but it's
not it's not it's not perfect um so
we'll we'll we'll try and get back to
the Simplicity but it it um uh just the
the the nature of the field I feel like
no one's figured out naming it's somehow
a different Paradigm from like normal
software and and and so we we just none
of the companies have been perfect at it
um it's something we struggle with
surprisingly much relative to you know
how relative to how trivial it is to you
know for the the the the grand science
of training the models so from the user
side the user experience of the updated
Sonnet 3.5 is just different than the previous June 2024 Sonnet 3.5. It would
be nice to come up with some kind of
labeling that embodies that because
people talk about Sonnet 3.5 but now there's
a different one and so how do you refer
to the previous one and the new one and
it it uh when there's a distinct
Improvement it just makes conversation
about it uh just challenging yeah yeah I
I definitely think this question of
there are lots of properties of the
models that are not reflected in the
benchmarks um I I think I think that's
that's definitely the case and everyone
agrees and not all of them are
capabilities some of them are you know
models can be polite or brusque, they can
be uh you know uh very reactive or they
can ask you questions um they can have
what what feels like a warm personality
or a cold personality they can be boring
or they can be very distinctive like
Golden Gate Claude was um and we have a
whole you know we have a whole team kind
of focused on I think we call it Claude
character uh Amanda leads that team and
we'll we'll talk to you about that but
it's still a very inexact science um and
and often we find that models have
properties that we're not aware of the
the fact of the matter is that you can
you know talk to a model 10,000 times
and there are some behaviors you might
not see uh just like just like with a
human right I can know someone for a few
months and you know not know that they
have a certain skill or not know there's
a certain side to them and so I think I
think we just have to get used to this
idea and we're always looking for better
ways of testing our models to to
demonstrate these capabilities and and
and also to decide which are which are
the which are the personality properties
we want models to have have and which we
don't want to have that itself the
normative question is also super
interesting I got to ask you a question
from Reddit from Reddit oh
boy you know there there's just this
fascinating to me at least it's a
psychological social
phenomenon where people report that
Claude has gotten Dumber for them over
time and so uh the question is does the
user complaint about the dumbing down of
Claude 3.5 Sonnet hold any water? So are these anecdotal reports a kind of social phenomenon, or are there any cases where Claude would actually get dumber? So
uh this actually doesn't apply this this
isn't just about Claude I I believe this
I believe I've seen these complaints for
every Foundation model produced by a
major company um people said this about
GPT-4, they said it about GPT-4 Turbo, um so
so so a couple things um one the actual
weights of the model right the actual
brain of the model that does not change
unless we introduce a new model um there
there just a number of reasons why it
would not make sense practically to be
randomly substituting in substituting in
new versions of the model it's difficult
from an inference perspective and it's
actually hard to control all the
consequences of changing the weights of the
model let's say you wanted to fine-tune
the model to be like I don't know to
like to say 'certainly' less, which you know an old version of Sonnet used to do,
um you actually end up changing a 100
things as well so we have a whole
process for it and we have a whole
process for modifying the model we do a
bunch of testing on it we do a bunch of
um like we do a bunch of user testing
and early customers, so we basically have
never changed the weights of the model
without without telling anyone and it it
it wouldn't certainly in the current
setup it would not make sense to do that
now there are a couple things that we do
occasionally do um one is sometimes we
run A/B tests, um, but those are typically
very close to when a model is being is
being uh released and for a very small
fraction of time um so uh you know like
the, you know, the day before the new Sonnet 3.5 (I agree we should have had a better name, it's clunky to refer to it) there were some comments from people that it's gotten a lot better, and that's because, you know, a fraction were exposed to an A/B test for those one or two days. Um, the
other is that occasionally the system
prompt will change, um, and the system prompt can have some effects, although it's unlikely to dumb down models, it's unlikely to make them dumber, um, and we've seen that while
these two things which I'm listing to be
very complete, um, happen relatively infrequently, um, the complaints, for us and for other model companies, about the model changed,
the model isn't good at this the model
got more censored the model was dumb
down those complaints are constant and
so I don't want to say like people are
imagining it or anything but like the
models are for the most part not
changing um if I were to offer a theory
um I I think it actually relates to one
of the things I said before which is
that models are very complex
and have many aspects to them and so
often you know if I if I if if I ask a
model a question you know if I'm like if
I'm like do task X versus can you do task X, the model might respond in
different ways uh and and so there are
all kinds of subtle things that you can
change about the way you interact with
the model that can give you very
different results um to be clear this
this itself is like a failing by by us
and by the other model providers that
that the models are are just just often
sensitive to like small small changes in
wording it's yet another way in which
the science of how these models work is
very poorly developed uh and and so you
know if I go to sleep one night and I
was like talking to the model in a
certain way and I like slightly change
the phrasing of how I talk to the model
you know I could I could get different
results so that's that's one possible
way the other thing is man it's just
hard to quantify this stuff uh it's hard
to quantify this stuff I think people
are very excited by new models when they
come out and then as time goes on they
they become very aware of the they
become very aware of the limitations so
that may be another effect but that's
that's all a very long-winded way of
saying for the most part with some
fairly narrow exceptions the models are
not changing I think there is a
psychological effect you just start
getting used to it, the baseline rises, like when people first got Wi-Fi on
airplanes it's like amazing magic and
then now like I can't get this thing to
work this is such a piece of crap
exactly so it's easy to have the
conspiracy theory of they're making
Wi-Fi slower and slower this is probably
something I'll talk to Amanda much more
about but U another Reddit question uh
when will Claude stop trying to be my puritanical grandmother, imposing its moral worldview on me as a paying customer, and also, what is the psychology behind making Claude overly apologetic? So this
kind of reports about The Experience a
different angle on the frustration it
has to do with the character yeah so a
couple points on this first one is um
like things that people say on Reddit
and Twitter or X or whatever it is um
there's actually a huge distribution
shift between like the stuff that people
complain loudly about on social media
and what actually kind of like you know
statistically users care about and that
drives people to use the models like
people are frustrated with you know
things like you know the model not
writing out all the code or the model uh
you know just just not being as good at
code as it could be even though it's the
best model in the world on code um I I
think the majority of thing of things
are about that um uh but uh certainly a
a a kind of vocal minority are uh you
know kind of raise these
concerns right are frustrated by the
model refusing things that it shouldn't
refuse or like apologizing too much or
just just having these kind of like
annoying verbal ticks um the second
caveat and I just want to say this like
super clearly because I think it's like
some people don't know it others like
kind of know it but forget it like it is
very difficult to control across the
board how the models behave you cannot
just reach in there and say oh I want
the model to like apologize less like
you can do that, you can include training
data that says like oh the models should
like apologize less but then in some
other situation they end up being like
super rude or like overconfident in a
way that's like misleading people so
they're they're all these tradeoffs um
uh for example another thing is if there
was a period during which models ours
and I think others as well were too
verbose right they would like repeat
themselves they would say too much um
you can cut down on the verbosity by
penalizing the models for for just
talking for too long what happens when
you do that if you do it in a crude way
is when the models are coding sometimes
they'll say 'rest of the code goes here', right,
because they've learned that that's a
way to economize and that they see it
and then and then so that leads the
model to be so-called lazy in coding
where they where they where they're just
like ah you can finish the rest of it
it's not it's not because we want to you
know save on compute or because you know
the models are lazy and you know during
winter break or any of the other kind of
conspiracy theories that have that have
that have come up. It's actually just very hard to control the behavior of the model.
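A crude version of the verbosity penalty described above might look like the sketch below (illustrative numbers and names only, not Anthropic's reward function); the failure mode is exactly the one mentioned: once length is penalized, eliding the code becomes a cheap way to score well on coding tasks.

# Illustrative only: a crude length penalty layered on an RLHF-style reward.
# If lam is set too aggressively, "rest of the code goes here" becomes a
# high-reward completion, i.e. the "lazy coding" behavior described above.
def shaped_reward(preference_score, num_tokens, target_tokens=400, lam=0.002):
    length_penalty = lam * max(0, num_tokens - target_tokens)
    return preference_score - length_penalty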