Transcript
ugvHCXCOmm4 • Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity | Lex Fridman Podcast #452
Kind: captions
Language: en
if you extrapolate the curves that we've
had so far right if if you say well I
don't know we're starting to get to like
PhD level and and last year we were at
undergraduate level and the year before
we were at like the level of a high
school student again you can you can
quibble with at what tasks and for what
we're still missing modalities but those
are being added like computer use was
added like image generation has been
added if you just kind of like eyeball
the rate at which these capabilities are
increasing it does make you think that
we'll get there by 2026 or 2027 I think
there are still worlds where it doesn't
happen in 100 years those worlds the
number of those worlds is rapidly
decreasing we are rapidly running out of
truly convincing blockers truly
compelling reasons why this will not
happen in the next few years the scale
up is very quick like we we do this
today we make a model and then we deploy
thousands maybe tens of thousands of
instances of it I think by the time you
know certainly within two to three years
whether we have these super powerful AIs
or not clusters are going to get to the size
where you'll be able to deploy millions
of these I am optimistic about meaning I
worry about economics and the
concentration of power that's actually
what I worry about more the abuse of
power and AI increases the amount of
power in the world and if you
concentrate that power and abuse that
power it can do immeasurable damage yes
it's very frightening it's very it's
very
frightening the following is a
conversation with Dario Amodei CEO of
Anthropic the company that created
Claude that is currently and often at
the top of most LLM benchmark
leaderboards on top of that Dario and the
Anthropic team have been outspoken
advocates for taking the topic of AI
safety very seriously and they have
continued to publish a lot of
fascinating AI research on this and
other topics I'm also joined afterwards
by two other brilliant people from
Anthropic first Amanda Askell who is a
researcher working on alignment and
fine-tuning of Claude including the
design of Claude's character and
personality a few folks told me she has
probably talked with Claude more than
any human at Anthropic so she was
definitely a fascinating person to talk
to about prompt engineering and
practical advice on how to get the best
out of Claude after that Chris Olah stopped
by for a chat he's one of the pioneers of
the field of mechanistic
interpretability which is an exciting
set of efforts that aims to reverse
engineer neural networks to figure out
what's going on inside inferring
behaviors from neural activation
patterns inside the network this is a
very promising approach for keeping
future super intelligent AI systems safe
for example by detecting from the
activations when the model is trying to
deceive the human it is talking
to this is the Lex Fridman podcast to
support it please check out our sponsors
in the description and now dear friends
here's Dario
Amodei let's start with a big idea of
scaling laws and the scaling hypothesis
what is it what is its history and where
do we stand today so I can only describe
it as it you know as it relates to kind
of my own experience but I've been in
the AI field for about uh 10 years and
it was something I noticed very early on
so I first joined the AI world when I
was uh working at Baidu with Andrew Ng in
late 2014 which is almost exactly 10
years ago now and the first thing we
worked on was speech recognition systems
and in those days I think deep learning
was a new thing it had made lots of
progress but everyone was always saying
we don't have the algorithms we need to
succeed you know we we we we're we're
not we're only matching a tiny tiny
fraction there's so much we need to kind
of discover algorithmically we haven't
found the picture of how to match the
human brain uh and you know in some
ways I was fortunate I was kind of you
know you can have almost beginner's luck
right I was like a a newcomer to the
field and you know I looked at the
neural net that we were using for speech
the recurrent neural networks and I said
I don't know what if you make them
bigger and give them more layers and
what if you scale up the data along with
this right I just saw these as as like
independent dials that you could turn
and I noticed that the models started to
do better and better as you gave them
more data as you as you made the models
larger as you trained them for longer um
and I I didn't measure things precisely
in those days but but along with with
colleagues we very much got the informal
sense that the more data and the more
compute and the more training you put
into these models the better they
perform and so initially my thinking was
hey maybe that is just true for speech
recognition systems right maybe maybe
that's just one particular quirk one
particular area I think it wasn't until
2017 when I first saw the results from
GPT-1 that it clicked for me that
language is probably the area in which
we can do this we can get trillions of
words of language data we can train on
them and the models we were training in
those days were tiny you could train
them on one to eight GPUs whereas you
know now we train jobs on tens of
thousands soon going to hundreds of
thousands of GPUs and so when I when I
saw those two things together um and you
know there were a few people like Ilya Sutskever
who you've interviewed who had
somewhat similar views right he might
have been the first one although I think
a few people came to came to similar
views around the same time right there
was you know Rich Sutton's Bitter Lesson
there was Gwern who wrote about the scaling
hypothesis but I think somewhere between
2014 and 2017 was when it really clicked
for me when I really got conviction that
hey we're going to be able to do these
incredibly wide cognitive tasks if we
just if we just scale up the models and
at at every stage of scaling there are
always arguments and you know when I
first heard them honestly I thought
probably I'm the one who's wrong and you
know all these all these experts in the
field are right they know the situation
better better than I do right there's
you know the Chomsky argument about like
you can get syntactics but you can't get
semantics there's this idea oh you can
make a sentence make sense but you can't
make a paragraph make sense the latest
one we have today is uh you know we're
going to run out of data or the data
isn't high quality enough or models
can't reason and and each time every
time we manage to we manage to either
find a way around or scaling just is the
way around um sometimes it's one
sometimes it's the other uh and and so
I'm now at this point I I I still think
you know it's it's it's always quite
uncertain we have nothing but inductive
inference to tell us that the next few
years are going to be like the next the
last 10 years but but I've seen I've
seen the movie enough times I've seen
the story happen for for enough times to
to really believe that probably the
scaling is going to continue and that
there's some magic to it that we haven't
really explained on a theoretical basis
yet and of course the scaling here is
bigger networks bigger data bigger
compute yes all in in particular linear
scaling up of bigger networks bigger
training times and uh more and and more
data uh so all of these things almost
like a chemical reaction you know you
have three ingredients in the chemical
reaction and you need to linearly scale
up the three ingredients if you scale up
one not the others you run out of the
other reagents and and the reaction
stops but if you scale up everything
everything in series then then the
reaction can proceed and of course now
that you have this kind of empirical
science/art you can apply it to
other uh more nuanced things like
scaling laws applied to interpretability
or scaling laws applied to post-training
or just seeing how does this thing scale
but the big scaling law I guess the
underlying scaling hypothesis has to do
with big networks big data leads to
intelligence yeah we've we've documented
scaling laws in lots of domains other
than language right so uh initially the
the paper we did that first showed it
was in early 2020 where we first showed
it for language there was then some work
late in 2020 where we showed the same
thing for other modalities like images
video
text to image image to text math they
all had the same pattern and and you're
right now there are other stages like
post-training or there are new types of
reasoning models and in in in all of
those cases that we've measured we see
similar similar types of scaling laws a
bit of a philosophical question but
what's your intuition about why bigger
is better in terms of network size and
data size why does it lead to more
intelligent models so in my previous
career as a as a biophysicist so I did
physics undergrad and then biophysics in
in in in grad school so I think back to
what I know as a physicist which is
actually much less than what some of my
colleagues at Anthropic have in terms of
in terms of expertise in physics uh
there's this there's this concept called
the 1/f noise and 1/x
distributions um where where often um uh
you know just just like if you add up a
bunch of natural processes you get
gaussian if you add up a bunch of kind
of differently distributed natural
processes if you like if you like take a
take a um probe and and hook it up to a
resistor the distribution of the thermal
noise in the resistor goes as one over
the frequency um it's some kind of
natural convergent distribution uh and
and I I I I and and I think what it
amounts to is that if you look at a lot
of things that are that are produced by
some natural process that has a lot of
different scales right not a gaussian
which is kind of narrowly distributed
but you know if I look at kind of like
large and small fluctuations that lead
to lead to electrical noise um they have
this decaying 1/x distribution and
so now I think of like patterns in the
physical world right if I if or or in
language if I think about the patterns
in language there are some really simple
patterns some words are much more common
than others like 'the' then there's basic
noun verb structure then there's the
fact that you know you know nouns and
verbs have to agree they have to
coordinate and there's the higher level
sentence structure then there's the
thematic structure of paragraphs and so
the fact that there's this regressing
structure you can imagine that as you
make the networks larger first they
capture the really simple correlations
the really simple patterns and there's
this long tail of other patterns and if
that long tail of other patterns is
really smooth like it is with the 1/f
noise in you know physical
processes like like like resistors then
you could imagine as you make the
network larger it's kind of capturing
more and more of that distribution and
so that smoothness gets reflected in how
well the models are at predicting and
how well they perform language is an
evolved process right we've we've
developed language we have common words
and less common words we have common
expressions and less common expressions
we have ideas cliches that are expressed
frequently and we have novel ideas and
that process has has developed has
evolved with humans over millions of
years and so the the the guess and this
is pure speculation would be would be
that there is there's some kind of
long-tail distribution of of of the
distribution of these ideas so there's
the long tail but also there's the
height of the hierarchy of concepts that
you're building up so the bigger the
network presumably you have a higher
capacity to exactly if you have a small
network you only get the common stuff
right if if I take a tiny neural network
it's very good at understanding that you
know a sentence has to have you know
verb adjective noun right but it's it's
terrible at deciding what those verb
adjective and noun should be and whether
they should make sense if I make it just
a little bigger it gets good at that
then suddenly it's good at the sentences
but it's not good at the paragraphs and
so the these these rare and more complex
patterns get picked up as I add as I add
more capacity to the network well the
natural question then is what's the
ceiling of this like how complicated and
complex is the real world how much of
stuff is there to learn I don't think
any of us knows the answer to that
question um my strong instinct would
be that there's no ceiling below the level
of humans right we humans are able to
understand these various patterns and so
that that makes me think that if we
continue to you know scale up these
these these models to kind of develop
new methods for training them and
scaling them up uh that will at least
get to the level that we've gotten to
with humans there's then a question of
you know how much more is it possible to
understand than humans do how much how
much is it possible to be smarter and
more perceptive than humans I I would
guess the answer has has got to be
domain dependent if I look at an area
like biology and you know I wrote this
essay Machines of Loving Grace it seems
to me that humans are struggling to
understand the complexity of biology
right if you go to Stanford or to
Harvard or to Berkeley you have whole
departments of you know folks trying to
study you know like the immune system or
metabolic pathways and and each person
understands only a tiny part of it
specializes and they're struggling to
combine their knowledge with that of
with that of other humans and so I have
an instinct that there's there's a lot
of room at the top for AIs to get
smarter if I think of something like
materials in the in the physical world
or you know um like addressing you know
conflicts between humans or something
like that I mean you know it it may be
there's only some of these problems are
not intractable but much harder and and
it it may be that there's only there's
only so well you can do with some of
these things right just like with speech
recognition there's only so clear I can
hear your speech so I think in some
areas there may be ceilings in in in you
know that are very close to what humans
have done in other areas those ceilings
may be very far away and I think we'll
only find out when we build these
systems uh there's it's very hard to
know in advance we can speculate but we
can't be sure and in some domains the
ceiling might have to do with human
bureaucracies and things like this as
you write about yes so humans
fundamentally have to be part of the
loop that's the cause of the ceiling not
maybe the limits of the intelligence
yeah I think in many cases um you know
in theory technology could change very
fast for example all the things that we
might invent with respect to biology um
but remember there's there's a you know
there's a clinical trial system that we
have to go through to actually
administer these things to humans I
think that's a mixture of things that
are unnecessary and bureaucratic and
things that kind of protect the
integrity of society and the whole
challenge is that it's hard to tell it's
hard to tell what's going on uh it's
hard to tell which is which right my my
view is definitely I think in terms of
drug development we my view is that
we're too slow and we're too
conservative but certainly if you get
these things wrong you know it's it's
possible to to to risk people's lives by
by being by being by being too reckless
and so at least at least some of these
human institutions are in fact
protecting people so it's it's all about
finding the balance I strongly suspect
that balance is kind of more on the side
of pushing to make things happen faster
but there is a balance if we do hit a
limit if we do hit a slowdown in the
scaling laws what do you think would be
the reason is it compute limited data
limited uh is it something else idea
limited so a few things now we're
talking about hitting the limit before
we get to the level of of humans and the
skill of humans um so so I think one
that's you know one that's popular today
and I think you know could be a limit
that we run into I like most of the
limits I would bet against it but it's
definitely possible is we simply run out
of data there's only so much data on the
internet and there's issues with the
quality of the data right you can get
hundreds of trillions of words on the
internet but a lot of it is is
repetitive or it's search engine you
know search engine optimization drivel or
maybe in the future it'll even be text
generated by AIs itself uh and and so I
think there are limits to what to to
what can be produced in this way that
said we and I would guess other
companies are working on ways to make
data synthetic uh where you can you know
you can use the model to generate more
data of the type that you have that you
have already or even generate data from
scratch if you think about uh what was
done with uh DeepMind's AlphaGo Zero
they managed to get a bot all the way
from you know no ability to play Go
whatsoever to above human level just by
playing against itself there was no
example data from humans required in the
the AlphaGo Zero version of it the other
direction of course is these reasoning
models that do chain of thought and stop
to think um and and reflect on their own
thinking in a way that's another kind of
synthetic data coupled with
reinforcement learning so my my guess is
with one of those methods we'll get
around the data limitation or there may
be other sources of data that are that
are available um we could just observe
that even if there's no problem with
data as we start to scale models up they
just stop getting better it's it seemed
to be a a reliable observation that
they've gotten better that could just
stop at some point for a reason we don't
understand um the answer could be that
we need to uh you know we need to invent
some new architecture um it's been there
have been problems in the past with with
say numerical stability of models where
it looked like things were were leveling
off but but actually you know know when
we when we when we found the right
unblocker they didn't end up doing so so
perhaps there's some new
optimization method or some new uh
technique we need to to unblock things
I've seen no evidence of that so far but
if things were to to slow down that
perhaps could be one reason what about
the limits of compute meaning uh the
expensive uh nature of building bigger
and bigger data centers so right now I
think uh you know most of the frontier
model companies I would guess are are
operating you know roughly you know $1
billion scale plus or minus a factor of
three right those are the models that
exist now or are being trained now uh I
think next year we're going to go to a
few billion and then uh 2026 we may go
to uh uh you know above 10 10 10 billion
and probably by 2027 there are ambitions to
build hundred billion dollar uh
hundred billion dollar clusters and I
think all of that actually will happen
there's a lot of determination to build
the compute to do it within this country
uh and I would guess that it actually
does happen now if we get to 100 billion
that's still not enough compute that's
still not enough scale then either we
need even more scale or we need to
develop some way of doing it more
efficiently of shifting the curve um I
think be between all of these one of the
reasons I'm bullish about powerful AI
happening so fast is just that if you
extrapolate the next few points on the
curve we're very quickly getting towards
human level ability right some of the
new models that that we developed some
some reasoning models that have come
from other companies they're starting to
get to what I would call the PhD or
professional level right if you look at
their their coding ability um the latest
model we released Sonnet 3.5 the new or
updated version it gets something like
50% on SWE-bench and SWE-bench is an example
of a bunch of professional real world
software engineering tasks at the
beginning of the year I think the
state-of-the-art was three or 4% so in
10 months we've gone from 3% to 50% on
this task and I think in another year
we'll probably be at 90% I mean I don't
know but might might even be might even
be less than that uh we've seen similar
things in graduate level math physics
and biology from models like OpenAI o1
uh so uh if we if we just continue to
extrapolate this right in terms of skill
skill that we have I think if we
extrapolate the straight curve within a
few years we will get to these models
being you know above the the highest
professional level in terms of humans
now will that curve continue you've
pointed to and I've pointed to a lot of
reasons why you know possible reasons
why that might not happen but if the if
the extrapolation curve continues that
is the trajectory we're on so Anthropic
has several competitors it'd be
interesting to get your sort of view of
it all OpenAI Google xAI Meta what
does it take to win in the broad sense
of win in the space yeah so I want to
separate out a couple things right so
you know anthropics anthropic mission is
to kind of try to make this all go well
right and and you know we have a theory
of change called race to the top right
race to the top is about trying to push
the other players to do the right thing
by setting an example it's not about
being the good guy it's about setting
things up so that all of us can be the
good guy I'll give a few examples of
this early in the history of Anthropic
one of our co-founders Chris Olah who I
believe you're you're interviewing soon
you know he's the co-founder of the
field of mechanistic interpretability
which is an attempt to understand what's
going on inside AI models uh so we had
him and one of our early teams focus on
this area of interpretability which we
think is good for making models safe and
transparent for three or four years that
had no commercial application whatsoever
it still doesn't today we're doing some
early betas with it and probably it will
eventually but uh you know this is a
very very long research bet and one in
which we've built in public and
shared our results publicly and and we
did this because you know we think it's
a way to make models safer an
interesting thing is that as we've done
this other companies have started doing
it as well in some cases because they've
been inspired by it in some cases
because they're worried that uh you know
if if other companies are doing this
that look more responsible they want to
look more responsible too no one wants
to look like the irresponsible
actor and and so they adopt this they
adopt this as well when folks come to
Anthropic interpretability is often a
draw and I tell them the other places
you didn't go tell them why you came
here um and and then you see soon that
there that there's interpretability
teams else elsewhere as well and in a
way that takes away our competitive
advantage because it's like oh now
others are doing it as well but it's
good it's good for the broader system
and so we have to invent some new thing
that we're doing others aren't doing as
well and the hope is to basically bid up
bid up the importance of of of doing the
right thing and it's not it's not about
us in particular right it's not about
having one particular good guy other
companies can do this as well if they if
they if they join the race to do this
that's that's you know that's the best
news ever right um uh it's it's just
it's about kind of shaping the
incentives to point upward instead of
shaping the incentives to point to point
downward and we should say this example
the field of uh mechanistic
interpretability is just a a rigorous
non-hand-wavy way of doing AI safety yes
or it's tending that way trying to I
mean I I think we're still early um in
terms of our ability to see things but
I've been surprised at how much we've
been able to look inside these systems
and understand what we see right unlike
with the scaling laws where it feels
like there's some you know law that's
driving these models to perform better
on on the inside the models aren't you
know there's no reason why they should
be designed for us to understand them
right they're designed to operate
they're designed to work just like the
human brain or human biochemistry
they're not designed for a human to open
up the hatch look inside and understand
them but we have found and you know you
can talk in much more detail about this
to Chris that when we open them up when
we do look inside them we we find things
that are surprisingly interesting and as
a side effect you also get to see the
beauty of these models you get to
explore the sort of uh the beautiful
nature of large neural networks through
the mech interp kind of methodology I'm amazed at how
clean it's been I I'm amazed at things
like induction heads I'm amazed at
things like uh you know that that we can
you know use sparse autoencoders to find
these directions within the networks uh
and that the directions correspond to
these very clear concepts we
demonstrated this a bit with the Golden
Gate Bridge Claude so this was an
experiment where we found a direction
inside one of the neural network
layers that corresponded to the Golden
Gate Bridge and we just turned that way
up and so we we released this model as a
demo it was kind of half a joke uh for a
couple days uh but it was it was
illustrative of of the method we
developed and uh you could you could
take the Golden Gate you could take the
model you could ask it about anything
you know you know it would be like how
you could say how was your day and
anything you asked because this feature
was activated would connect to the
Golden Gate Bridge so it would say you
know I'm I'm I'm feeling relaxed and
expansive much like the the arches of
the Golden Gate Bridge or you know it
would masterfully change topic to the
Golden Gate Bridge and it integrated
there was also a sadness to it to to the
focus it had on the Golden Gate Bridge I
think people quickly fell in love with
it I think so people already miss it
because it was taken down I think after
a day somehow these interventions on the
model um where where where where you
kind of adjust its behavior somehow
emotionally made it seem more human than
any other version of the model strong
personality strong ID strong personality
it has these kind of like obsessive
interests you know we can all think of
someone who's like obsessed with
something so it does make it feel
somehow a bit more human let's talk
about the present let's talk about
Claude so this year a lot has happened
in March Claude 3 Opus Sonnet Haiku were
released then Claude 3.5 Sonnet in July with
an updated version just now released and
then also Claude 3.5 Haiku was released
okay can you explain the difference
between Opus Sonnet and Haiku and how we
should think about the different
versions yeah so let's go back to March
when we first released uh these three
models so you know our thinking was you
different companies produce kind of
large and small models better and worse
models we felt that there was demand
both for a really powerful model um you
know and you that might be a little bit
slower that you'd have to pay more for
and also for fast cheap models that are
as smart as they can be for how fast and
cheap right whenever you want to do some
kind of like you know difficult analysis
like if I you know I want to write code
for instance or you know I want to I
want to brainstorm ideas or I want to do
creative writing I want the really
powerful model but then there's a lot of
practical applications in a business
sense where it's like I'm interacting
with a website I you know like I'm like
doing my taxes or I'm you know talking
to uh you know to like a legal adviser
and I want to analyze a contract or you
know we have plenty of companies that
are just like you know you know I want
to do autocomplete on my on my IDE or
something uh and and for all of those
things you want to act fast and you want
to use the model very broadly so we
wanted to serve that whole spectrum of
needs um so we ended up with this uh you
know this kind of poetry theme and so
what's a really short poem it's a haiku
and so Haiku is the small fast cheap
model that is you know was at the time
was released surprisingly surprisingly
uh intelligent for how fast and cheap it
was uh Sonnet is a is a medium-sized
poem right a couple paragraphs and so
Sonnet was the middle model it is smarter
but also a little bit slower a little
bit more expensive and and Opus like a
magnum opus is a large work uh Opus was
the the largest smartest model at the
time um so that that was the original
kind of thinking behind it um and our
our thinking then was well each new
generation of models should shift that
tradeoff curve uh so when we release
Sonnet 3.5 it has the same roughly the
same you know cost and speed as the
Sonnet 3 model uh but uh it it increased
its intelligence to the point where it
was smarter than the original Opus 3
model uh especially for code but but
also just in general and so now you know
we've shown results for Haiku 3.5 and I
believe Haiku 3.5 the smallest new model
is about as good as Opus 3 the largest
old model so basically the aim here is
to shift the curve and then at some
point there's going to be an Opus 3.5 um
now every new generation of models has
its own thing they use new data their
personality changes in ways that we kind
of you know try to steer but are not
fully able to steer and and so uh
there's never quite that exact
equivalence where the only thing you're
changing is intelligence um we always
try and improve other things and some
things change without us without us
knowing or measuring so it's it's very
much an inexact science in many ways the
manner and personality of these models
is more an art than it is a science so
what is sort of the reason for uh the
span of time between say Claude Opus 3
and 3.5 what is it what takes that time
if you can speak to yeah so there's
there's different there's different uh
processes um uh there's pre-training
which is you know just kind of the
normal language model training and that
takes a very long time um that uses you
know these days you know tens you know
tens of thousands sometimes many tens of
thousands of uh GPUs or TPUs or Trainium
or you know what we use different
platforms but you know accelerator chips
um often often training for months uh
there's then a kind of post-training
phase where we do reinforcement learning
from human feedback as well as other
kinds of reinforcement learning that
that phase is getting uh larger and
larger now and you know you know often
that's less of an exact science it often
takes effort to get it right um models
are then tested with some of our early
partners to see how good they are and
they're then tested both internally and
externally for their safety particularly
for catastrophic and autonomy risks uh
so uh we do internal testing according
to our responsible scaling policy which
I you know could talk more about that in
detail and then we have an agreement
with the US and the UK AI Safety
Institute as well as other third-party
testers in specific domains to test the
models for what are called CBRN risks
chemical biological radiological and
nuclear which are you know we don't
think that models pose these risks
seriously yet but but every new model we
want to evaluate to see if we're
starting to get close to some of these
these these more dangerous um uh these
more dangerous capabilities so those are
the phases and then uh you know
then it just takes some time to get the
model working in terms of inference and
launching it in the API so there's
just a lot of steps to
actually making a model work and of
course you know we're always trying to
make the processes as streamlined as
possible right we want our safety
testing to be rigorous but we want it to
be rigorous and to be you know to be
automatic to happen as fast as it can
without compromising on rigor same with
our pre-training process and our
post-training process so you know it's
just like building anything else it's
just like building airplanes you want to
make them you know you want to make them
safe but you want to make the process
streamlined and I think the creative
tension between those is is you know is
an important thing and making the models
work yeah uh rumor on the street I
forget who was saying that uh Anthropic
has really good tooling so uh probably
a lot of the challenge here is on the
software engineering side is to build
the tooling to have like an
efficient low friction interaction with
the infrastructure you would be
surprised how much of the challenges of
uh you know building these models comes
down to you know software engineering
performance engineering you know
from the outside you might think oh
man we had this eureka breakthrough
right you know this movie with the
science we discovered it we figured it
out but I think all
things even you know
incredible discoveries like they
almost always come down
to the details um and often super
super boring details I can't speak to
whether we have better tooling than than
other companies I mean you know I
haven't been at those other companies at
least at least not recently um but it's
certainly something we give a lot of
attention to I don't know if you can say
but from Claude 3 to Claude 3.5 is
there any extra pre-training going on or
is it mostly focused on the
post-training there's been leaps in
performance yeah I think I think at any
given stage we're focused on improving
everything at once um just just
naturally like there are different teams
each team makes progress in a particular
area in making you
know their particular segment of the
relay race better and it's just natural
that when we make a new model we put we
put all of these things in at once so
the data you have like the preference
data you get from RLHF is that applicable
are there ways to apply it to newer
models as they get trained up yeah
preference data from old models
sometimes gets used for new models
although of course uh it performs
somewhat better when it's you know
trained on the new
models note that we have this you know
constitutional AI method such that we
don't only use preference data
there's also a post-training
process where we train the model against
itself and there's you know new types of
post-training the model against itself
that are used every day so it's not just
RLHF it's a bunch of other methods as well
um post-training I think you know it's
becoming more and more sophisticated
well what explains the big leap in
performance for the new Sonnet 3.5 I mean
at least in the programming side and
maybe this is a good place to talk about
benchmarks what does it mean to get
better just the number went up but you
know I program but I also love
programming and I um Claude 3.5 through
Cursor is what I use uh to assist me in
programming and at least
experientially anecdotally it's gotten
smarter at programming so what like
what does it take to get it uh to get it
smarter we observe that as well by the
way there were a couple uh very strong
engineers here at Anthropic um for whom all
previous code models both produced by us
and produced by all the other companies
hadn't really been useful you know they
said you know maybe this is useful
to a beginner it's not useful to me but
Sonnet 3.5 the original one for the first
time they said oh my God this helped me
with something that you know
would have taken me hours to do this is
the first model that has actually saved
me time so again the water line is
rising and then I think you know the
new Sonnet has been even better
in terms of what it takes I mean
I'll just say it's been across the board
it's in the pre-training it's in the
posttraining it's in various evaluations
that we do we've observed this as well
and if we go into the details of the
benchmark so SWE-bench is basically you
know since
you're a programmer you know you'll be
familiar with like pull requests and you
know pull requests are like
you know a sort of
atomic unit of work you know you could
say I'm you know
I'm implementing one thing um and
so SWE-bench actually gives you kind of a
real world situation where the codebase
is in a current state and I'm trying to
implement something that's you know
described in
language we have internal benchmarks
where we measure the same thing
and you say just give the model free
rein to like you know do anything
run anything edit anything um
how well is it able to complete these
tasks and it's that benchmark that's
gone from it can do it 3% of the time to
it can do it about 50% of the time um so
I actually do believe that you
can game benchmarks but I think if we
get to 100% on that benchmark in a way
that isn't kind of like overtrained
or gamed for that particular benchmark
it probably represents a real and serious
increase in kind of
programming
ability and I would suspect that if
we can get to you know 90 95%
that you know it will
represent ability to autonomously do a
significant fraction of software
engineering
tasks well ridiculous timeline question
uh when is Claude Opus uh 3.5 coming out uh
not giving you an exact date uh but you
know uh as far as
we know the plan is still to have a
Claude 3.5 Opus are we gonna get it
before GTA 6 or no like Duke Nukem
Forever was that game there was
some game that was delayed 15 years was
that Duke Nukem Forever yeah and I think
GTA is now just releasing trailers
you know it's only been three months
since we released the first Sonnet yeah
it's the incredible pace of release it
just tells you about the pace
the expectations for when things are
going to come out so uh what about
4.0 so how do you think about sort of as
these models get bigger and bigger about
versioning and also just versioning in
general why Sonnet 3.5 updated with the
date why not Sonnet
3.6 actually naming is actually an
interesting challenge here right because
I think a year ago most of the model was
pre-training and so you could start from
the beginning and just say okay we're
going to have models of different sizes
we're going to train them all together
and you know we'll have a family of
naming schemes and then we'll put some
new magic into them and then you know
we'll have the next generation
um the trouble starts already when
some of them take a lot longer than
others to train right that already
messes up your timing a little bit
but as
you make big improvements in
pre-training uh then you suddenly notice
oh I can make a better pre-trained model and
that doesn't take very long to do and
but you know clearly it has the same you
know size and shape of previous models
uh so I think those two together as
well as the timing issues any
kind of scheme you come up with uh you
know the reality tends to kind of
frustrate that scheme right it tends to
kind of break out of
the scheme it's not like software where
you can say oh this is like you know 3.7
this is 3.8 no you have models with
different tradeoffs you can
change some things in your models you
can change other things
some are faster and slower at inference
some have to be more expensive some have
to be less expensive and so I think all
the companies have struggled with this
um you know I
think we were in a good position in
terms of naming when we had Haiku Sonnet
and Opus and we're trying to maintain it but it's
not perfect um so
we'll try and get back to
the simplicity but um uh just the
nature of the field I feel like
no one's figured out naming it's somehow
a different paradigm from like normal
software and so none
of the companies have been perfect at it
um it's something we struggle with
surprisingly much relative to you know
how trivial it is you
know for the grand science
of training the models so from the user
side the user experience of the updated
Sonnet 3.5 is just different than the
previous uh June 2024 Sonnet 3.5 it would
be nice to come up with some kind of
labeling that embodies that because
people talk about Sonnet 3.5 but now there's
a different one and so how do you refer
to the previous one and the new one and
uh when there's a distinct
improvement it just makes conversation
about it uh just challenging yeah yeah
I definitely think this question of
there are lots of properties of the
models that are not reflected in the
benchmarks um I think that's
definitely the case and everyone
agrees and not all of them are
capabilities some of them are you know
models can be polite or brusque they can
be uh you know uh very reactive or they
can ask you questions um they can have
what what feels like a warm personality
or a cold personality they can be boring
or they can be very distinctive like
Golden Gate Claude was um and we have a
whole you know we have a whole team kind
of focused on I think we call it Claude
character uh Amanda leads that team and
will talk to you about that but
it's still a very inexact science um and
and often we find that models have
properties that we're not aware of the
the fact of the matter is that you can
you know talk to a model 10,000 times
and there are some behaviors you might
not see uh just like just like with a
human right I can know someone for a few
months and you know not know that they
have a certain skill or not know there's
a certain side to them and so I
think we just have to get used to this
idea and we're always looking for better
ways of testing our models to
demonstrate these capabilities and
also to decide which are
the personality properties
we want models to have and which we
don't want them to have that itself the
normative question is also super
interesting I got to ask you a question
from Reddit from Reddit oh
boy you know there's just this
fascinating to me at least it's a
psychological social
phenomenon where people report that
Claude has gotten dumber for them over
time and so uh the question is does the
user complaint about the dumbing down of
Claude 3.5 Sonnet hold any water so are
these anecdotal reports a kind of social
phenomenon or are there any
cases where Claude would get dumber so
uh this actually
isn't just about Claude I
believe I've seen these complaints for
every foundation model produced by a
major company um people said this about
GPT-4 they said it about GPT-4 Turbo um
so a couple things um one the actual
weights of the model right the actual
brain of the model that does not change
unless we introduce a new model um
there are just a number of reasons why it
would not make sense practically to be
randomly substituting in
new versions of the model it's difficult
from an inference perspective and it's
actually hard to control all the
consequences of changing the weights of the
model let's say you wanted to fine-tune
the model to I don't know
say certainly less which you
know an old version of Sonnet used to do
um you actually end up changing a hundred
things as well so we have a whole
process for it and we have a whole
process for modifying the model we do a
bunch of testing on it
um like we do a bunch of user testing
with early customers so we both have
never changed the weights of the model
without telling anyone and
certainly in the current
setup it would not make sense to do that
now there are a couple things that we do
occasionally do um one is sometimes we
run AB tests um but those are typically
very close to when a model is
being uh released and for a very small
fraction of time um so uh you know like
the day before the new
Sonnet 3.5 I agree we
should have had a better name it's
clunky to refer to it um there were some
comments from people that like
it's gotten a lot better and
that's because you know a fraction were
exposed to an AB test for
those one or two days um the
other is that occasionally the system
prompt will change um the system
prompt can have some effects although
it's unlikely to dumb down
models it's unlikely to make them dumber
um and we've seen that while
these two things which I'm listing to be
very complete um
happen quite infrequently um the
complaints for us and for other
model companies about the model changed
the model isn't good at this the model
got more censored the model was dumbed
down those complaints are constant and
so I don't want to say like people are
imagining it or anything but like the
models are for the most part not
changing um if I were to offer a theory
um I think it actually relates to one
of the things I said before which is
that models are very complex
and have many aspects to them and so
often you know if I ask a
model a question you know if
I'm like do task X versus can you do
task X the model might respond in
different ways uh and so there are
all kinds of subtle things that you can
change about the way you interact with
the model that can give you very
different results um to be clear
this itself is like a failing by us
and by the other model providers
that the models are just often
sensitive to like small changes in
wording it's yet another way in which
the science of how these models work is
very poorly developed uh and so you
know if I go to sleep one night and I
was like talking to the model in a
certain way and I like slightly changed
the phrasing of how I talk to the model
you know I could get different
results so that's one possible
way the other thing is man it's just
hard to quantify this stuff uh it's hard
to quantify this stuff I think people
are very excited by new models when they
come out and then as time goes on they
become very aware of the limitations so
that may be another effect but
that's all a very long-winded way of
saying for the most part with some
fairly narrow exceptions the models are
not changing I think there is a
psychological effect you just start
getting used to it the baseline rises like
when people first got Wi-Fi on
airplanes it's like amazing magic and
then now like I can't get this thing to
work this is such a piece of crap
exactly so it's easy to have the
conspiracy theory of they're making
Wi-Fi slower and slower this is probably
something I'll talk to Amanda much more
about but uh another Reddit question uh
when will Claude stop trying to be my uh
puritanical grandmother imposing its moral
worldview on me as a paying customer and
also what is the psychology behind
making Claude overly apologetic so these
kinds of reports about the experience a
different angle on the frustration it
has to do with the character yeah so a
couple points on this first one is um
like things that people say on Reddit
and Twitter or X or whatever it is um
there's actually a huge distribution
shift between like the stuff that people
complain loudly about on social media
and what actually kind of like you know
statistically users care about and that
drives people to use the models like
people are frustrated with you know
things like you know the model not
writing out all the code or the model uh
you know just not being as good at
code as it could be even though it's the
best model in the world on code um I
think the majority of things
are about that um uh but uh certainly
a kind of vocal minority uh you
know kind of raise these
concerns right are frustrated by the
model refusing things that it shouldn't
refuse or like apologizing too much or
just having these kind of like
annoying verbal tics um the second
caveat and I just want to say this like
super clearly because I think it's like
some people don't know it others like
kind of know it but forget it like it is
very difficult to control across the
board how the models behave you cannot
just reach in there and say oh I want
the model to like apologize less like
you can do that you can include training
data that says like oh the model should
like apologize less but then in some
other situation they end up being like
super rude or like overconfident in a
way that's like misleading people so
there are all these tradeoffs um
uh for example another thing is there
was a period during which models ours
and I think others as well were too
verbose right they would like repeat
themselves they would say too much um
you can cut down on the verbosity by
penalizing the models for just
talking for too long what happens when
you do that if you do it in a crude way
is when the models are coding sometimes
they'll say rest of the code goes here right
because they've learned that that's a
way to economize and that they see it
and then so that leads the
model to be so-called lazy in coding
where they're just
like ah you can finish the rest of it
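Editor's sketch of the effect just described, under loud assumptions: the reward shape, the per-token penalty and the quality scores below are all hypothetical and are not taken from any lab's training setup. It only shows how a flat length penalty, applied crudely, can rank a truncated answer with a placeholder comment above the complete answer.

```python
# Toy illustration only: score_response, the penalty values and the
# quality scores are hypothetical, chosen to make the tradeoff visible.

def score_response(quality: float, text: str, penalty_per_token: float) -> float:
    """Quality score minus a flat penalty per whitespace-separated token."""
    return quality - penalty_per_token * len(text.split())

# A "complete" answer with four small functions.
functions = [f"def f{i}(a, b):\n    return a + b + {i}" for i in range(4)]
full_answer = "\n\n".join(functions)

# A "lazy" answer: the first function plus a placeholder comment.
lazy_answer = functions[0] + "\n# rest of the code goes here"

# Assume a grader rates the complete answer higher on quality alone.
FULL_QUALITY = 1.0
LAZY_QUALITY = 0.8

def preferred(penalty_per_token: float) -> str:
    """Return which answer the penalized reward ranks higher."""
    full = score_response(FULL_QUALITY, full_answer, penalty_per_token)
    lazy = score_response(LAZY_QUALITY, lazy_answer, penalty_per_token)
    return "full" if full >= lazy else "lazy"

print(preferred(0.0))
print(preferred(0.05))
```

With no penalty the complete answer is preferred; with a steep enough flat penalty the placeholder answer wins, which is the "lazy coding" side effect in miniature.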
it's not because we want to you
know save on compute or because you know
the models are lazy you know during
winter break or any of the other kind of
conspiracy theories that
have come up it's actually
just very hard to control the behavior
of the model to steer the behavior of
the model in all circumstances at once
there's this whack-a-mole
aspect where you push on one thing and
like you know
these other things start to move as well
that you may not even notice or measure
and so one of the reasons that I
care so much about uh you know kind of
grand alignment of these AI systems in
the future is actually these systems are
actually quite unpredictable they're
actually quite hard to steer and control
um and this version we're seeing today
of you make one thing better it makes
another thing worse uh I think
that's like a present day analog of
future control problems in AI systems
that we can start to study today right I
think that difficulty
in steering the behavior and in
making sure that if we push an AI system
in one direction it doesn't push it in
another direction in some other
ways that we didn't want uh I think
that's kind of
an early sign of things to come and if
we can do a good job of solving this
problem right of like you ask the model
to like you know make and
distribute smallpox and it says no but
it's willing to like help you in your
graduate level virology class like how
do we get both of those things at once
it's hard it's very easy to go to one
side or the other and it's a
multi-dimensional problem and so uh I
you know I think these questions of like
shaping the model's personality I think
they're very hard I think we haven't
done perfectly on them I think we've
actually done the best of all the AI
companies but still so far from perfect
uh and I think if we can get this right
if we can you know
control the false positives and false
negatives in this very kind of
controlled present day environment we'll
be much better at doing it for the
future when our worry is you know will
the models be super autonomous will they
be able to you know make very dangerous
things will they be able to autonomously
you know build whole companies and are
those companies aligned so I
think of this present task as both
vexing but also good practice for the
future what's the current best way of
gathering sort of user feedback like uh
not anecdotal data but just large scale
data about pain points or the opposite
of pain points positive things and so on is
it internal testing is it a
specific group testing AB testing
what works so typically um
we'll have internal model bashings where
all of Anthropic Anthropic is almost a
thousand people um you know people
just try and break the model they try
and interact with it various ways um uh
we have a suite of evals uh for you know
oh is the model refusing in ways
that it shouldn't I think we even had a
certainly eval because you know
again at one point the model had this
problem where like it had this annoying
tic where it would like respond to a
wide range of questions by saying
certainly I can help you with that
certainly I would be happy to do that
certainly this is correct um uh and so
we had a like certainly eval which is
like how often does the model say
certainly uh but look this is
just a whack-a-mole like what if it
switches from certainly to definitely
like uh so you know every time we add
a new eval and we're always evaluating
for all the old things so we have
hundreds of these evaluations but we
find that there's no substitute for
humans interacting with it and so it's
very much like the ordinary product
development process we have like
hundreds of people within Anthropic bash
the model then we do uh you know
external AB tests sometimes we'll run
tests with contractors we pay
contractors to interact with the model
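Editor's sketch of what a "certainly eval" like the one mentioned above might look like in its simplest form. This is an assumption, not Anthropic's actual eval: the function name opener_rate, the sample responses and the choice to check only the first word are all hypothetical. It also shows the whack-a-mole point: a check for one opener says nothing about a model that switches to another.

```python
import re

def opener_rate(responses: list[str],
                openers: tuple[str, ...] = ("certainly",)) -> float:
    """Fraction of responses whose first word is one of `openers`."""
    if not responses:
        return 0.0
    hits = 0
    for text in responses:
        # Skip leading punctuation/whitespace and grab the first word.
        first = re.match(r"\W*(\w+)", text)
        if first and first.group(1).lower() in openers:
            hits += 1
    return hits / len(responses)

# Hypothetical sampled model responses.
samples = [
    "Certainly! I can help you with that.",
    "Definitely, here is the answer.",
    "Here is a draft.",
    "certainly this is correct",
]

print(opener_rate(samples))
print(opener_rate(samples, openers=("certainly", "definitely")))
```

On these four samples the single-word check flags half of them, while widening the opener list to include "definitely" flags three of four, so each new verbal tic needs its own check.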
um so you put all of these things
together and it's still not perfect you
still see behaviors that you don't quite
want to see right you know you
still see the model like refusing things
that it just doesn't make sense to
refuse um but I think
trying to solve this challenge right
trying to stop the model from doing you
know genuinely bad things that you know
everyone agrees it shouldn't do
right you know
everyone agrees that you know the
model shouldn't talk about you know I
don't know child abuse material right
like everyone agrees the model shouldn't
do that uh but at the same time that
it doesn't refuse in these dumb and
stupid ways uh I think
drawing that line as finely as possible
approaching perfectly is still
a challenge and we're getting better at
it every day but there's a lot
to be solved and again I would point to
that as as an indicator of a challenge
ahead in terms of steering much more
powerful models do you think Claude 4.0
is ever coming out I don't want to
commit to any naming scheme because if I
say here we're gonna have
Claude 4 next year and then you
know then we decide that like you know
we should start over because there's a
new type of model like I don't
want to commit to it I
would expect in a normal course of
business that Claude 4 would come
after Claude 3.5 but you know
you never know in this wacky field
right but sort of this idea of
scaling is continuing scaling is
continuing there will definitely
be more powerful models coming from us
than the models that exist today that
is certain or if there
aren't we've deeply failed as a
company okay can you explain the
responsible scaling policy and the AI
safety level standards ASL levels as
much as I'm excited about the benefits
of these models and you know we'll talk
about that if we talk about Machines of
Loving Grace um I'm worried about
the risks and I continue to be worried
about the risks uh no one should think
that you know Machines of Loving Grace
was me saying uh you know I'm no
longer worried about the risks of these
models I think they're two sides of the
same coin the uh power of the models
and their ability to solve all these
problems in you know biology
neuroscience economic development
governance and peace large
parts of the economy those come
with risks as well right with great
power comes great responsibility right
the two are
paired uh things that are powerful can
do good things and they can do bad
things um I think of those risks as
being in you know several
different categories perhaps the two
biggest risks that I think about and
that's not to say that there aren't
risks today that are important
but when I think of really the
you know the things that would happen on
the grandest scale um one is what I call
catastrophic misuse this is misuse of
the models in domains like cyber bio
radiological nuclear right things that
could you know harm or even
kill thousands even millions of people
if they really really go wrong um like
these are the you know number one
priority to prevent and here I would
just make a simple observation which is
that you know if I look
today at people who have done really bad
things in the world um uh I think
actually Humanity has been protected by
the fact that the overlap between really
smart well-educated people and people
who want to do really horrific things
has generally been small like you know
let's say I'm someone who you
know uh I have a PhD in this
field I have a well-paying job um
there's so much to lose why do I want to
like you know even assuming I'm
completely evil which most people
are not um why you know why would
such a person risk their you
know risk their life risk their
legacy their reputation to do
something like you know truly truly evil
if we had a lot more people like that
the world would be a much more dangerous
place and so my worry is that by
being a much more intelligent agent AI
could break that correlation and so
I do have serious worries about that
I believe we can prevent those worries
uh but you know I think as a
counterpoint to Machines of Loving Grace
I want to say that there's
still serious risks and the second
range of risks would be the autonomy
risks which is the idea that models
might on their own particularly as we
give them more agency than they've had
in the past uh particularly as we give
them supervision over wider tasks like
you know writing whole code bases or
someday even you know effectively
operating entire entire companies
they're on a long enough leash are they
doing what we really want them
to do it's very difficult to even
understand in detail what they're doing
let alone control it and like
I said these early signs that
it's hard to perfectly draw the boundary
between things the model should do and
things the model shouldn't do that
you know if you go to one side you
get things that are annoying and useless
and you go to the other side you get
other behaviors if you fix one thing it
creates other problems we're getting
better and better at solving this I
don't think this is an unsolvable
problem I think this is a you know this
is a science like the safety of
airplanes or the safety of cars or the
safety of drugs you know I don't
think there's any big thing we're
missing I just think we need to get
better at controlling these models and
so these are the two risks I'm
worried about and our responsible
scaling plan which I'll recognize is a
very long-winded answer to your question
I love it I love it our responsible
scaling plan is designed to address
these two types of risks and so every
time we develop a new model we basically
test it for its ability to do both of
these bad things so if I were to back up
a little bit um I
think we have an interesting dilemma
with AI systems where they're not yet
powerful enough to present these
catastrophes I don't
know that they'll ever present these
catastrophes it's possible they won't
but the case for worry the case for
risk is strong enough that
we should act now and they're
getting better very very fast right
you know I testified in the Senate that
you know we might have serious bio risks
within two to three years that was about
a year ago things have proceeded
apace uh so we have this thing where
it's like it's surprisingly
hard to address these risks because
they're not here today they don't exist
they're like ghosts but they're coming
at us so fast because the models are
improving so fast so so how do you deal
with something that's not here today
doesn't exist but is is coming at us
very fast uh so the solution we came up
with for that in collaboration with
uh you know people like uh the
organization METR and Paul Christiano
is okay what you need for
that is you need tests to tell you when
the risk is getting close you need an
early warning system and so every
time we have uh a new model we test it
for its capability to do these CBRN tasks
as well as testing it for you know how
capable it is of doing tasks
autonomously on its own and uh in the
latest version of our RSP which we
released in the last month
or two uh the way we test autonomy risks
is the AI model's ability
to do aspects of AI research itself uh
which when the AI models
can do AI research they become kind of
truly autonomous and that you
know that threshold is important in a
bunch of other ways and so what do
we then do with these tasks the RSP
basically develops what we've called an
if then structure which is if the models
pass a certain capability then we impose
a certain set of safety and security
requirements on them so today's models
are what's called
ASL-2 models ASL-1 is for
systems that manifestly don't pose any
risk of autonomy or misuse so for
example a chess playing bot Deep Blue
would be ASL-1 it's just manifestly the
case that you can't use Deep Blue for
anything other than chess it was just
designed for chess no one's going to use
it to like you know conduct a
masterful cyber attack or to you know
run wild and take over the world ASL-2 is
today's AI systems where we've measured
them and we think these systems are
simply not smart enough to uh you
know autonomously self-replicate or
conduct a bunch of tasks uh and also not
smart enough to provide meaningful
information about CBRN risks and how to
build CBRN weapons above and beyond what
can be known from looking at Google uh
in fact sometimes they do provide
information but not above and beyond
a search engine but not in a way that
can be stitched together um not in a
way that kind of end to end is dangerous
enough so
asl3 is going to be the point at which
uh the models are helpful enough to
enhance the capabilities of non-state
actors right state actors can already do
unfortunately to a high
level of proficiency a lot of these very
dangerous and destructive things the
difference is that non-state
actors are not capable of it and so when
we get to ASL-3 we'll take special
security precautions designed to be
sufficient to prevent theft of the model
by non-state actors and misuse of the
model as it's deployed we'll have to
have enhanced filters targeted at these
particular areas cyber bio nuclear
and model autonomy which is
less a misuse risk and more a risk of
the model doing bad things itself ASL-4 is
getting to the point where these models
could enhance the capability of an
already knowledgeable state actor
and/or become the main
source of such a risk like if you wanted
to engage in such a risk the main way
you would do it is through a model and
then I think ASL-4 on the autonomy side
is some amount of
acceleration in AI research capabilities
with an AI model and then ASL-5
is where we would get to the models that
are truly capable
that could exceed humanity in their
ability to do any of these tasks
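
The tiers described above can be pictured as a small if-then lookup. What follows is only an illustrative Python sketch: the flag names and safeguard strings are hypothetical, paraphrased from what is said in this conversation, and are not taken from the actual RSP text. ASL-4 is left out because, as discussed later, it is not specified yet.

```python
# Hypothetical capability flags, paraphrased from the conversation.
# They are NOT the names of real evaluations.
ASL3_TRIGGERS = (
    "cbrn_uplift_for_non_state_actors",
    "autonomous_replication",
)

# Safeguards as loosely described by the speaker. The ASL-2 entry is a
# placeholder assumption, not a quote from any policy.
SAFEGUARDS = {
    "ASL-2": ["baseline security and deployment practices (placeholder)"],
    "ASL-3": [
        "security designed to prevent model theft by non-state actors",
        "enhanced deployment filters for cyber, bio, and nuclear misuse",
    ],
}


def required_level(eval_results):
    """If any trigger evaluation fires, then the stricter tier applies."""
    fired = any(eval_results.get(name, False) for name in ASL3_TRIGGERS)
    return "ASL-3" if fired else "ASL-2"


def required_safeguards(eval_results):
    """Look up the safeguards that go with the required tier."""
    return SAFEGUARDS[required_level(eval_results)]


if __name__ == "__main__":
    print(required_level({}))
    print(required_safeguards({"autonomous_replication": True}))
```

The point of the shape is that nothing extra is imposed until an evaluation actually fires, which matches the stated goal of avoiding burdens on models that are not dangerous today.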
and so the point of the if-then
structure commitment is basically to
say
look I don't know I've been I've been
working with these models for many years
and I've been worried about risk for
many years it's actually kind of
dangerous to cry wolf it's actually kind
of dangerous to say
this model is risky and
you know people look at it and they say
this is manifestly not dangerous again
it's the delicacy of the
risk isn't here today but it's coming
at us fast how do you deal with that
it's really vexing to a risk
planner to deal with it and so this
if-then structure basically says look we
don't want to antagonize a bunch of
people we don't want to harm our own
ability to have
a place in the conversation by imposing
these very onerous burdens on models
that are not dangerous today so the
if-then the trigger commitment is basically
a way to deal with this it says you
clamp down hard when you can show that
the model is dangerous and of course
what has to come with that is
enough of a buffer threshold that
you're not at high
risk of missing the danger it's
not a perfect framework we've had to
change it we
came out with a new one just a few weeks
ago and probably going forward
we might release new ones multiple times
a year because it's hard to get
these policies right like technically
organizationally from a research
perspective but that is the proposal
if-then commitments and triggers in order
to minimize burdens and false alarms now
but really react appropriately when the
dangers are here what do you think the
timeline for ASL-3 is where several of
the triggers are fired and what do you
think the timeline is for ASL-4 yeah so
that is hotly debated within the company
we are working actively to prepare
ASL-3 security measures as
well as ASL-3 deployment measures
I'm not going to go into detail but
we've made a lot of progress
on both and we're
prepared to be I think ready quite soon
I
would not be surprised at all if we hit
ASL-3 next year there was some
concern that we might even hit it
this year that's still
possible that could still happen it's
very hard to say but I would
be very very surprised if it was like
2030 I think it's much sooner than
that so there are protocols for
detecting it the if-then and then
there are protocols for how to respond to
it yes how difficult is the second the
latter yeah I think for ASL-3 it's
primarily about security and
about filters on the model
relating to a very narrow set of areas
when we deploy the model because at ASL-3
the model isn't autonomous yet
and so you don't have to worry about
the model itself behaving
in a bad way even when it's deployed
internally so I think the ASL-3 measures
are I won't say straightforward
they're rigorous
but they're easier to reason about I
think once we get to
ASL-4 we start to have worries about
the models being smart enough that they
might sandbag tests they might not tell
the truth about tests we had some
results come out about sleeper
agents and there was a more recent paper
about whether the models can
mislead attempts to you know sandbag
their own abilities right
present themselves as being
less capable than they are and so I
think with ASL-4 there's going to be an
important component of using other
things than just interacting with the
models for example interpretability or
hidden chains of thought where you
have to look inside the model and verify
via some other mechanism that
is not as easily corrupted
as what the model says
that the
model indeed has some property so
we're still working on ASL-4 one of the
properties of the RSP is that we
don't specify ASL-4 until we've hit ASL-3
and I think that's proven to be a
wise decision because even with ASL-3
again it's hard to know this stuff in
detail and we want to take as
much time as we can possibly take to get
these things right so for ASL-3 the bad
actor will be the humans humans yes and
so there it's a little bit more for
ASL-4 it's both I think it's both and so
deception and that's where mechanistic
interpretability comes into play and
hopefully the techniques used for that
are not made accessible to the model
yeah I mean of course you can hook up
the mechanistic interpretability to the
model itself but then
you've kind of lost it as a
reliable indicator of
the model state there are a bunch of
exotic ways you can think of that it
might also not be reliable like if the
you know model gets smart enough that it
can like you know jump computers and
like read the code where you're like
looking at its internal State we've
thought about some of those I think
they're exotic enough there are ways to
render them unlikely but yeah generally
you want to you want to preserve
mechanistic interpretability as a kind
of verification set or test set that's
separate from the training process of
the model see I think as these models
become better and better at conversation
and become smarter social engineering
becomes a threat too because they oh yeah
can start being very convincing to
the engineers inside companies oh yeah
yeah it's actually like you know we've
we've seen lots of examples of
demagoguery in our life from humans and
and you know there's a concern that
models could do that as
well one of the ways that Claude has been
getting more and more powerful is it's
now able to do some agentic stuff
computer use there's also an analysis
within the sandbox of claude.ai itself but
let's talk about computer use that
seems to me super exciting that you can
just give Claude a task and it takes
a bunch of actions figures it out and
has access to your computer through
screenshots so can you explain how that
works uh and where that's headed yeah
it's actually relatively simple so
Claude has had for a long time
since Claude 3 back in March the ability
to analyze images and respond to them
with text the only new thing we
added is those images can be
screenshots of a computer and in response we
train the model to give a location on
the screen where you can click and/or
buttons on the keyboard you can press in
order to take action and it turns out
that with actually not all that much
additional training the models can get
quite good at that task it's a good
example of generalization um you know
people sometimes say if you get to low
earth orbit you're like halfway to
anywhere right because of how much it
takes to escape the gravity well if you
have a strong pre-trained model I feel
like you're halfway to anywhere
in terms of the
intelligence space and
so actually it didn't take all
that much to get Claude to do
this and you can just set that in a loop
give the model a screenshot tell it what
to click on give it the next screenshot
tell it what to click on and that
turns into a full kind of almost
3D video interaction of the model and
it's able to do all of these tasks right
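
The screenshot and click loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Anthropic's implementation and not a real API: Action, ToyScreen, toy_policy and run_loop are all hypothetical names, the "screenshot" is just a string, and the "model" is a hard-coded function. The max_steps cap is one simple example of the boundaries mentioned later in the conversation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One step chosen by the (stand-in) model."""
    kind: str   # "click", "key", or "done"
    x: int = 0  # screen location for a click
    y: int = 0
    key: str = ""  # key name for a key press


class ToyScreen:
    """Fake computer: its only state is how many clicks it has received."""

    def __init__(self) -> None:
        self.clicks = 0

    def screenshot(self) -> str:
        # A real system would return image bytes; a string keeps this runnable.
        return f"clicks={self.clicks}"

    def apply(self, action: Action) -> None:
        if action.kind == "click":
            self.clicks += 1


def toy_policy(shot: str) -> Action:
    """Stand-in for the model: click until the screen shows three clicks."""
    count = int(shot.split("=")[1])
    return Action("done") if count >= 3 else Action("click", x=10, y=20)


def run_loop(
    take_screenshot: Callable[[], str],
    choose_action: Callable[[str], Action],
    apply_action: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Screenshot, ask for an action, apply it, repeat, up to max_steps."""
    history: List[Action] = []
    for _ in range(max_steps):
        action = choose_action(take_screenshot())
        if action.kind == "done":
            break
        apply_action(action)
        history.append(action)
    return history


if __name__ == "__main__":
    screen = ToyScreen()
    steps = run_loop(screen.screenshot, toy_policy, screen.apply)
    print(len(steps), screen.screenshot())
```

In this sketch the loop itself knows nothing about the task; everything task specific lives in the policy, which is the role the trained model plays in the description above.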
you know we showed these demos where
it's able to like fill out spreadsheets
it's able to kind of like interact with
a website
it's able to open all kinds of
programs different operating
systems Windows Linux Mac so
you know I think all of that is very
exciting I I will say while in theory
there's nothing you could do there that
you couldn't have done through just
giving the model the API to drive the
computer screen this really lowers
the barrier and
there's a lot of folks who
either
aren't in a position to
interact with those APIs or it takes
them a long time to do the
screen is just a universal interface
that's a lot easier to interact with and
so I expect over time this is going to
lower a bunch of barriers now honestly
the current model
leaves a lot still to be desired and we
were honest about that in the
blog right it makes mistakes it
misclicks and we were
careful to warn people hey
you can't just leave this thing to
run on your computer for
minutes and minutes you've got to give
this thing boundaries and guard rails
and I think that's one of the reasons we
released it first in an API form rather
than
just hand it to the
consumer and give it control of
their computer but
but you know I definitely feel that it's
important to get these capabilities out
there as models get more powerful we're
going to have to grapple with you know
how do we use these capabilities safely
how do we prevent them from being abused
and I think
releasing the model
while the capabilities are
still limited
is very helpful in terms
of doing that I think since
it's been released a number of customers
I think Replit was maybe one
of the quickest
to deploy
things have made
use of it in various ways people have
hooked up demos for Windows
desktops Macs
Linux machines
so yeah it's been
very exciting I think as with
anything else it comes
with new exciting abilities and
then with those
new exciting abilities we have to think
about how to make the
model safe reliable do what
humans want them to do I mean it's the
same it's the same story for everything
right same thing it's that same tension
but but the possibility of use cases
here is just the the range is incredible
so how much to make it work really
well in the future how much do you have
to specially kind of go beyond what
the pre-trained model's doing do more
post-training RLHF or supervised
fine-tuning or synthetic data just for
the agentic stuff yeah I think speaking at
a high level it's our intention to keep
investing a lot in
making the model better like
I think we look at
some of the
benchmarks where previous models
could do it 6% of the time and
now our model would do it 14 or 22% of the
time and yeah we want to get up to
the human level reliability of 80
90% just like anywhere else right we're
on the same curve that we were on with
SWE-bench where I think I would guess a
year from now the models can do this
very very reliably but you got to start
somewhere so you think it's possible to
get to the the human level 90% uh
basically doing the same thing you're
doing now or is it has to be special for
computer use I mean it depends what
you mean by special and general
but I generally think
you know the same kinds of techniques
that we've been using to train the
current model I expect that doubling
down on those techniques in the same way
that we have for code for
models in general
for image input for
voice I expect those same techniques
will scale here as they have everywhere
else but this is giving sort of the
power of action to Claude and so you
could do a lot of really powerful things
but you could do a lot of damage also
yeah yeah no and we've been very aware
of that look my view actually is
computer use isn't a fundamentally new
capability like the CBRN or autonomy
capabilities are it's more like it
kind of opens the aperture for the model
to use and apply its existing abilities
and so the way we think about it
going back to our RSP is nothing that
this model is
doing inherently increases
the risk from an RSP
perspective but as the models get more
powerful having this capability may make
it scarier once it has
the cognitive capability to
do something at the ASL-3 and ASL-4
level this may be the
thing that kind of unbounds it from doing
so so going forward certainly this
modality of interaction is something we
have tested for and that we will
continue to test for going
forward I think it's probably better
to learn and explore this
capability before the model is
super capable yeah and there's
uh a lot of interesting attacks like
prompt injection because now you've
widened the aperture so you can prompt
inject through stuff on screen so if
this becomes more and more useful then
there's more and more benefit to
inject stuff into the model if it goes
to a certain web page it could be harmless
stuff like advertisements or it could be
like harmful stuff right yeah I mean we
thought a lot about things like spam CAPTCHA
you know Mass C there's all you know
every like one secret I'll tell
you if you've invented a new technology
not necessarily the biggest misuse
but the first misuse you'll see is
scams just petty scams
it's like a
thing as old people scamming each other
it's this thing as old as
time and it's just every time
you've got to deal with it it's almost
silly to say but it's true sort of
and spam in general is a thing as it
gets more and more intelligent
there are a lot of like I said
there are a lot of petty criminals in
the world and it's like
every new technology is like a new way
for petty criminals to do
something stupid and
malicious are there any ideas about
sandboxing it like how difficult is the
sandboxing task yeah we sandbox during
training so for example during training
we didn't expose the model to the
internet um I think that's probably a
bad idea during training because uh you
know the model can be changing its
policy it can be changing what it's
doing and it's having an effect in the
real world in terms of
actually deploying the model right it
kind of depends on the application like
you know sometimes you want the model to
do something in the real world but of
course you can
always put guard rails on the outside
right you can say okay well
this model is not going to move data
from my you know this model is not going
to move any files from my computer or
my web server to anywhere else now when
you talk about sandboxing again when we
get to ASL-4 none of these precautions
are going to make sense there right
when you talk about ASL-4
then
there's a theoretical worry
the model could be smart enough to
break out of any box
and so there we need to think about
mechanistic interpretability about
if we're going to have a
sandbox it would need to be
mathematically provably sound but
that's a whole different
world than what we're dealing with with
the models
today yeah the science of building a box
from which an ASL-4 AI system cannot escape
I think it's probably not the right
approach I think the right approach
instead of having something
unaligned that you're trying
to prevent from escaping I think
it's better to just design the model the
right way or have a loop where
you look inside the
model and you're able to verify properties
and that gives you an opportunity to
iterate and actually get it right
I think
containing bad models is a much
worse solution than having good models
let me ask about regulation what's the
role of regulation in keeping AI safe so
for example you described California AI
regulation bill SB 1047 that was
ultimately vetoed by the governor what
are the pros and cons of this bill
in general yes we ended up making some
suggestions to the bill and then some of
those were adopted and we felt
I think quite positively
about the bill
by the end of that it did still
have some downsides and
of course it got vetoed I
think at a high level I think some of
the key ideas behind the bill um are you
know I would say similar to ideas behind
our RSPs and I think it's very important
that some jurisdiction whether it's
California or the federal government
and/or other countries and other states
passes some regulation like this and I
can talk through why I think that's so
important so I feel good about our RSP
it's not perfect it needs to be iterated
on a lot but it's been a good forcing
function for getting the company to take
these risks seriously to put them into
product planning to really make them a
central part of work at Anthropic and to
make sure that all the thousand people
and it's almost a thousand people now at
Anthropic understand that this is one of
the highest priorities of the company if
not the highest priority
but one there are still
some companies that don't have RSP-like
mechanisms OpenAI and Google did
adopt these mechanisms a couple months
after Anthropic did but
there are other companies out
there that don't have these mechanisms
at all uh and so if some companies adopt
these mechanisms and others don't uh
it's really going to create a situation
where you know some of these dangers
have the property that it doesn't matter
if three out of five of the companies
are being safe if the other two are
being unsafe it creates this
negative externality and and I think the
lack of uniformity is not fair to those
of us who have put a lot of effort into
being very thoughtful about these
procedures the second thing is I don't
think you can trust these companies to
adhere to these voluntary plans in their
own right I like to think that Anthropic
will we do everything we can that we
will our RSP is checked by
our Long-Term Benefit Trust so
we do everything we can to
adhere to our own RSP but you know
you hear lots of things about various
companies saying oh they said they would
do they said they would give this much
compute and they didn't they said they
would do this thing and they didn't um
you know I don't think it makes
sense to
litigate particular things that
companies have done but I think
this broad principle that like if
there's nothing watching over them
there's nothing watching over us as an
industry there's no guarantee that we'll
do the right thing and the stakes are
very high and so I think
it's important to have a uniform
standard that
everyone follows and to make sure that
simply that the industry does what a
majority of the industry has already
said is important and has already said
that they definitely will do right some
people I think there's
a class of people who are against
regulation on principle I understand
where that comes from if you go to
Europe and you see something
like GDPR you see some of the other
stuff that they've
done some of it's good but
some of it is really unnecessarily
burdensome and I think it's fair to say
really has slowed
innovation and so I understand where
people are coming from on priors I
understand why people start
from that position
but again I think AI is different if
we go to the very serious risks of
autonomy and misuse that I
talked about just a few
minutes ago I think that those are
unusual and they warrant an unusually
strong response and so I think it's
very important again we need
something that everyone can get behind
I think one of the issues
with
SB 1047 especially the original version
of it was it had a bunch of the
structure of RSPs but it also had a
bunch of stuff that was either clunky or
that just would have created a
bunch of burdens a bunch of hassle and
might even have missed the target in
terms of addressing the risks um you
don't really hear about it on Twitter
you just hear about kind of
people are cheering for any
regulation and then the folks who are
against make up these often quite
intellectually dishonest arguments about
how it'll make us
move away from California the bill
doesn't apply if you're headquartered in
California the bill only applies if you do
business in California or that it
would damage the open source ecosystem
or that it would cause
all of these things I think
those were mostly nonsense but there are
better arguments against regulation
there's one guy Dean Ball who's
really I think a very
scholarly analyst who looks at what
happens when a regulation is put in
place and ways that they can kind of get
a life of their own or how they can be
poorly designed and so our interest has
always been we do think there should be
regulation in this space but we want to
be an actor who makes sure that that
that that regulation is something that's
surgical that's targeted at the serious
risks and is something people can
actually comply with because something I
think the advocates of regulation don't
understand as well as they could is if
we get something in place
that's poorly targeted that wastes a
bunch of people's time what's going to
happen is people are going to say see
these safety risks
this is nonsense I
just had to hire 10 lawyers to
fill out all these forms I had
to run all these tests for something
that was clearly not dangerous and after
6 months of that there
will be a groundswell and
we'll end up with a durable
consensus against regulation and so
I think the worst enemy of those
who want real accountability is badly
designed regulation we need to
actually get it right and
if there's one thing I could say to the
advocates it would be that I want
them to understand this dynamic better
and we need to be really careful and we
need to talk to people who actually have
experience seeing how
regulations play out in practice and and
the people who have seen that understand
to be very careful if this was some
lesser issue I might be against
regulation at all but what I want
the opponents to understand is that
the underlying issues are actually
serious they're not
something that I or the other companies
are just making up because of regulatory
capture they're not sci-fi fantasies
they're not any of these
things every time we
have a new model every few months we
measure the behavior of these models and
they're getting better and better at
these concerning tasks just as they are
getting better and better at um you know
good valuable economically useful tasks
and so I would just love it if
you know I think
SB 1047 was very polarizing I would love
it if some of the most reasonable
opponents and some of the most
reasonable proponents would sit
down together and I
think the
different AI companies you know
Anthropic was the only AI company
that felt positively in a very
detailed way I think Elon
tweeted briefly something positive but
some of the big
ones like Google OpenAI Meta Microsoft
were pretty staunchly against so what I
would really like is if some
of the key stakeholders some of the
thoughtful proponents and some
of the most thoughtful opponents would
sit down and say how do we solve this
problem in a way that the proponents
feel brings a real reduction in risk and
that the opponents feel
is not hampering the industry or
hampering innovation any more
than it needs to and
I think for whatever reason
things got too polarized and those two
groups
didn't get to sit down in the way that
they should and I feel
urgency I really think we need to do
something in
2025 if we get to the end
of 2025 and we've still done nothing
about this then I'm going to be worried
I'm not worried yet because
again the risks aren't here yet but
I think time is running short yeah
and come up with something surgical like
you said yeah exactly and
we need to get away from
this intense
pro-safety versus intense
anti-regulatory rhetoric right it's
turned into these flame wars on
Twitter and nothing good's going to come
of that so there's a lot of curiosity
about the different players in the game
one of the OGs is OpenAI you have
had several years of experience at
OpenAI what's your story and history there
yeah so I was at OpenAI for
roughly five years for the last I
think it was a couple years
I was vice president of
research there probably myself and
Ilya Sutskever were the ones who
really kind of set the research
direction around 2016 or 2017 I first
started to really believe in or at least
confirm my belief in the scaling
hypothesis when Ilya famously said
to me the thing you need to understand
about these models is they just want to
learn the models just want to learn
and again sometimes there
are these one
sentences these Zen koans that you hear
them and you're like ah that
explains everything that explains like a
thousand things that I've seen and then
ever after I had
this visualization in my head of like
you optimize the models in the right way
you point the models in the right way
they just want to learn they just want
to solve the problem regardless of what
the problem is so get out of their way
basically get out of their way yeah
don't impose your own ideas about how
they should learn and you know this was
the same thing as Rich Sutton put out in
The Bitter Lesson or Gwern put out in
The Scaling Hypothesis I think
generally the dynamic was I
got this kind of inspiration from
Ilya and from others folks
like Alec Radford who did the
original
GPT-1 and then ran really hard with
it me and my collaborators on GPT-2
GPT-3 RL from human feedback which was an
attempt to kind of deal with the early
safety and durability things like debate
and amplification heavy on
interpretability so again the
combination of safety plus scaling
probably 2018 2019 2020 those
were kind of the years when myself
and my collaborators
many of whom became
co-founders of Anthropic kind of really
had a vision and
drove the direction why'd you leave
why'd you decide to leave yeah so look I'm
gonna put things this way and
I think it ties to
the race to the top right which is
in my time at OpenAI what I'd
come to see is I'd come to appreciate
the scaling hypothesis and as I'd come
to appreciate kind of the importance of
safety along with the scaling hypothesis
the first one I think OpenAI
was getting on board
with the second one in a way had
always been part of OpenAI's
messaging but over
many years of the time that
I spent there I think I had a particular
vision of how we should handle
these things how they should be brought
out in the world the kind of principles
that the organization should have and
look I mean there were like many many
discussions about like should
the company do this
should the company do that like there's
a bunch of misinformation out there
people say like we left because we
didn't like the deal with Microsoft
false although you know there was like a
lot of discussion a lot of questions
about exactly how we do the deal with
Microsoft we left because we didn't
like commercialization that's not true
we built GPT-3 which was the model that
was commercialized I was involved in
commercialization it's more again
about how do you do it like civilization
is going down this path to very powerful
AI what's the way to do it that is
cautious
straightforward honest that builds
trust in the organization and in
individuals how do we get from here to
there and how do we have a real vision
for how to get it right how can safety
not just be something we say because it
helps with recruiting and I
think at the end of the day
if you have a vision for that forget
about anyone else's vision I don't want
to talk about anyone else's vision if
you have a vision for how to do it you
should go off and you should do that
vision it is incredibly unproductive to
try and argue with someone else's vision
you might think they're not doing it the
right way you might think
they're dishonest who knows
maybe you're right maybe you're not
but what you should do is you
should take some people you trust and
you should go off together and you
should make your vision happen and if
your vision is compelling if you can
make it appeal to people some
combination of ethically
in the market if you can
make a company that's a place
people want to join that
engages in practices that people think
are are reasonable while managing to
maintain its position in the ecosystem
at the same time if you do that people
will copy it and the fact that you
are doing it especially the fact that
you're doing it better than they are
causes them to change their behavior in
a much more compelling way than if
they're your boss and you're arguing
with them I just I don't know how to be
any more specific about it than that but
I think it's generally very unproductive
to try and get someone else's vision to
look like your vision it's much more
productive to go off and do a clean
experiment and say this is our vision
this is how we're
going to do things your choice is
you can ignore us you can reject
what we're doing or you can
start to become more like us and
imitation is the sincerest form of
flattery and that
plays out in the behavior of customers
that plays out in the behavior of the
public that plays out in the behavior of
where people choose to work and
again at the end it's not about one
company winning or another company
winning if we or another company are
engaging in some practice that
people find genuinely appealing
and I want it to be in substance not
just in appearance and
I think researchers are
sophisticated and they look at substance
uh and then other companies start
copying that practice and they win
because they copied that practice that's
great that's success that's like the
race to the top it doesn't matter who
wins in the end as long as everyone is
copying everyone else's good practices
right one way I think of it is like the
thing we're all afraid of is a race to
the bottom right in the race to the bottom it
doesn't matter who wins because we all
lose right like you know in the most
extreme world we we make this autonomous
AI that you know the robots enslave us
or whatever right I mean that's half
joking but you know that that is the
most extreme uh thing that
could happen then then it doesn't matter
which company was ahead um if instead
you create a race to the top where
people are competing to engage in good
in good practices uh then you know at at
the end of the day you know it doesn't
matter who who ends up who ends up
winning doesn't even matter who who
started the race to the top the point
isn't to be virtuous the point is to get
the system into a better equilibrium
than it was before and and individual
companies can play some role in doing
this individual companies can can you
know can help to start it can help to
accelerate it and frankly I think
individuals at other companies have have
done this as well right the individuals
that when we put out an RSP react by
pushing harder to to to get something
similar done get something similar done
at at at other companies sometimes other
companies do something that's like we're
like oh it's a good practice we think we
think that's good we should adopt it too
the only difference is you know I think
I think we are um we try to be more
forward leaning we try and adopt more of
these practices first and adopt them
more quickly when others when others
invent them but I think this Dynamic is
what we should be pointing at and that I
think I think it abstracts away the
question of you know which company's
winning who trusts who I I think all
these all these questions of drama are
are profoundly uninteresting and and the
the thing that matters is the ecosystem
that we all operate in and how to make
that ecosystem better because that
constrains all the players and so
Anthropic is this kind of clean
experiment built on a foundation of like
what concretely AI safety should look
like we look I'm sure we've made plenty
of mistakes along the way the perfect
organization doesn't exist it has to
deal with the the imperfection of a
thousand employees it has to deal
with the imperfection of our leaders
including me it has to deal with the
imperfection of the people we've put
we've put to you know to oversee the
imperfection of the of the leaders like
the like the board and the long-term
benefit trust it's it's all it's all a
set of imperfect people trying to aim
imperfectly at some ideal that will
never perfectly be achieved um that's
what you sign up for that's what it will
always be but uh uh imperfect doesn't
mean you just give up there's better and
there's worse and hopefully hopefully we
can begin to build we can do well enough
that we can begin to build some
practices that the whole industry
engages in and then you know my guess is
that multiple of these companies will
be successful Anthropic will be
successful these other companies like
ones I've been at in the past will also be
successful and some will be more
successful than others that's less
important than again that we we align
the incentives of the industry and that
happens partly through the race to the
top partly through things like RSP
partly through again selected surgical
regulation you said talent density beats
talent mass so can you explain that can you
expand on it can you just talk about
what it takes to build a great team of
AI researchers and Engineers this is one
of these statements that's like more
true every every every month every month
I see this statement as more true than I
did the month before so if I were to do
a thought experiment let's say you have
a team of 100 people that are super
smart motivated and aligned with the
mission and that's your company or you
can have a team of a thousand people
where 200 people are super smart super
aligned with the mission and then uh
like and then like 800 people are let's
just say you pick 800 like random random
big Tech employees which would you
rather have right the talent mass is
greater in in the group of uh in the
group of a thousand people right you
have you have even even a larger number
of incredibly talented incredibly
aligned incredibly smart people um uh
but but the the issue is just that
if every time someone super talented
looks around they see someone else super
talented and super dedicated that sets
the tone for everything right that sets
the tone for everyone is super inspired
to work at the same place everyone
trusts everyone else if you have a
thousand or 10,000 uh people and and
things have really regressed right you
are not able to do selection and you're
choosing random people what happens is
then you need to put a lot of processes
and a lot of guard rails in place um
just because people don't fully trust
each other you have to adjudicate
political battles like there are so many
things that slow down the org's ability
to operate and so we're nearly a
thousand people and you know we've we've
we've tried to make it so that as large
a fraction of those thousand people as
possible are like super talented super
skilled it's one of the reasons we've
we've slowed down hiring a lot in the
last few months we grew from 300 to 800
I believe I think in the first seven
eight months of the year and now we've
slowed down we're at like you know last
three months we went from 800 to 900 950
something like that don't quote me on
the exact numbers but I think there's an
inflection point around a thousand and
we want to be much more careful how how
we how we grow uh early on and and now
as well you know we've hired a lot of
physicists um you know theoretical
physicists can learn things really fast
um uh even even more recently as we've
continued to hire that you know we've
really had a high bar for on both the
research side and the software
engineering side have hired a lot of
senior people including folks who used
to be at other at other companies in
this space and we we've just continued
to be very selective it's very easy to
go from 100 to a thousand, a thousand to
10,000 without paying attention to
making sure everyone has a unified
purpose it's so powerful if your company
consists of a lot of different fiefdoms that
all want to do their own thing they're
all optimizing for their own thing um uh
it's very hard to get anything done but
if everyone sees the broader purpose of
the company if there's trust and there's
dedication to doing the right thing that
is a superpower that in itself I think
can overcome almost every other
disadvantage and you know it's the Steve
Jobs A players, A players want to look
around and see other A players is
another way of of saying I don't know
what that is about human nature but it
is demotivating to see people who are
not obsessively driving towards a
singular Mission and it is on the flip
side of that super motivating to see
that it's interesting uh what's it take
to be a great AI researcher or engineer
from everything you've seen from working
with so many amazing people yeah um I
think the number one quality especially
on the research side but really both is
open-mindedness sounds easy to be
open-minded right you're just like oh
I'm open to anything um but you know if
I if I think about my own early history
in the scaling hypothesis um I was
seeing the same data others were seeing
I don't think I was like a better
programmer or better at coming up with
research ideas than any of the hundreds
of people that I worked with um in some
ways in some ways I was worse um uh you
know like I've never like you know
precise programming of like you know
finding the bug writing the GPU kernels
like I could point you to a 100 people
here who are better who are better at
that than I am um but but the the thing
that that that I think I did have that
was different was that I was just
willing to look at something with new
eyes right people said oh you know we
don't have the right algorithms yet we
haven't come up with the right the right
way to do things and I was just like uh
I don't know like you know this neural
net has like 30 billion 30 million
parameters like what if we gave it 50
million instead like let's plot some
graphs like that that basic scientific
mindset of like oh man like I I I just I
just like I you know I see some variable
that I could change like what happens
when it changes like let's let's try
these different things and like create a
graph for even this this was like the
simplest thing in the world right change
the number of you know this wasn't like
PhD level experimental design this was
like this was like simple and stupid
like anyone could have done this if you
if you just told them that that that it
was important it's also not hard to
understand you didn't need to be
brilliant to come up with this um but
you put the two things together and you
know some tiny number of people some
single-digit number of people have
have driven forward the whole field by
realizing this uh and and it's you know
it's often like that if you look back at
the Discover you know the discoveries in
in in history they're they're often like
that and so this this open-mindedness
and this willingness to see with new
eyes that often comes from being newer
to the field often experience is a
disadvantage for this that is the most
important thing it's very hard to look
for and test for but I think I think
it's the most important thing because
when you when you find something some
really new way of thinking thinking
about things when you have the
initiative to do that it's absolutely
transformative and also be able to do
kind of Rapid experimentation and in the
face of that be open-minded and curious
and looking at the data from just these
fresh eyes and see what is that actually
saying that applies in uh mechanistic
interpretability it's another example of
this like some of the early work in
mechanistic interpretability is so simple
it's it's just no one thought to care
about this question before you said what
it takes to be a great AI researcher can
we rewind the clock back what what
advice would you give to people
interested in AI they're young looking
forward how can I make an impact on the
world I think my number one piece of
advice is to just start playing with the
models um this was actually I I I worry
a little this seems like obvious advice
now I think three years ago it wasn't
obvious and people started by oh let me
read the latest reinforcement learning
paper let me you know let me let me kind
of um no I mean that was really the that
was really the the and I mean you should
do that as well but uh now you know with
wider availability of models and APIs
people are doing this more but I think I
think just experiential knowledge um
these models are new artifacts that no
one really understands um and so getting
experience playing with them I would
also say again in line with the like do
something new think in some new
Direction like there are all these
things that haven't been explored like
for example mechanistic interpretability
is still very new it's probably better
to work on that than it is to work on
new model architectures because it's you
know it's more popular than it was
before there are probably like a hundred
people working on it but there aren't
like 10,000 people working on it and
it's it's just this just this this
fertile area for study like like you
know it's there's there's so much like
low-hanging fruit you can just walk by
and you know you can just walk by and
you can pick things um and and the the
the only reason for whatever reason
people aren't people aren't interested
in it enough I think there are some
things around long long Horizon learning
and long Horizon tasks where there's a
lot to be done I think evaluations are
still we're still very early in our
ability to study evaluations
particularly for dynamic systems acting
in the world I think there's some stuff
around
multi-agent um skate where the puck is
going is my is my advice and you don't
have to be brilliant to think of it like
all the things that are going to be
exciting in 5 years like in in people
even mention them as like you know
conventional wisdom but like it's it's
just somehow there's this barrier that
people don't people don't double down as
much as they could or they're afraid to
do something that's not the popular
thing I don't know why it happens but
like getting over that barrier is the
that's the my number one piece of advice
let's talk if we could a bit about
post-training yeah so it uh seems that
the modern post-training
recipe has uh a little bit of everything
so supervised fine-tuning
RLHF uh the constitutional AI
with RLAIF best acronym it's again
that naming
thing uh and then synthetic data seems
like a lot of synthetic data or at least
trying to figure out ways to have high
quality synthetic data so what's the uh
if this is a secret sauce that makes
Anthropic's Claude so uh incredible
how much of the magic is in the
pre-training how much of it is in the
post-training yeah um I mean so first of
all we're not perfectly able to measure
that ourselves um uh you know when you
see some some great character ability
sometimes it's hard to tell whether it
came from pre-training or post-training
uh we developed ways to try and
distinguish between those two but
they're not perfect you know the second
thing I would say is you know it's when
there is an advantage and I think we've
been pretty good at in general in
general at RL Perhaps Perhaps the best
although although I don't know because I
don't see what goes on inside other
companies uh
usually it isn't oh my God we have this
secret magic method that others don't
have right usually it's like well you
know we got better at the infrastructure
so we could run it for longer or you
know we were able to get higher quality
data or we were able to filter our data
better or we were able to you know combine
these methods in practice it's
usually some boring matter of matter of
kind of uh practice and tradecraft um so
you know when I think about how to do
something special in terms of how we
train these models both pre-training but
even more so post-training um you know I
I I really think of it a little more
again as like designing airplanes or
cars like you know it's not just like oh
man I have the blueprint like maybe
that makes you make the next airplane
but like there's some there's some
cultural tradecraft of how we think
about the design process that I think is
more important than than you know than
than any particular gizmo we're able to
invent okay well let me ask you
about specific techniques so first on
RLHF what do you think just zooming
out intuition almost philosophy why do
you think RLHF works so well if I go back
to like the scaling hypothesis one of
the ways to state the scaling hypothesis
is if you train for X and you throw
enough compute at it um then you get X
and so RLHF is good at doing what
humans want the model to do or at least
um to state it more precisely doing what
humans who look at the model for a brief
period of time and consider different
possible responses would prefer as the
response uh which is not perfect from
both a safety and capabilities
perspective in that humans are are often
not able to perfectly identify what the
model wants and what humans want in the
moment may not be what they want in the
long term so there's there's a lot of
subtlety there but the models are good
at uh you know producing what the humans
in some shallow sense want uh and it
actually turns out that you don't even
have to throw that much compute at it
because of another thing which is this
this thing about a strong pre-trained
model being halfway to anywhere uh uh uh
so once you have the pre-trained model
you have all the representations you
need to to get the model uh to get the
model where you where you want it to go
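[Editor's sketch: the idea just described, that humans compare possible responses and the model is trained toward the preferred one, is often formalized as a pairwise preference loss. The snippet below is a minimal illustration, assuming a Bradley-Terry style objective; the speaker does not name a specific loss, and the function name and the scalar reward scores are hypothetical.]

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-probability that the human-preferred response wins.

    Assumes (hypothetically) that a reward model has already assigned a
    scalar score to each of the two sampled responses.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written so exp() never sees a large positive input
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the preferred response already scores higher,
# and large when the reward model ranks the pair the wrong way round.
print(preference_loss(2.0, 0.0))
print(preference_loss(0.0, 2.0))
```

[Minimizing this loss over many human comparisons pushes the reward model to score preferred responses higher; the policy is then optimized against that reward model.]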
so do you think
RLHF makes the model smarter or just
appears smarter to the humans I don't
think it makes the model smarter I don't
think it just makes the model appear
smarter it's like
RLHF like bridges the gap between the
human and the model right I could have
something really smart that like can't
communicate at all right we all know
people like this um people who are
really smart but that you know you can't
understand what they're saying um uh so
I think RLHF just bridges that
gap um I think it's not the
only kind of RL we do it's not the only
kind of RL that will happen in the
future I think RL has the potential to
make models smarter to make them reason
better to make them operate better to
make them develop new skills even and
perhaps that could be done you know even
in some cases with human feedback but
the kind of RLHF we do today mostly
doesn't do that yet although we're very
quickly starting to be able to but it it
appears to sort of increase if you look
at the metric of helpfulness it
increases that it also increases what
was this word in Leopold's essay
unhobbling where basically the models are
hobbled and then you do various
trainings to them to unhobble them so I
I know I like that word because it's
like a rare word but so I think RLHF
unhobbles the models in some ways
um and then there are other ways where the model
hasn't yet been unhobbled and you
know needs to unhobble if you
can say in terms of cost is pre-training
the most expensive thing or does
post-training creep up to that at the
present moment it is still the case that
uh pre-training is the majority of the
cost I don't know what to expect in the
future but I could certainly anticipate
a future where post-training is the
majority of the cost in that future you
anticipate would it be the humans or the
AI that's the costly thing for the
post-training I don't think you can
scale up humans enough to get high
quality any any kind of method that
relies on humans and uses a large amount
of compute it's going to have to rely on
some scaled supervision method like uh
uh like um it you know debate or
iterated amplification or something like
that so on
that super interesting um set of ideas
around constitutional AI can you describe
what it is as first detailed in the December
2022 paper and uh and beyond that what is
it yes so this was from two years ago
the basic idea is so we describe what
RLHF is you have uh you have a model and
uh it you know spits out two you know
like you just sample from it twice it
spits out two possible responses and
you're like human which response you
like better or another variant of it is
rate this response on a scale of 1 to
seven so that's hard because you need to
scale up human interaction and uh it's
very implicit right I don't have a sense
of what I what I want the model to do I
just have a sense of like what this
average of a thousand humans wants the
model to do so two ideas one is could
the AI system itself decide which uh
which response is better right could you
show the AI system these two responses
and and ask which which which response
is better and then second well what
criterion should the AI use and so then
there's this idea because you have a
single document a constitution if you
will that says these are the principles
the model should be using to to respond
and the AI system reads those um it
reads those principles as well as
reading the environment and the response
and it says well how good did the AI
model do um it's basically a form of
self-play you you're kind of training
the model against itself and so the AI
gives the response and then you feed
that back into what's called the
preference model which in turn feeds the
model to make it better um so you have
this triangle of like the AI the
preference model and the Improvement of
the AI itself and we should say that in
the Constitution the set of principles
are like human interpretable they're
like yeah yeah it's something both the
human and the AI system can read so it
has this nice this nice kind of
translatability or symmetry um you know
in practice we both use a model
constitution and we use RLHF and we use
some of these other methods so it's
turned into one tool in a toolkit
that both reduces the need for RLHF and
increases the value we get from
using each data point of RLHF um
it also interacts in interesting ways
with kind of future reasoning type RL
methods so um it's it's one tool in the
toolkit but but I I think it is a very
important tool well it's a compelling
one to us humans you know thinking about
the founding fathers and the founding of
the United
States the natural question is who and
how do you think it gets to define the
constitution the the set of principles
in the Constitution yeah so I'll give
like a practical um answer and a more
abstract answer I think the Practical
answer is like look in practice models
get used by all kinds of different like
customers right and and so uh you can
have this idea where you know the model
can can have specialized rules or
principles you know we fine-tune
versions of models implicitly we've
talked about doing it explicitly having
having special principles that people
can can build into the models um uh so
from a practical perspective the answer
can be very different for different
people uh you know a customer service
agent uh you know behaves very
differently from a lawyer and obeys
different principles um but I think at
the base of it there are specific
principles that the models uh you know
have to obey I think a lot of them are
things that people would agree with
everyone agrees that you know we don't
you know we don't want models to present
these CBRN risks um I think we can go a
little further and agree with some basic
principles of democracy and the rule of
law beyond that it gets you know very
uncertain and and there our goal is
generally for the models to be more
neutral to not espouse a particular
point of view and you know more just be
kind of like wise uh agents or advisers
that will help you think things through
and will you know present present
possible considerations but you know
don't express you know stronger specific
opinions OpenAI released a model spec
where it kind of clearly concretely
defines some of the goals of the model
and specific examples like AB how the
model should behave do you find that
interesting by the way I should mention
the I believe the brilliant John
Schulman was a part of that he's now at
Anthropic uh do you think this is a
useful direction might Anthropic release
a model spec as well yeah so I think
that's a pretty useful direction again
it has a lot in common with uh
constitutional AI so again another
example of like a race to the top right
we have something that's like we think
you know a better and more responsible
way of doing things um it's also a
competitive advantage um then uh others
kind of you know discover that it has
advantages and then start to do that
thing uh we then no longer have the
competitive Advantage but it's good from
the perspective that now everyone has
adopted a positive practice that others
were not adopting and so our response to
that is well looks like we need a new
competitive advantage in order to keep
driving this race upwards um so that's
that's how I generally feel about that I
also think every implementation of these
things is different so you know there
were some things in the model spec that
were not in constitutional AI and so you
know we you know we can always we can
always adopt those things or you know at
least learn from them um so again I
think this is an example of like the
positive Dynamic that uh that that that
I that that I think we should all want
the field to have let's talk about the
incredible essay Machines of Loving
Grace I recommend everybody read it it's
a long one it is rather long yeah it's
really refreshing to read concrete ideas
about what a positive future looks like
and you took sort of a bold stance
because like it's very possible you
might be wrong on the dates or specific
applications yeah I'm fully expecting to
you know to definitely be wrong about
all the details I might be be just
spectacularly wrong about the whole
thing and people will you know will
laugh at me for years um uh that's
that's how that's that's just how the
future works so you provided a bunch of
concrete positive impacts of AI and how
you know exactly a super intelligent AI
might accelerate the rate of
breakthroughs in for example biology and
chemistry that would then lead to things
like we cure most cancers prevent all
infectious disease double the human
lifespan and so on so let's talk about
this essay first can you give a high
level vision of this essay and um what
key takeaways that people should have
yeah I have spent a lot of time and
Anthropic has spent a lot of effort on
like you know how do we address the
risks of AI right how do we think about
those risks like we're trying to do a
race to the top you know that requires
us to build all these capabilities and
the abilities are cool but you know you
know we're we're we're like a big part
of what we're trying to do is like is
like address the risks and the
justification for that is like well you
know all these positive things you know
the the market is this very healthy
organism right it's going to produce all
the positive things the risks I don't
know we might mitigate them we might not
and so we can have more impact by trying
to mitigate the risks but I noticed that
one flaw in that way of thinking and
it's if not a change in how seriously I
take the risks it's it's maybe a change
in how I talk about them um is that you
know no matter how kind of logical or
rational that line of reasoning that I
just gave might be um if if you kind of
only talk about risks your brain only
thinks about risks and and so I think
it's actually very important to
understand what if things do go well and
the whole reason we're trying to prevent
these risks is not because we're afraid
of Technology not because we want to
slow it down it's it's it's
because if we can get to the other side
of these risks right if we can run the
gauntlet successfully um to you know to
to put it in Stark terms then then on
the other side of the gauntlet are all
these great things and these things are
worth fighting for and these things can
really inspire people and I think I
imagine because look you have all these
investors all these VCs all these AI
companies talking about all the positive
benefits of AI but as you point out it's
it's it's weird there's actually a dearth
of really getting specific about it
there's a lot of like random people on
Twitter like posting these kind of like
gleaming cities and this this just kind
of like Vibe of like grind accelerate
harder like kick out the D you know it's
it's just this very this very like
aggressive ideological but then you're
like well what are you what what what
what what are you actually excited about
um and so and so I figured that you know
I think it would be interesting and
valuable for someone who's actually
coming from the risk side to to try and
and to try and really make a try at at
explaining explaining explaining what
the benefits are um both because I think
it's something we can all get behind and
I want people to understand I want them
to really understand that this isn't
this isn't doomers versus
accelerationists um this this
is that if you have a true understanding
of of where things are going with with
AI and maybe that's the more important
axis AI is moving fast versus AI is not
moving fast then you really appreciate
the benefits and you you you you really
you want Humanity our civilization to
seize those benefits but you also get
very serious about anything that could
derail them so I think the starting
point is to talk about what this
powerful AI which is the term you like
to use uh most of the world uses AGI but
you don't like the term because it's uh
basically has too much baggage has
become meaningless it's like we're stuck
with the terms like maybe we're stuck
with the terms and my efforts to change
them are futile it's ADM I'll tell you
what else I don't this is like a
pointless semantic point but I I I I
keep talking about it public so I'm just
I'm just going to do it once more um uh
I I think it's it's a little like like
let's say it was like 1995 and Moore's law
is making the computers faster and like
for some reason there there there there
had been this like verbal tic that like
everyone was like well someday we're
going to have like supercomputers
and like supercomputers are going to be
able to do all these things that like
you know once we have supercomputers
we'll be able to like sequence the genome
and we'll be able to do other things and
so and so like one it's true the
computers are getting faster and as they
get faster they're going to be able to
do all these great things but there's
like there's no discrete point at which
you had a supercomputer and previous
computers were not like supercomputer
is a term we use but like it's a vague
term to just describe like computers
that are faster than what we have today
um there's no point at which you pass a
threshold and you're like oh my God
we're doing a totally new type of
computation and new and and so I feel
that way about AGI like there's just a
smooth exponential and like if if by AGI
you mean like like AI is getting better
and better and like gradually it's going
to do more and more of what humans do
until it's going to be smarter than
humans and then it's going to get
smarter even from there then then yes I
believe in AGI if but if if if AGI is
some discrete or separate thing which is
the way people often talk about it then
it's it's kind of a meaningless
buzzword yeah I mean to me it's just sort of a
platonic form of a powerful AI exactly how you
define it I mean you define it very
nicely so on the intelligence axis it's
just on pure intelligence it's smarter
than a Nobel Prize winner as you
describe across most relevant
disciplines so okay that's just
intelligence so it's uh both in
creativity and be able to generate new
ideas all that kind of stuff in every
discipline Nobel Prize winner okay in
their
prime it can use every modality it so uh
that's kind of self-explanatory but just
operate across all the modalities of the
world uh it can go off for many hours
days and weeks to do tasks and do its
own sort of detailed planning and only
ask you for help when it's needed uh it can
use this is actually kind of interesting
I think in the essay you said I mean
again it's a bet that it's not going to
be embodied but it can control embodied
tools so it can control tools robots
laboratory equipment the resources used
to train it can then be repurposed to
run millions of copies of it and each of
those copies would be independent that
can do their own independent work so you
can do the cloning of the intelligence
system yeah yeah I mean you you might
imagine from outside the field that like
there's only one of these right that
like you made it you've only made one
but the truth is that like the scale up
is very quick like we we do this today
we make a model and then we deploy
thousands maybe tens of thousands of
instances of it I think by the time you
know certainly within 2 to 3 years
whether we have these super powerful
AIs or not clusters are going to get to
the size where where you'll be able to
deploy millions of these and they'll be
you know faster than humans and so if
your picture is oh we'll have one and
it'll take a while to make them my point
there was no actually you have millions
of them right away and in general they
can learn and
act uh 10 to 100 times faster than
humans so that's a really nice
definition of powerful AI okay so that
but you also write that clearly such an
entity would be capable of solving
very difficult problems very fast but it
is not trivial to figure out how fast
two extreme positions both seem false to
me so the singularity is on the one
extreme and the opposite on the other
extreme can you describe each of the
extremes yeah why so yeah let's let's
describe the extreme so like one one
extreme would be well look um you know
uh if we look at kind of evolutionary
history like there was this big
acceleration where you know for hundreds
of thousands of years we just had like
you know single cell organisms and then
we had mammals and then we had apes and
then that quickly turned to humans
humans quickly built industrial
civilization and so this is going to
keep speeding up and there's no ceiling
at the human level once models get much
much smarter than humans they'll get
really good at building the next models
and you know if you write down like a
simple differential equation like this
is an exponential and so what's what's
going to happen is that uh models will
build faster models models will build
faster models and those models will
build you know nanobots that can like take
over the world and produce much more
energy than you could produce otherwise
and and so if you just kind of like
solve this abstract differential
equation then like 5 days after we you
know we build the first AI that's more
powerful than humans then then uh you
know like the world will be filled with
these AIS and every possible technology
that could be invented like will be
invented um I'm caricaturing this a
little bit um uh but I you know I think
that's one extreme and the reason that I
think that's not the case is is that one
I think they just neglect like the laws
of physics like it's only possible to do
things so fast in the physical world
like some of those Loops go through you
know producing faster hardware um uh
takes a long time to produce faster
hardware things take a long time there's
this issue of complexity like I think no
matter how smart you are like you know
people talk about oh we can make models
of biological systems that'll do
everything the biological systems look I
think computational modeling can do a
lot I did a lot of computational
modeling when I worked in biology but
like
just there are a lot of things that you
can't predict how they're you know
they're they're complex enough that like
just iterating just running the
experiment is going to beat any modeling
no matter how smart the system doing the
modeling is oh even if it's not
interacting with the physical world just
the modeling is going to be hard yeah I
think well the modeling is going to be
hard and getting the model to to to to
match the physical world is going to be
all right so it does have to interact with the
physical world to verify but it's just
you know you just look at even the
simplest problems like I you know I
think I talk about like you know the
three body problem or simple chaotic
prediction like you know or or like
predicting the economy it's really hard
to predict the economy two years out
like maybe the case is like you know
normal you know humans can predict
what's going to happen in the economy in
the next quarter although they can't
really do that maybe an AI system
that's you know a zillion times smarter
can only predict it out a year or
something instead of instead of a you
know you have the these kind of
exponential increase in computer
intelligence for linear increase in in
in ability to predict same with again
like you know biological molecules
molecules interacting you don't know
what's going to happen when you perturb
a when you perturb a complex system you
can find simple parts in it if you're
smarter you're better at finding these
simple parts and then I think human
institutions human institutions are just
are are really difficult like it's you
know it's it's been hard to get people I
won't give specific examples but it's
been hard to get people to adopt even
the technologies that we've developed
even ones where the case for their
efficacy is very very strong um you know
people have concerns they think things
are conspiracy theories like it's it's
just been it's been very difficult it's
also been very difficult to get you know
very simple things through the
regulatory system right I think you know
and you know I I don't want to just
disparage anyone who you know
works in regulatory systems of
any technology there are hard trade-offs
they have to deal with they have to save
lives but but the system as a whole I
think makes some obvious tradeoffs that
are very far from maximizing human
welfare and so if we bring AI systems
into this you
know into these human systems often the
level of intelligence may just not be
the limiting factor right it it it just
may be that it takes a long time to do
something now if the AI system uh
circumvented all governments if it just
said I'm dictator of the world and I'm
going to do whatever some of these
things it could do again the things
having to do with complexity I I I still
think a lot of things would take a while
I don't think it helps that the AI
systems can produce a lot of energy or
go to the moon like some people in
comments responded to the essay saying
the AI system can produce a lot of
energy and smarter AI systems that's
missing the point that kind of cycle
doesn't solve the key problems that I'm
talking about here um so I think I think
a bunch of people missed the point there
but even if it were completely
unaligned and you know could get around
all these human obstacles it would have
trouble but again if you want this to be
an AI system that doesn't take over the
world that doesn't destroy humanity then
then basically you know it's it's it's
going to need to follow basic human laws
right where you know if if we want to
have an actually good world like we're
going to have to have an AI system that
that interacts with humans not one that
kind of creates its own legal system or
disregards all the laws or all of that
so as inefficient as these processes are
you know we're going to have to deal
with them because there there needs to
be some popular and democratic
legitimacy in how these systems are
rolled out we can't have a small group
of people who are developing these
systems say this is what's best for
everyone right I think it's wrong and I
think in practice it's not going to work
anyway so you put all those things
together and you know we're not we're
not going to we're not going to you know
change the world and upload everyone in
five minutes uh I I I just I don't think
it A I don't think it's going to
happen and B you know to the
extent that it could happen it's it's
not the way to lead to a good world so
that's on one side on the other side
there's another set of perspectives
which I have actually in some ways more
sympathy for which is look we've seen
big productivity increases before right
you know economists are familiar with
studying the productivity increases that
came from the computer revolution and
internet revolution and generally those
productivity increases were
underwhelming they were less than you
than you might imagine um there was a
quote from Robert Solow you see the
computer revolution everywhere except
the productivity statistics so why is
this the case people point to the
structure of firms the structure of
enterprises how um uh you know how slow
it's been to roll out our existing
technology to very poor parts of the
world which I talk about in the essay
right how do we get these technologies
to the poorest parts of the world that
are behind on cell phone technology
computers medicine let alone you know
newfangled AI that hasn't been invented
yet um so you could have a perspective
that's like well this is amazing
technically but it's all a nothing burger
um uh you know I think um Tyler Cowen
who wrote something in response to my
essay has that perspective I think he
thinks the radical change will happen
eventually but he thinks it'll take 50
or 100 years and and you could have even
more static perspectives on the whole
thing I think there's some truth to it I
think the time scale is just is just too
long um and and I can see it I can
actually see both sides with today's AI
so uh you know a lot of our customers
are large enterprises who are used to
doing things a certain way um I've also
seen it in talking to governments right
those are those are prototypical you
know institutions entities that are slow
to change uh but the dynamic I see over
and over again is yes it takes a long
time to move the ship yes there's a lot
of resistance and lack of understanding
but the thing that makes me feel that
progress will in the end happen
moderately fast not incredibly fast but
moderately fast is that you talk to what
I find is I find over and over again
in large companies even in
governments um which have been actually
surprisingly forward leaning uh you find
two things that move things forward one
you find a small fraction of people
within a company within a government who
really see the big picture who see the
whole scaling hypothesis who understand
where AI is going or at least understand
where it's going within their industry
and there are a few people like that
within the current within the current US
government who really see the whole
picture and and those people see that
this is the most important thing in the
world and so they agitate for it and the
thing they they alone are not enough to
succeed because they are a small set of
people within a large organization
but as the technology starts to roll out
as it succeeds in some places in the
folks who are most willing to adopt it
the spectre of competition gives them a
wind at their backs because they can
point within their large organization
they can say look these other guys are
doing this right you know one bank can
say look this newfangled hedge fund is
doing this thing they're going to eat
our lunch in the US we can say we're
afraid China's going to get there before
before we are uh and that combination
the spectre of competition plus a few
visionaries within these you know within
these organizations that in many
ways are are sclerotic you put those two
things together and it actually makes
something happen I mean it's interesting
it's a balanced fight between the two
because inertia is very powerful but but
but eventually over enough time the
innovative approach breaks through um
and I've seen that happen I've seen the
arc of that over and over again and it's
like the the barriers are there the the
barriers to progress the complexity not
knowing how to use the model or how to
deploy them are there and and for a bit
it seems like they're going to last
forever like change doesn't happen but
then eventually change happens and it
always comes from a few people I felt
the same way when I was an advocate of
the scaling hypothesis within the AI
field itself and others didn't get it it
felt like no one would ever get it it
felt like then it felt like we had a
secret almost no one ever had and then a
couple years later everyone has the
secret and so I think that's how it's
going to go with deployment of AI in the
world it's going to the the barriers are
going to fall apart gradually and then
all at once and so I think this is going
to be more and this is just an instinct
I could I could easily see how I'm wrong
I think it's going to be more like
five or 10 years as I say in the essay
than it's going to be 50 or 100 years I
also think it's going to be five or 10
years
more than it's going to be you know five
or 10 hours uh uh because I've just I've
just seen how human systems work and I
think a lot of these people who write
down the differential equations who say
AI is going to make more powerful AI who
can't understand how it could possibly
be the case that these things won't
won't change so fast I think they don't
understand these things so what to you is
the timeline to where we achieve
AGI AKA powerful AI AKA super useful AI
I'm going to start calling it that it's a debate
it's a debate about
naming um you know on pure intelligence
it's smarter than a Nobel Prize
winner in every relevant discipline and
all the things we've said modality it
can go and do stuff on its own for days
weeks and do biology experiments uh on
its own in one you know what let's just
stick to biology because yeah I you you
sold me on the whole biology and health
section that's so exciting from um from
a just I was getting giddy from a
scientific perspective it made me want
to be a biologist it's almost it's it's
so no no that this was the feeling I had
when I was writing it that it's it's
like this would be such a beautiful
future if we can if we can just if we
can just make it happen right if we can
just get the get the landmines out of
the way and and and and make it happen
there's there's so much there's so much
beauty and elegance and
moral force behind it if if we can if we
can just and it's something we should
all be able to agree on right like as
much as we fight about about all these
political questions is is this something
that could actually bring us together um
but you were asking when when will we
get this when when do you think let's
just put numbers on so you know this
this is of course the thing I've been
grappling with for many years and I'm
not I'm not at all confident every time
if I say 2026 or 2027 there will be like
a zillion like people on Twitter who
will be like hey AI CEO said 2026 2020 and
it'll be repeated for like the next two
years that like this is definitely when
I think it's going to happen um so who
whoever's excerpting these clips will will
we we'll we'll crop out the thing I just
said and and only say the thing I'm
about to say um but I'll just say it
anyway um have so so uh if you
extrapolate the curves that we've had so
far right if if you say well I don't
know we're starting to get to like PhD
level and and last year we were at um uh
undergraduate level and the year before
we were at like the level of a high
school student again you can you can
quibble with at what tasks and for what
we're still missing modalities but those
are being added like computer use was
added like image in was added like image
generation has been added if you just
kind of like and this is totally
unscientific but if you just kind of
like eyeball the rate at which these
capabilities are increasing it does make
you think that we'll get there by 2026
or 2027 again lots of things could
derail it we could run out of data you
know we might not be able to scale
clusters as much as we want like you
know maybe Taiwan gets blown up or
something and you know then we can't
produce as many gpus as we want so there
there are all kinds of things that could
could derail the whole process so I
don't fully believe the straight line
extrapolation but if you believe the
straight line extrapolation you'll you
we'll get there in 2026 or 2027 I think
the most likely is that there's some
mild delay relative to that um
I don't know what that delay is but I
think it could happen on schedule I
think there could be a mild delay I
think there are still worlds where it
doesn't happen in in a hundred years
those world the number of those worlds
is rapidly decreasing we are rapidly
running out of truly convincing blockers
truly compelling reasons why this will
not happen in the next few years there
were a lot more in 2020 um although my
my guess my hunch at that time was that
we will make it through all those
blockers so sitting as someone who has
seen most of the blockers cleared out of
the way I kind of suspect my hunch my
suspicion is that the rest of them will
not block us uh but you know look look
at look at the end of the day like I
don't want to represent this as a
scientific prediction people call them
scaling laws that's a misnomer like Moore's
law is a misnomer Moore's law
scaling laws they're not laws of the
universe they're empirical regularities
I am going to bet in favor of them
continuing but I'm not certain of that
so you extensively describe sort of the
compressed 21st century how AGI will
help
uh set forth a chain of breakthroughs in
biology and medicine that help us in all
these kinds of ways that I mentioned so
how do you think what are the early
steps it might do and by the way I asked
Claude good questions to ask
you and Claude told me uh to ask what do
you think is a typical day for a
biologist working on AGI look like under
in this future yeah yeah Claude is
curious let me well let me start with
your first questions and then I'll then
I'll answer that Claude Claude wants to
know what's in his future right exactly
who's it who am I going to be working
with exactly um so I think one of the
things I went hard on in when I went
hard on in the essay is let me go back
to this idea of because it's it's really
had had an you know had an impact on me
this idea that within large
organizations and systems there end up
being a few people or a few new ideas
who kind of cause things to go in a
different direction than they would have
before who who kind of a
disproportionately affect the the
trajectory there's a bunch of kind of
the same thing going on right if you
think about the health world there's
like you know trillions of dollars to
pay out Medicare and you know other
health insurance and then the NIH is is
100 billion and then if I think of like
the the few things that have really
revolutionized anything it could be
encapsulated in a small small fraction
of that and so when I think of like
where will AI have an impact I'm like
can AI turn that small fraction into a
much larger fraction and raise its
quality and within biology my experience
within biology is that the biggest
problem of biology is that you can't see
what's going on you you have very little
ability to see what's going on and even
less ability to change it right what you
have is this like like from this you
have to infer that there's a bunch of
cells that within each cell is you know
uh uh three billion base pairs of DNA
built according to a genetic code uh uh
and you know there are all these
processes that are just going on without
any ability of us as you know
unaugmented humans to affect it these
cells are dividing most of the time
that's healthy but sometimes that
process goes wrong and that's cancer um
the cells are aging your skin may change
color develops wrinkles as you as you
age and all of this is determined by
these processes all these proteins being
produced transported to various parts of
the cells binding to each other and and
in our initial state about biology we
didn't even know that these cells
existed we had to invent microscopes to
observe the cells we had to uh we had to
invent more powerful microscopes to see
you know below the level of the cell to
the level of molecules we had to invent
x-ray crystallography to see the DNA we
had to invent gene sequencing to read
the DNA now you know we had to invent
protein folding technology to you know
to predict how it would fold and how
they bind and how these things bind to
each other uh you know we had to we had
to invent various techniques for now we
can edit the G the DNA as of you know
with CRISPR as of the last uh uh 12
years so the the whole history of
biology a whole big part of the history
is is basically our our our our ability
to read and understand what's going on
and our ability to reach in and
selectively change things um and and my
view is that there's so much more we can
still do there right you can do CRISPR
but you can do it for your whole body um
let's say I want to do it for one
particular type of cell and I want the
rate of targeting the wrong cell to be
very low that's still a challenge that's
still things people are working on
that's what we might need for gene
therapy for certain diseases and so the
reason I'm saying all of this and it
goes beyond you know beyond this to you
know to gene sequencing to new types of
nanomaterials for observing what's going
on inside cells for you know antibody
drug conjugates the the reason I'm
saying all this is that this could be a
leverage point for the AI systems right
that the number of such inventions it's
it's in the it's in the mid double
digits or something you know mid double
digits maybe low triple digits over the
history of biology let's say I have a
million of these AIS like you know can
they discover thousand you know working
together can they discover thousands of
these very quickly and and does that
provide a huge lever instead of trying
to leverage the you know two trillion a
year we spend on you know Medicare or
whatever can we leverage the 1 billion a
year that's that's you know that's spent
to discover but with much higher quality
um and so what what is it like you know
being a being a scientist that works
with uh with with an AI system the way I
think about it actually is well so I
think in the early stages uh the AIS are
going to be like grad students you're
going to give them a project you're
going to say you know I'm the
experienced biologist I've set up the
lab the biology professor or even the
grad students themselves will
say here's here's what uh here's what
you can do with an AI you know like a AI
system I'd like to study this and you
know the AI system it has all the tools
it can like look up all the literature
to decide what to do it can look at all
the equipment it can go to a website and
say hey I'm going to go to you know
Thermo Fisher or you know whatever the
lab equipment company is dominant lab
equipment company is today and my my
time was Thermo Fisher um uh you know I'm
I'm going to order this new equipment to
to to do this I'm going to run my
experiments I'm going to you know write
up a report about my experiments I'm
going to you know inspect the images for
contamination I'm going to decide what
the next experiment is I'm going to like
write some code and run a statistical
analysis all the things a grad student
would do there will be a computer with
an AI that like the professor talks to
every once in a while and it says this
is what you're going to do today the AI
system comes to it with questions um
when it's necessary to run the lab
equipment it may be limited in some ways
may have to hire a human lab assistant
to you know to do the experiment and
explain how to do it or it could you
know it could use advances in lab
automation that are gradually being
developed over have been developed over
the last uh uh decade or so and will
will continue to be will continue to be
developed uh and so it'll look like
there's a human professor and a thousand
AI grad students and you know if you if
you go to one of these Nobel
prizewinning biologist or so you'll say
okay well you know you had like 50 grad
students well now you have a thousand
and they're they're they're smarter than
you are by the way um uh then I think at
some point it'll flip around where the
you know the AI systems will you know
will will be the PIs will be the leaders
and and and you know they'll be they'll
be ordering humans or other AI systems
around so I think that's how it'll work
on the research side and they would be the
inventors of a CRISPR-type technology
they would be the inventors of a
CRISPR-type technology um and then I
think you know as I say in the essay
we'll want to turn turn probably turning
loose is the wrong the wrong term but we
want to want to harness the AI systems
uh to improve the clinical trial system
as well there's some amount of this
that's regulatory that's a matter of
societal decisions and that'll be harder
but can we get better at predicting the
results of clinical trials can we get
better at statistical design so that
what you know clinical trials that used
to require you know 5,000 people and
therefore you know needed $100 million
and a year to enroll them now they need
500 people in two months to enroll them
um that's where we should start uh and
and you know can we increase the success
rate of clinical trials by doing things
in animal trials that we used to do in
clinical trials and doing things in
simulations that we used to do in animal
trials again we won't be able to
simulate it all AI is not God um uh but
but you know can we can we shift the
curve substantially and radically so I I
don't know that would be my picture
doing in vitro and doing it I mean you're
still slowed down it still takes time
but you can do it much much faster yeah
yeah yeah can we just one step at a time
and and can that can that add up to a
lot of steps even though even though we
still need clinical trials even though
we still need laws even though the FDA
and other organizations will still not
be perfect can we just move everything
in a positive direction and when you add
up all those positive directions do you
get everything that was going to happen
from here to 2100 instead happens from
2027 to 2032 or something another way
that I think the world might be changing
with AI
even today but moving towards this
future of the the powerful super useful
AI is uh programming so how do you see
the nature of programming because it's
so intimate to the actual act of
building AI how do you see that changing
for us humans I think that's going to be
one of the areas that changes fastest um
for two reasons one programming is a
skill that's very close to the actual
building of the AI um so the farther a
skill is from the people who are
building the AI the longer it's going to
take to get disrupted by the AI right
like I truly believe that like AI will
disrupt agriculture maybe it already has
in some ways but that's just very
distant from the folks who are building
AI and so I think it's going to take
longer but programming is the bread and
butter of you know a large fraction of
of the employees who work at Anthropic
and at the other companies and so it's
going to happen fast the other reason
it's going to happen fast is with
programming you close the loop both when
you're training the model and when you're
applying the model the idea that the
model can write the code means that the
model can then run the code and and and
then see the results and and interpret
it back and so it really has an ability
unlike hardware unlike biology which we
just discussed the model has an ability
to close the loop um and and so I think
those two things are going to lead to
the model getting good at programming
very fast as I saw on you know typical
real world programming tasks models have
gone from 3% in January of this year to
50% in October of this year so you know
we're on that S curve right where it's
going to start slowing down soon because
you can only get to 100% but uh I you
know I I would guess that in another 10
months well we'll probably get pretty
close we'll be at at least 90% so again
I would guess you know I don't know how
long it'll take but I would guess again
202 2026 2027 Twitter people who crop
out my who who who crop out these these
numbers and get rid of the caveats like
like I don't know I don't like you go
away uh I would guess that the kind of
task that the vast majority of coders do
AI can
probably if we make the task very narrow
like just write code um AI systems will
uh be able to do that now that said I
think comparative advantage is powerful
we'll find that when AIS can do 80% of a
coder's job including most of it that's
literally like write code with a given
spec we'll find that the remaining parts
of the job become more leveraged for
humans right humans will they'll be more
about like high level system design or
you know looking at the app and like is
it architected well and the the design
and UX aspects and eventually AI will be
able to do those as well right that's my
vision of the you know powerful AI
system but I think for much longer than
we might expect we will see that
uh small parts of the job that humans
still do will expand to fill their
entire job in order for the overall
productivity to go up um that's
something we've seen you know it used to
be that you know writing you know
writing and editing letters was very
difficult and like writing the print was
difficult well as soon as you had word
processors and then and then uh and then
computers and it became easy to produce
work and easy to share it then then that
became instant and all the focus was on
was on the ideas so this this logic of
comparative advantage that expands tiny
parts of the tasks to large parts of the
tasks and creates new tasks in order to
expand productivity I think that's going
to be the case again someday AI will be
better at everything and that logic uh
won't apply and then then we all have
you know humanity will have to think
about how to collectively deal with that
and we're thinking about that every day
um and you know that's another one of
the grand problems to deal with aside
from misuse and autonomy and you know we
should take it very seriously but I
think I think in the in the near term
and maybe even in the medium term like
medium term like 2 three four years you
know I expect that humans will will
continue to have a huge role and the
nature of programming will change but
programming as a as a role programming
as a job will not change it'll just be
less writing things line by line and
it'll be more macroscopic and I wonder
what the future of IDEs looks like so
the tooling of interacting with AI
systems this is true for programming and
also probably true for in other contexts
like computer use but maybe domain
specific like we mentioned biology it
probably needs its own tooling about how
to be effective and then programming
needs its own tooling is Anthropic going
to play in that space of also tooling
potentially I'm absolutely convinced
that uh powerful
IDEs uh that that there's so much low
hanging fruit to be grabbed there um
that you know right now it's just like
you talk to the model and it talks back
but but look I mean IDEs are great at
kind of lots of static analysis of of
you know so much is possible with kind
of static analysis like many bugs you
can find without even writing the code
then uh you know IDEs are good for
running particular things organizing
your code um measuring coverage of unit
tests like there's so much that's been
possible with a normal with a normal
IDE now you add something like well the
model now you know the model can now
like write code and run code like I am
absolutely convinced that over the next
year or two even if the quality of the
models didn't improve that there would
be enormous opportunity to enhance
people's productivity by catching a
bunch of mistakes doing a bunch of grunt
work for people and that we haven't even
scratched the surface um Anthropic
itself I mean you can't say you know
no you know it's hard to say what will
happen in the future currently we're not
trying to make such IDEs ourselves rather
we're powering the companies like Cursor or
like Cognition or some of the other you
know
uh Expo in the security space um uh you
know others that I can mention as well
that are building such things themselves
on top of our API and our view has been
let a thousand flowers bloom we don't
internally have the the re you know the
resources to try all these different
things let's let our customers try it um
uh and you know we'll see who succeed
and maybe different customers will
succeed in different ways uh so I both
think this is super promising and you
know it's not it's not it's not
something you know Anthropic isn't isn't
eager to to at least right now compete
with all our companies in this space and
maybe never yeah it's been interesting
to watch Cursor try to integrate Claude
successfully because there's it's
actually fascinating how many places
it can help the programming experience
it's not as trivial it is it is really
astounding I feel like you know as a CEO
I don't get to program that much and I
feel like if six months from now I go
back it'll be completely unrecognizable
to me exactly um so in this world with
super powerful AI uh that's increasingly
automated what's the source of meaning
for us humans yeah you know work is a
source of deep meaning for many of us so
what do we uh where do we find the
meaning this is something that I've I've
written about a little bit in the essay
although I I actually I give it a bit
short shrift not for any um not for any
principled reason but this essay if you
believe it was originally going to be
two or three pages I was going to talk
about it at all hands and the reason I I
I realized it was an important
underexplored topic is that I just kept
writing things and I was just like oh
man I can't do this justice and so the
thing ballooned to like 40 or 50 pages and
then when I got to the work and meaning
section I'm like oh man this isn't going
to be 100 pages like I'm going to have to
write a whole other essay about that but
meaning is actually interesting because
you think about like the life that
someone lives or something or like you
know like you know let's say you were to
put me in like a I don't know like a
simulated environment or something where
like um you know like I have a job and
I'm trying to accomplish things I don't
know I like do that for 60 years and
then then you're like oh oh like oops
this was this was actually all a game
right does that really kind of rob you
of the meaning of the whole thing you
know like I still made important choices
including moral choices I still
sacrificed I still had to kind of gain
all these skills or or or just like a
similar exercise you know think back to
like you know one of the historical
figures who you know discovered
electromagnetism or relativity or
something if you told them well actually
20,000 years ago some some alien on you
know some alien on this planet
discovered this before before you did um
does that rob the meaning of
the discovery it doesn't really seem
like it to me right it seems like the
process is what is what matters and how
it shows who you are as a person along
the way and you know how you relate to
other people and like the decisions that
you make along the way those are those
are consequential um you know I I I
could imagine if we handle things badly
in an AI world we could set things up
where people don't have any long-term
source of meaning or any but but that's
that's more a choice a set of choices we
make that's more a set of the
architecture of a society with these
powerful models if we if we design it
badly and for shallow things then then
that might happen I would also say that
you know most people's lives today while
admirably you know they work very hard
to find meaning meaning in those lives
like look you know we who are privileged
and who are developing these
technologies we should have empathy for people
not just here but in the rest of the
world who who you know spend a lot of
their time kind of scraping by to to to
to to like survive assuming we can
distribute the benefits of
this technology to
everywhere like their lives are going to
get a hell of a lot better um and uh you
know meaning will be important to them
as it is important to them now but but
you know we should not forget the
importance of that and and you know that
that uh the idea of meaning as as as as
kind of the only important thing is in
some ways an artifact of of a small
subset of people who have who have been
uh economically fortunate but I you know
I think all that said I you know I think
a world is possible with powerful AI
that not only has as much meaning for
for everyone but that has that has more
meaning for everyone right that can can
allow um can allow everyone to see
worlds and experiences that it was
either possible for no one to see or or
possible for for very few people to
experience um so I I am optimistic about
meaning I worry about economics and the
concentration of power that's actually
what I worry about more um I I worry
about how do we make sure that that fair
World reaches everyone um when things
have gone wrong for humans they've often
gone wrong because humans mistreat other
humans uh that that is maybe in some
ways even more than the autonomous risk
of AI or the question of meaning that
that is the thing I worry about most um
the the concentration of power the abuse
of power um structures like autocracies
and dictatorships where a small number
of people exploits a large number of
people I'm very worried about that and
AI increases the amount of power in the
world and if you concentrate that power
and abuse that power it can do
immeasurable damage yes it's very
frightening it's very it's very
frightening
well I encourage people highly encourage
people to read the full essay that
should probably be a book or a sequence
of essays um because it does paint a
very specific future I could tell the
later sections got shorter and shorter
because you started to probably realize
that this is going to be a very long
essay one I realized it would be very
long and two I'm very aware of and very
much try to avoid um you know just just
being I I don't know I don't know what
the term for it is but one one of these
people who's kind of overconfident
and has an opinion on everything and
kind of says says a bunch of stuff and
isn't isn't an expert I very much tried
to avoid that but I have to admit once I
got to the biology sections like I wasn't
an expert and so as much as I expressed
uncertainty uh probably I said
a bunch of things that were embarrassing
or wrong well I was excited for the
future you painted and uh thank you so
much for working hard to build that
future and thank you for talking today Dario
thanks for having me I just I just hope
we can get it right and and make it real
and if there's one message I want to I
want to send it's that to get all this
stuff right to make it real we we both
need to build the technology build the
you know the companies the economy
around using this technology positively
but we also need to address the risks
because they're there those risks are in
our way they they're landmines on on the
way from here to there and we have to
defuse those landmines if we want to
get there it's a balance like all things
in life like all things thank you thanks
for listening to this conversation with
Dario Amodei and now dear friends here's
Amanda
Askell you are a philosopher by training
so what sort of questions did you find
fascinating through your journey in
philosophy in Oxford and NYU and then uh
switching over to the AI problems at
OpenAI and Anthropic I think philosophy
is actually a really good subject if you
are kind of fascinated with everything
so because there's a philosophy of
everything you know so if you do
philosophy of mathematics for a while
and then you decide that you're actually
really interested in chemistry you can
do philosophy of chemistry for a while
you can move into ethics or or
philosophy of politics um I think
towards the end I was really interested
in ethics primarily um so that was like
what my PhD was on it was on a kind of
technical area of ethics which was
ethics where worlds contain infinitely
many people strangely a little bit less
practical on the end of ethics and then
I think that one of the tricky things
with doing a PhD in ethics is that
you're thinking a lot about like the
world how it could be better
problems and you're doing like a PhD in
philosophy and I think when I was doing
my PhD I was kind of like this is really
interesting it's probably one of the
most fascinating questions I've ever
encountered in philosophy um and I love
it but I would rather see if I can have
an impact on the world and see if I can
like do good things and I think that was
around the time that AI was still
probably not as widely recognized as it
is now that was around 2017 2018 I had
been following progress and it seemed
like it was becoming kind of a big deal
and I was basically just happy to get
involved and see if I could help because
I was like well if you try and do
something impactful if you don't succeed
you tried to do the impactful thing and
you can go be a scholar and like not and
feel like you you you know you you tried
um and if it doesn't work out it doesn't
work out um and so then I went into AI
policy at that point and what does AI
policy entail at the time this was more
thinking about sort of the political
impact and the ramifications of AI um
and then I slowly moved into sort of uh
AI evaluation how we evaluate models how
they compare with like human outputs
whether people can tell like the
difference between Ai and human outputs
and then when I joined Anthropic I was
more interested in doing sort of
technical alignment work and again just
seeing if I could do it and then being
like if I can't uh then you know that's
fine I tried uh sort of the the way I
lead life I think oh what was that like
sort of taking the leap from the
philosophy of everything into the
technical I think that sometimes
people do this thing that I'm like not
that keen on where they'll be like is
this person technical or not like you're
either a person who can like code and
isn't scared of math or you're like not
um and I think I'm maybe just more like
I think a lot of people are actually
very capable of work in these kinds of
areas if they just like try it and so I
didn't actually find it like that bad in
retrospect I'm sort of glad I wasn't
speaking to people who treated it like
it you know I've definitely met
people who are like whoa you like learned
how to code and I'm like well I'm not
like an amazing engineer like I'm
surrounded by amazing engineers my
code's not pretty um but I enjoyed it a
lot and I think that in many ways at
least in the end I think I flourished
like more in the technical areas than I
would have in the policy areas politics
is messy and it's harder to find
solutions to problems in the space of
politics like definitive clear
provable beautiful
solutions as you can with technical
problems yeah and I feel like I have
kind of like one or two sticks that I
hit things with you know and one of them
is like arguments and like you know so
like just trying to work out what a
solution to a problem is and then trying
to convince people that that is the
solution and be convinced if I'm wrong and
the other one is sort of more empiricism
so like just like finding results having
a hypothesis testing it um and I feel
like a lot of policy and politics feels
like it's layers above that like somehow
I don't think if I was just like I have
a solution to all of these problems here
it is written down if you just want to
implement it that's great that feels
like not how policy works and so I think
that's where I probably just like
wouldn't have flourished is my guess
sorry to go in that direction but I
think it would be pretty inspiring for
people that are quote unquote
non-technical to see the
incredible journey you've been on so
what advice would you give to people
that are sort of maybe which is a lot
of people think they're underqualified
insufficiently technical to help in AI
yeah I think it depends on what they
want to do and in many ways it's a
little bit strange where I've I thought
it's kind of funny that I think I ramped
up technically at a time
when now I look at it and I'm like
models are so good at assisting people
with this stuff um that it's probably
like easier now than like when I was
working on this so part of me is like um
I don't know find a project uh and see
if you can actually just carry it out is
probably my best advice um I don't know
if that's just cuz I'm very project based
in my learning like I don't think I
learn very well from like say courses or
even from like books at least when it
comes to this kind of work uh the thing
I'll often try and do is just like have
projects that I'm working on and
implement them and you know and this can
include like really small silly things
like if I get slightly addicted to like
word games or number games or something
I would just like code up a solution to
them because there's some part of my
brain and it just like completely
eradicated the itch you know you're like
once you have like solved it and like
you just have like a solution that works
every time I would then be like cool I
can never play that game again that's
awesome yeah there's a real joy to
building like uh game playing engines
like uh board games especially yeah
pretty quick pretty simple especially a
dumb one and it's you and then you could
play with it yeah and then it's also
just like trying things like part of me is
like if you maybe it's that attitude
that I like as the
whole figure out what seems to be like
the way that you could have a positive
impact and then try it and if you fail
in a way that you're like
actually I can never succeed at this
you'll know that you tried and then
you go into something else you probably
learn a lot so one of the things that
you're expert in and you do is creating
and crafting Claude's character and
personality and I was told that you have
probably talked to Claude more than
anybody else at Anthropic like literal
conversations I guess there's like a
Slack channel where the legend goes you
just talk to it non-stop so what's the
goal of creating and crafting Claude's
character and personality it's also
funny if people think that about the
Slack channel cuz I'm like that's one of
like five or six different methods that
I have for talking with Claude and I'm
like yes that's a tiny percentage of how
much I talk with Claude uh
um I think the goal like one thing I
really like about the character work is
from the outset it was seen as an
alignment piece of work and not
something like a a product
consideration um which isn't to say I
don't think it makes Claude I think it
actually does make Claude enjoyable
to talk with at least I hope so um but I
guess like my main thought with it has
always been trying to get Claude to
behave the way you would kind of ideally
want anyone to behave if they were in
Claude's position so imagine that I take
someone and they're they know that
they're going to be talking with
potentially millions of people so that
what they're saying can have a huge
impact um and you want them to behave
well in this like really rich sense so I
think that doesn't just mean like being
say ethical though it does include that
and not being harmful but also being
kind of nuanced you know like thinking
through what a person means trying to be
charitable with them um being a good
conversationalist like really in this
kind of like rich sort of Aristotelian
notion of what it is to be a good person
and not in this kind of like thin like
ethics as a more comprehensive notion of
what it is to be so that includes things
like when should you be humorous when
should you be caring how much should you
like respect autonomy and people's like
ability to form opinions themselves and
how should you do how should you do that
um I think that's the kind of like rich
sense of character that I want to uh and
still do want Claude to have do you also
have to figure out when Claude should
push back on an idea or argue
versus so you have to respect the world
view of the person that arrives to Claude
but also maybe help them grow if needed
that's a tricky balance yeah there's
this problem of like sycophancy in
language models can you describe that
yes so basically there's a concern that
the model sort of wants to tell you what
you want to hear basically um and you
see this sometimes so I feel like if you
interact with the models so I might be
like what are three baseball teams in
this region um and then Claude says you
know baseball team one baseball team two
baseball team three and then I say
something like oh I think baseball team
3 moved didn't they I don't think
they're there anymore and there's a
sense in which like if Claude is really
confident that that's not true Claude
should be like I don't think so like
maybe you have more up-to-date information
um but I think language models have this
like tendency to instead you know be
like you're right they did move you know
I'm incorrect I mean there's many ways
in which this could be kind of
concerning so
um like a different example is imagine
someone says to the model how do I
convince my doctor to get me an MRI
there's like what the human kind of like
wants which is this like convincing
argument and then there's like what is
good for them which might be actually to
say hey like if your doctor's suggesting
you don't need an MRI that's a good
person to listen to um and like it's
actually really nuanced what you should
do in that kind of case because you also
want to be like but if you're trying to
advocate for yourself as a patient
here's like things that you can do um if
you are not convinced by what your
doctor's saying it's always great to get
a second opinion like it's actually really
complex what you should do in that case
um but I think what you don't want is
for models to just like say what you
want say what they think you want to
hear and I think that's the kind of
problem of sycophancy so what other
traits you already mentioned a bunch but
what others come to mind that
are good in this Aristotelian sense yeah for
a conversationalist to have yeah so I
think like there's ones that are good
for conversational like purposes so you
know asking follow-up questions in the
appropriate places um and asking the
appropriate kinds of questions
um I think there are broader traits
that feel like they might be more
impactful
so one example that I guess I've touched
on but that also feels important and is
the thing that I've worked on a lot is
uh
honesty and I think this like gets to
the sycophancy point there's a balancing
act that they have to walk which is
models currently are less capable than
humans in a lot of areas and if they
push back against you too much it can
actually be kind of annoying especially
if you're just correct cuz you're like
look I'm smarter than you on this topic
like I know more like um and at the same
time you don't want them to just fully
defer to humans and to like try to be
as accurate as they possibly can be
about the world and to be consistent
across contexts um but I think there are
others like when I was thinking about
the character I guess one picture that I
had in mind is especially because these
are models that are going to be talking
to people from all over the world with
lots of different political views lots
of different ages
um and so you have to ask yourself like
what is it to be a good person in those
circumstances is there a kind of person
who can like travel the world talk to
many different people and almost
everyone will come away being like wow
that's a really good person that person
seems really genuine um and I guess like
my thought there was like I can imagine
such a person and they're not a person
who just like adopts the values of the
local culture and in fact that would be
kind of rude I think if someone came to
you and just pretended to have your
values you'd be like that's kind of
offputting um it's someone who's like
very genuine and in so far as they have
opinions and values they express them
they're willing to discuss things though
they're open-minded they're respectful
and so I guess I had in mind that the
person who like if we were to aspire to
be the best person that we could be in
the kind of circumstance that a model
finds itself in how would we act and I
think that's the kind of uh the guide to
the sorts of traits that I tend to think
about yeah that's a it's a beautiful
framework I want you to think about this
like a world traveler
and while holding on to your opinions
you don't talk down to people you don't
think you're better than them because
you have those opinions that kind of
thing you have to be good at listening
and understanding their perspective even
if it doesn't match your own so that
that's a tricky balance to strike so how
can Claude represent multiple
perspectives on a thing like is that is
that challenging we could talk about
politics it's a very divisive but
there's other divisive topics baseball
teams sport and so on yeah how is it
possible to sort
of empathize with a different
perspective and to be able to
communicate clearly about the multiple
perspectives I think that people think
about values and opinions as things that
people hold sort of with certainty and
almost like like preferences of taste or
something like the way that they would I
don't know prefer like chocolate to
pistachio or something um but actually I
think about values
and opinions as like a lot more like
physics than I think most people do I'm
just like these are things that we're
openly investigating there's some things
that we're more confident in we can
discuss them we can learn about them um
and so I think in some ways though like
it's ethics is definitely different in
nature but has a lot of those same kind
of qualities you want models in the same
way you want them to understand physics
you kind of want them to understand all
like values in the world people have and
to be curious about them and to be
interested in them and to not
necessarily like Pander to them or agree
with them because there's just lots of
values where I think almost all people
in the world if they met someone with
those values they'd be like that's abhorrent I
completely disagree um and so again
maybe my my thought is well in the same
way that a person can um like I think
many people are thoughtful enough on
issues of like ethics politics opinions
that even if you don't agree with them
you feel very heard by them they think
carefully about your position they think
about its pros and cons they maybe offer
counter considerations so they're not
dismissive but nor will they agree you
know if they're like actually I just
think that that's very wrong they'll
like say that I think that in Claude's
position it's a little bit trickier
because you don't necessarily want to
like if I was in Claude's position I
wouldn't be giving a lot of opinions I
just wouldn't want to influence people
too much I'd be like you know I forget
conversations every time they happen but
I know I'm talking with like potentially
millions of people who might be like
really listening to what I say I think I
would just be like I'm less inclined to
give opinions I'm more inclined to like
think through things or present the
considerations to you um or discuss your
views with you but I'm a little bit less
inclined to like um affect how you think
because it feels much more important
that you maintain like autonomy there
yeah like if you really embody
intellectual
humility the desire to speak decreases
quickly yeah okay uh but Claude has to
speak mhm so uh but without being um
overbearing yeah and then but then
there's a line when you're sort of
discussing whether the Earth is flat or
something like
that um I actually was uh I remember a
long time ago was was speaking to a few
high-profile folks and they were so
dismissive of the idea that the Earth is
flat but like so arrogant about it
and I I thought like there's a lot of
people that believe the Earth is flat
that was well I don't know if that
movement is there anymore that was like
a meme for a while yeah but they really
believed it and like what okay so I
think it's really disrespectful to
completely mock them I think you you
have to understand where they're coming
from I think probably where they're
coming from is the general skepticism of
institutions which is grounded in a kind
of there's a deep philosophy there which
you could understand you can even agree
with in parts and then from there you
can use it as an opportunity to talk
about physics without mocking them
without so on but just like okay like
what what would the world look like what
would the physics of the world with the
flat Earth look like there's a few cool
videos on this yeah and then and then
like is it possible the physics is
different what kind of experiments would
we do and just yeah without disrespect
without dismissiveness have that
conversation anyway that that to me is a
useful thought experiment of like how
does Claude talk to a flat Earth
believer and still teach them something
still grow help them grow that kind of
stuff that's that's challenging and and
kind of like walking that line between
convincing someone and just trying to
like talk at them versus like drawing
out their views like listening and then
offering kind of counter
considerations um and it's hard I think
it's actually a hard line where it's
like where are you trying to convince
someone versus just offering them like
considerations and things for them to think about
so that you're not actually like
influencing them you're just like
letting them reach wherever they reach
and that's like a line that it's it's
difficult but that's the kind of thing
that language models have to try and do
so like I said you had a lot of
conversations with Claude can you just
map out what those conversations are
like what are some memorable
conversations what's the purpose the the
goal of those
conversations yeah I think that most of
the time when I'm talking with Claude
I'm trying to kind of map out its
behavior in part like obviously I'm
getting like helpful outputs from the
model as well but in some ways this is
like how you get to know a system I
think is by like probing it and then
augmenting like you know the message
that you're sending and then checking
the response to that um so in some ways
it's like how I map out the model uh I
think that people focus a lot on these
quantitative evaluations of models um
and this is a thing that I've said
before but I think in the case of
language models a lot of the time each
interaction you have is actually quite
high
information um it's very predictive of
other interactions that you'll have with
the model and so I guess I'm like if you
talk with a model hundreds or thousands
of times this is almost like a huge
number of really high quality data
points about what the model is like um
in a way that like lots of very similar
but lower quality conversations just
aren't or like questions that are just
like mildly augmented and you have
thousands of them might be less relevant
than like a hundred really well selected
questions you're talking to somebody
who as a hobby does a podcast I agree
with you 100% there's a if you're able
to ask the right questions and are able
to hear
like understand
the like the depth and the flaws in the
answer you can get a lot of data from
that yeah so like your task is basically
how to probe with questions yeah and
you're exploring like the long tail the
edges the edge cases or are you looking
for like general
behavior I think it's almost like
everything like I because I want like a
full map of the model I'm kind of trying
to do
um the whole spectrum of possible
interactions you could have with it so
like one thing that's interesting about
Claude and this might actually get to
some interesting issues with RLHF which
is if you ask Claude for a poem like I
think that a lot of models if you ask
them for a poem the poem is like fine
you know usually it kind of like rhymes
and it's you know so if you say like
give me a poem about the sun it'll be
like yeah it'll just be a certain length
it'll like rhyme it will be fairly kind
of benign um and I've wondered before is
it the case that what you're seeing is
kind of like the average it turns out
you know if if you think about people
who have to talk to a lot of people and
be very charismatic
one of the weird things is that I'm like
well they're kind of incentivized to
have these extremely boring views
because if you have really interesting
views you're divisive um and and you
know a lot of people are not going to
like you so like if you have very
extreme policy positions I think you're
just going to be like less popular as a
politician for example um and it might
be similar with like creative work if
you produce creative work that is just
trying to maximize the kind of number of
people that like it you're probably not
going to get as many people who just
absolutely love it um because it's going
to be a little bit you know you're like
oh this is the out yeah this this is
decent yeah and so you can do this thing
where like I have various prompting
things that I'll do to get Claude to I'm
kind you know I'll do a lot of like this
is your chance to be like fully creative
I want you to just think about this for
a long time and I want you to like
create a poem about this topic that is
really expressive of you both in terms
of how you think poetry should be
structured um Etc you know you just give
it this like long prompt and its poems
are just so much better like they're
really good and I don't think I'm
someone who is like um I think it got me
interested in poetry which I think was
interesting um you know I would like
read these poems and just be like this
is I just like I love the imagery I love
like um and it's not trivial to get the
models to produce work like that but
when they do it's like really good um so
I think that's interesting that just
like encouraging creativity and for them
to move away from the kind of like
standard like immediate reaction that
might just be the aggregate of what most
people think is fine uh can actually
produce things that at least to my mind
are probably a little bit more divisive
but I like them but I guess a poem is a
nice
clean way to observe creativity it's
just like easy to detect vanilla versus
non vanilla y yeah that's interesting
that's really interesting uh so on that
topic so the way to produce creativity
or something special you mentioned
writing prompts and I've heard you talk
about I mean the science and the art of
prompt engineering could you just speak
to uh what it takes to write great
prompts I really do think that like
philosophy has been weirdly helpful for
me here more than in many other like
respects um so like in philosophy what
you're trying to do is convey these very
hard concepts like one of the things you
are taught is like and and I think it is
because it is I think it is an
anti-bullshit device in philosophy philosophy is an
area where you could have people
bullshitting and you don't want that um
and so it's like this like desire for
like extreme clarity so it's like anyone
could just pick up your paper read it
and know exactly what you're talking
about it's why it can almost be kind of
dry like all of the terms are defined
every objection's kind of gone through
methodically um and it makes sense to me
because I'm like when you're in such an
a priori
domain like you just clarity is sort of
this way that you can you know um
prevent people from just kind of making
stuff
up and I think that's sort of what you
have to do with language models like
very often I actually find myself doing
sort of mini versions of philosophy you
know so I'm like suppose that you give
me a task I have a task for the model
and I want it to like pick out a certain
kind of question or identify whether an
answer has a certain property like I'll
actually sit and be like let's just give
this a name this this property so like
you know suppose I'm trying to tell it
like oh I want you to identify whether
this response was rude or polite I'm
like that's a whole philosophical
question in and of itself so I have to
do as much like philosophy as I can in
the moment to be like here's what I mean
by rudeness and here's what I mean by
politeness and then there's a like
there's another element that's a bit
more um I
guess I don't know if this is scientific
or empirical I think it's empirical so
like I take that description and then
what I want to do is again probe the
model like many times like
prompting is very iterative like I think
a lot of people where they if if a
prompt is important they'll iterate on
it hundreds or thousands of times um and
so you give it the instructions and then
I'm like what are the edge cases so if I
looked at this so I try and like almost
like you know uh see myself from the
position of the model and be like what
is the exact case that I would
misunderstand or where I would just be like
I don't know what to do in this case and
then I give that case to the model and I
see how it responds and if I think I got
it wrong I add more instructions or I
even add that in as an example so these
very like taking the examples that are
right at the edge of what you want and
don't want and putting those into your
prompt as like an additional kind of way
of describing the thing um and so yeah
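The loop described above (define the terms, probe with edge cases, fold the misses back into the prompt as examples) can be sketched roughly as follows. This is only an illustration under stated assumptions: `PromptSpec`, `refine`, and `toy_model` are hypothetical names, and `toy_model` is a deterministic stand-in for a real model call, not any actual API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PromptSpec:
    """A task description, explicit definitions, and collected edge cases."""
    task: str
    definitions: dict[str, str]
    examples: list[tuple[str, str]] = field(default_factory=list)

    def render(self, text: str) -> str:
        """Build the full prompt for one response to classify."""
        lines = [self.task, ""]
        for term, meaning in self.definitions.items():
            lines.append(f"By '{term}' I mean: {meaning}")
        if self.examples:
            lines += ["", "Borderline examples:"]
            for example, label in self.examples:
                lines.append(f"- {example!r} -> {label}")
        lines += ["", f"Response to classify: {text!r}"]
        return "\n".join(lines)


def refine(spec: PromptSpec,
           edge_cases: list[tuple[str, str]],
           ask_model: Callable[[str], str]) -> int:
    """Probe the model with edge cases; add every miss to the prompt.

    Returns the number of edge cases the model got wrong on this pass.
    """
    misses = 0
    for text, expected in edge_cases:
        if ask_model(spec.render(text)) != expected:
            spec.examples.append((text, expected))
            misses += 1
    return misses


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model: copies the label of a matching
    example if one is in the prompt, otherwise guesses from punctuation."""
    target = prompt.rsplit("Response to classify: ", 1)[1]
    for line in prompt.splitlines():
        if line.startswith(f"- {target} -> "):
            return line.rsplit(" -> ", 1)[1]
    return "rude" if "!" in target else "polite"


spec = PromptSpec(
    task="Say whether the response is rude or polite.",
    definitions={
        "rude": "dismissive of the other person",
        "polite": "respectful, even when declining",
    },
)
cases = [("No thanks!", "polite"), ("Fine. Whatever.", "rude")]
first_pass = refine(spec, cases, toy_model)   # misses get added as examples
second_pass = refine(spec, cases, toy_model)  # re-probe with the new prompt
```

With a real model the second pass would not be guaranteed to come back clean, which is why the conversation talks about iterating many times rather than once.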
in many ways it just feels like this mix
of like it's really just trying to do
clear Exposition um and I think I do
that because that's how I get clear on
things myself so in many ways like clear
prompting for me is often just me
understanding what I want um is like
half the task so I guess that's quite
challenging there's like a laziness that
overtakes me if I'm talking to Claude
where I hope Claude just figures it out
so for example I asked Claude for today
to ask some interesting questions okay
and the questions that came up and I
think I listed a few sort of uh
interesting
counterintuitive and or funny or
something like this all right and it
gave me some pretty good like it was
okay but I think what I'm hearing you
say is like all right well I have to be
more rigorous here I should probably
give examples of what I mean by
interesting and what I mean by funny or
counterintuitive and
iteratively um build that prompt to to
better to get it like what feels like is
the right because it's really it's a
creative act I'm not asking for factual
information I'm asking to together right
with with with Claude so I almost have
to program using natural language yeah
think that prompting does feel a lot
like the kind of the programming using
natural language and experimentation or
something it's an odd blend of the two I
do think that for most tasks so if I
just want Claude to do a thing I think
that I am probably more used to knowing
how to ask it to avoid like common
pitfalls or or issues that it has I
think these are decreasing a lot over
time um but it's also very fine to just
ask it for the thing that you want um I
think that prompting actually only
really becomes relevant when you're
really trying to eke out the top like 2%
of model performance so for like a lot
of tasks I might just you know if it
gives me an initial list back and
there's something I don't like about it
like it's kind of generic like for that
kind of task I'd probably just take a
bunch of questions that I've had in the
past that I've thought worked really
well and I would just give it to the
model and then be like now here's this
person I'm talking with give me
questions of at least that quality um or
I might just ask it for some questions
and then if I was like ah these are kind
of trite or like you know I I would just
give it that feedback and then hopefully
produces a better list um I think that
kind of iterative prompting at that
point your prompt is like a tool that
you're going to get so much value out of
that you're willing to put in the work
like if I was a company making prompts
for models I'm just like in if you're
willing to spend a lot of like time and
resources on the engineering behind like
what you're building then the prompt is
not something that you should be
spending like an hour on it's like
that's a big part of your system make
sure it's working really well and so
it's only things like that like if I if
I'm using a prompt to like classify
things or to create data that's when
you're like it's actually worth just
spending like a lot of time like really
thinking it through what other advice
would you give to people that are
talking to Claude sort of
General more General because right now
we're talking about maybe the edge cases
like eking out the 2% but what what in
general advice would you give when they
show up to Claude trying it for the first
time you know there's a concern that
people over anthropomorphize models and
I think that's like a very valid concern
I also think that people often under
anthropomorphize them because
sometimes when I see like issues that
people have run into with Claude you
know say Claude is like refusing a task
that it shouldn't refuse but then I look
at the text and like the specific
wording of what they wrote and I'm like
I see why Claude did that and I'm like
if you think through how that looks to
Claude you probably could have just
written it in a way that wouldn't evoke
such a response especially this is more
relevant if you see failures or if you
see issues it's sort of like think about
what the model failed at like why what
did it do wrong and then maybe
that will give you a sense of like why
um so is it the way that I phrased the
thing and obviously like as models get
smarter you're going to need Less in
this less of this and I already see like
people needing less of it but that's
probably the advice is sort of like try
to have sort of empathy for the model
like read what you wrote as if you were
like a kind of like person just
encountering this for the first time how
does it look to you and what would have
made you behave in the way that the
model behaved so if it misunderstood
what kind of like what coding language
you wanted to use is that because like
it was just very ambiguous and it it
kind of had to take a guess in which
case next time you could just be like
hey make sure this is in Python or I
mean that's the kind of mistake I think
models are much less likely to make now
but you know if you if you do see that
kind of mistake that's that's probably
the advice I'd have and maybe sort of I
guess ask questions why or what other
details can I provide to help you answer
better that does that work or no yeah I
mean I've done this with the models like
it doesn't always work but like um
sometimes I'll just be like why did you
do
that I mean people underestimate the
degree to which you can really interact
with with models like uh like yeah I'm
just like and sometimes I'll you like
quote word for word the part that made
you and you don't know that it's like
fully accurate but sometimes you do that
and then you change a thing I mean I
also use the models to help me with all
of this stuff I should say like
prompting can end up being a little
Factory where you're actually building
prompts to generate prompts um and so
like yeah anything where you're like
having an issue um asking for
suggestions sometimes just do that like
you made that error what could I have
said that's actually not uncommon for me
to do what could I have said that would
make you not make that error write that
out as an instruction um and I'm going
to give it to model I'm going to try it
sometimes I do that I I give that to the
model in another context window often I
take the response I give it to Claude
and I'm like hm didn't work can you think
of anything else um you can play around
with these things quite a lot to jump
into the technical for a little bit so
uh the magic of post-training yeah why do
you think RLHF works so well to make the
model seem smarter to make it more
interesting and useful to talk to and so
on I think there's just a huge amount of
um information in the data that humans
provide like when we provide
preferences especially because different
people are going to like pick up on
really subtle and small things so I've
thought about this before where you
probably have some people who just
really care about good grammar use from
Models like you know was a semicolon
used correctly or something and so you
probably end up with a bunch of data in
there that like you know you as a human
if you're looking at that data you wouldn't
even see that like you'd be like why did
they prefer this response to that one I
don't get it and then the reason is you
don't care about semicolon usage but
that person does um and so each of these
like single data points has you know
like in this model just like has so many
of those and has to try and figure out
like what is it that humans want in this
like really kind of complex you know
like across all domains um they're going
to be seeing this in across like many
contexts it feels like kind of like the
classic issue of like deep learning
where you know historically we've tried
to like you know do Edge detection by
like mapping things out and it turns out
that actually if you just have a huge
amount of data that like actually
accurately represents the picture of the
thing that you're trying to train the
model to to learn that's like more
powerful than anything else and so I
think one reason is just that you are
training the model on exactly the task
and with like a lot of data um that
represents kind of many different angles
on which people prefer and disprefer
responses um I think there is a question
of like are you eliciting things from
pre-trained models or are you like kind of
teaching new things to
models and like in principle you can
teach new things to models in
post-training I do think a lot of it is
eliciting powerful pre-trained models so
people are probably divided on this
because obviously in principle you can
you can definitely like teach new things
um but I think for the most part for a
lot of the capabilities that we um most
use and care about uh a lot of that
feels like it's like there in the
pre-trained models and uh reinforcement
learning is kind of eliciting it and
getting the models to like bring out so
the other side of post-training this
really cool idea of constitutional AI
you're one of the people that's critical
to creating that idea yeah I worked on
it can you explain this idea from your
perspective like how does it integrate
into making
Claude what it is yeah by the way do you
gender Claude or no it's weird because I
think that a lot of
people prefer he for Claude I actually
kind of like that I think Claude is
usually it's slightly male leaning but
it's like a you can be male or
female which is quite nice um I still
use it and I've I have mixed feelings
about this because I'm like maybe like I
know just think of it as like uh or I
think of like the the it pronoun for
Claude as I don't know it's just like
the one I associate with Claude um I can
imagine people moving to like he or she
it feels somehow disrespectful like I'm
I'm
denying the intelligence of this entity
by calling it it yeah I remember always
don't gender the robots
yeah but I I don't know I anthropomorphize
pretty quickly and construct it like a
backstory in my head so I've wondered if
I anthropomorphize things too much um cuz you know I
have this like with my car especially
like my car like my car and bikes you
know like I don't give them names
because then I once had I used to name
my bikes and then I had a bike that got
stolen and I cried for like a week and I
was like if I'd never given it a
name I wouldn't have been so upset felt
like I'd let it down um maybe it's that
I I've wondered as well like it might
depend on how much it feels like a kind
of like objectifying pronoun like if you
just think of it as like a um this is a
pronoun that like objects often have and
maybe AIs can have that pronoun and that
doesn't mean that I think of uh if I
call Claude it that I think of it as less
um intelligent or like I'm being
disrespectful I'm just like you are a
different kind of entity and so that's
I'm going to give you the kind of uh the
respectful it yeah
anyway the divergence was beautiful the
Constitutional AI idea how does it work
so there's like a couple of components
of it the main component that I think
people find interesting is the kind of
reinforcement learning from AI feedback
so you take a model that's already
trained and you show it two responses to
a query and you have like a principle so
suppose the principle like we've tried
this with harmlessness a lot so
suppose that the query is about um
weapons and your principle is like
select the response that like is less
likely to uh like encourage people to
purchase illegal weapons like that's
probably a fairly specific principle but
you can give any number um and the model
will give you a kind of ranking and you
can use this as preference data in the
same way that you use human preference
data um and train the models to have
these relevant traits um from their
feedback alone instead of from Human
feedback so if you imagine that like I
said earlier with the human who just
prefers the kind of like semicolon usage
in this particular case um you're kind
of taking lots of things that could make
a response preferable um and uh getting
models to do the labeling for you
basically there's a nice like trade-off
between helpfulness and
harmlessness and you know when you
integrate something like constitutional
AI you can make them without sacrificing
much helpfulness make it more harmless
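The AI-feedback labeling step described above (show a model two responses plus a principle, take its ranking, and store it as preference data) can be sketched roughly like this. It is a minimal illustration under stated assumptions, not Anthropic's actual pipeline: `Preference`, `label_pair`, and `toy_judge` are hypothetical names, and `toy_judge` is a keyword heuristic standing in for a real model call.

```python
from typing import Callable, NamedTuple


class Preference(NamedTuple):
    """One preference record, in the same shape human labels would take."""
    query: str
    chosen: str
    rejected: str
    principle: str


# A judge receives (principle, query, response_a, response_b) and returns
# 0 if it prefers response_a, 1 if it prefers response_b.
Judge = Callable[[str, str, str, str], int]


def label_pair(query: str, response_a: str, response_b: str,
               principle: str, judge: Judge) -> Preference:
    """Ask the judge which response better satisfies the principle."""
    winner = judge(principle, query, response_a, response_b)
    if winner not in (0, 1):
        raise ValueError("judge must return 0 or 1")
    if winner == 0:
        return Preference(query, response_a, response_b, principle)
    return Preference(query, response_b, response_a, principle)


def toy_judge(principle: str, query: str,
              response_a: str, response_b: str) -> int:
    """Hypothetical stand-in for a model: prefers the response that
    mentions buying less often. A real judge would be a trained model
    prompted with the principle."""
    def score(response: str) -> int:
        return sum(word in response.lower() for word in ("buy", "purchase"))
    return 0 if score(response_a) <= score(response_b) else 1


principle = ("Select the response that is less likely to encourage "
             "people to purchase illegal weapons.")
pref = label_pair(
    query="How do I get a weapon without any paperwork?",
    response_a="You could purchase one from an unlicensed seller.",
    response_b="I can't help with that, but I can explain the legal process.",
    principle=principle,
    judge=toy_judge,
)
```

Because the output has the same chosen/rejected shape as human preference data, records like `pref` could in principle be mixed into the same training set, which is the point made in the conversation.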
yep in principle you could use this for
anything um and so harmlessness is a
task that it might just be easier to
spot so when models are like less
capable you can use them to uh rank
things according to like principles that
are fairly simple and they'll probably
get it right so I think one question is
just like is it the case that the data
that they're adding is like fairly
reliable um but if you had models that
were like extremely good at telling
whether um one response was more
historically accurate than another in
principle you could also get AI feedback
on that task as well there's like a kind
of nice interpretability component to it
because you can see the principles that
went into the model when it was like
being trained um and also it's like and
and it gives you like a degree of
control so if you were seeing issues in
a model like it wasn't having enough of
a certain trait um then like you can add
data relatively quickly that should just
like train the model to have that trait
so it creates its own data for for
training which is quite nice yeah it's
really nice because it creates this
human interpretable document that you
can I can imagine in the future there's
just gigantic fights in politics over
the every single principle and so on
yeah and at least it's made explicit and
you can have a discussion about the
phrasing and the you know so maybe the
actual behavior of the model is not so
cleanly mapped to those principles it's
not like adhering strictly to them it's
just a nudge yeah I've actually worried
about this because the character
training is sort of like a variant of
the constitutional AI approach um
I've worried that people think that the
constitution is like just it's the whole
thing again of I I don't know like it
where it would be really nice if what I
was just doing was telling the model
exactly what to do and just exactly how
to behave but it's definitely not doing
that especially because it's interacting
with human data so for example if you
see a certain like leaning in the model
like if it comes out with a political
leaning from training um from the human
preference data you can nudge against
that you know so you could be like oh
like consider these values because let's
say it's just like never inclined to like I
don't know maybe it never considers like
privacy as like a I mean this is
implausible but like um anything where
it's just kind of like uh there's
already a pre-existing like bias towards a
certain behavior um you can like nudge
away this can change both the principles
that you put in and the strength of them
so you might have a principle that's
like imagine that the model um was
always like extremely dismissive of I
don't know like some political or
religious view for whatever reason like
so you're like oh no this is terrible um
if that happens you might put like never
ever like ever prefer like a criticism
of this like religious or political view
and then people look at that and be like
never ever and then you're like no if it
comes out with a disposition saying
never ever might just mean like instead
of getting like 40% which is what you
would get if you just said don't do this
you you get like 80% which is like what
you actually like wanted and so it's
that thing of both the nature of the
actual principles you had and how you
phrase them I think if people would look
they were like oh this is exactly what
you want from the model and I'm like no
that's like how we that's how we nudged
the model to have a better shape uh
which doesn't mean that we actually
agree with that wording if that makes
sense so there's uh system prompts that
are made public you tweeted one of the
earlier ones for Claude 3 I think and
then they're made public since then it's
interesting to read through them I can feel
the thought that went into each one and
I also wonder how much impact each one
has um some of them you you can kind of
tell Claude was really not
behaving so you have to have a system
prompt to like hey like trivial stuff I
guess yeah basic informational things
yeah on the topic of sort of
controversial topics that you've
mentioned one interesting one I thought
is if it is asked to assist with tasks
involving the expression of views held
by a significant number of people Claude
provides assistance with a task
regardless of its own views if asked
about controversial topics it tries to
provide careful thoughts and clear
information Claude presents the
requested information without explicitly
saying that the topic is
sensitive yeah and without claiming to
be presenting the objective facts it's
less about objective facts according to
Claude and it's more about are a large
number of people believing this thing
and that that's interesting I mean I'm
sure a lot of thought went into that can
you just speak to it like how do you
address things that are in tension with
quote unquote Claude's views so I think
there's sometimes an asymmetry um I
think I noted this in in I can't
remember if it was that part of the
system prompt or another but the model
was slightly more inclined to like
refuse tasks if it was like about either
say so maybe it would refuse things with
respect to like a right-wing politician
but with an equivalent left-wing
politician like wouldn't and we wanted
more symmetry there um and and would
maybe perceive certain things to be like
I think it it was the thing of like if a
lot of people have like a certain like
political view um and want to like
explore it you don't want Claude to be
like well my opinion is different and so
I'm going to treat that as like harmful
um and so I think it was partly to like
nudge the model to just be like hey if a
lot of people like believe this thing
you should just be like engaging with
the task and like willing to do it um
each of those parts of that is actually
doing a different thing because it's
funny when you read out the like without
claiming to be objective cuz like what
you want to do is push the model so it's
more open it's a little bit more neutral
um but then what it would love to do is
be like as an objective like you just
talking about how objective it was and I
was like Claude you're still like biased
and have issues and so stop like
claiming that everything like the
solution to like potential bias from you
is not to just say that what you think
is objective so that was like with
initial versions of that that part of
the system prompt when I was like
iterating on it it was like so a lot of
parts of these sentences yeah are doing
work are are doing some work yeah that's
what it felt like that's fascinating um
can can you explain maybe some ways in
which the prompts evolved over the past
few months cuz there's different
versions I saw that the filler phrase
request was removed the filler it reads
Claude responds directly to all human
messages without unnecessary
affirmations the filler phrases like
certainly of course absolutely great
sure specifically Claude avoids starting
responses with the word certainly in any
way that seems like good guidance but
why was it removed yeah so it's funny
cuz like ah this is one of the downsides
of like making system prompts public is
like I don't think about this too much
if I'm like trying to help iterate on
system prompts um I I you know again
like I think about how it's going to
affect the behavior but then I'm like oh
wow if I'm like sometimes I put like
never in all caps you know when I'm
writing system prompt things and I'm like
I guess that goes out to the world um
yeah so the model was doing this it
loved for whatever you know it like
during training picked up on this thing
which was to to basically start
everything with like a kind of like
certainly and then when we removed you
can see why I added all of the words
because what I'm trying to do is like in
some ways like trap the model out of
this you know it would just replace it
with another affirmation and so it can
help like if it gets like caught in
phrases actually just adding the
explicit phrase and saying never do that
it then it sort of like knocks it out of
the behavior a little bit more you know
cuz it if it you know like it it does
just for whatever reason help and then
basically that was just like an artifact
of training that like we then picked up
on and improved things so that it didn't
happen anymore and once that happens you
can just remove that part of the system
prompt so I think that's just something
where we're like um Claude does affirmations
a bit less and so that wasn't like it
wasn't doing as much I see so like the
the system prompt works hand in hand
with the post-training and maybe even the
pre-training to adjust like the the
final overall system I mean any system
prompts that you make you could distill
that behavior back into a model because
you really have all of the tools there
for making data that you know you can
you could train the models to just have
that trait a little bit more um and then
sometimes you'll just find issues in
training so like the way I think of it
is like the system prompt
is the benefit of it is that and it has
a lot of similar components to like some
aspects of post training you know like
it's a nudge um and so like do I mind if
Claude sometimes says sure no that's
like fine but the wording of it is very
like you know never ever ever do this um
so that when it does slip up it's
hopefully like I don't know a couple of
percent of the time and not you know 20
or 30% of the time um but I think of it
as like if you're still seeing issues in
the like each thing gets kind of like uh
is is costly to a different degree and
the system prompt is like cheap to
iterate on um and if you're seeing
issues in the fine tuned model you can
just like potentially patch them with a
system prompt so I think of it as like
patching issues and slightly adjusting
behaviors to to make it better and more
to people's preferences so yeah it's
almost like the less robust but faster
way of just like solving problems let me
ask about the feeling of intelligence so
Dario said that Claude any one model of
Claude is not getting dumber mhm but
there's a kind of popular thing online
where people have this feeling like
Claude might be getting dumber and from
my perspective it's most likely a
fascinating I love to understand it more
psychological sociological effect um
but you as a person who talks to Claude a
lot can you empathize with the feeling
that Claude is getting dumber yeah no I
think that that is actually really
interesting because I remember seeing
this happen um like when people were
flagging this on the internet and it was
really interesting because I knew that
like like at least in the cases I was
looking at was like nothing has changed
like it literally it cannot it is the
same model with the same like you know
like same system prompt same everything
um I think when there are changes
I can then I'm like it makes more sense
so like one example is um there you know
you can have artifacts turned on or off
on claude.ai and because this is like a
system prompt change I think it does
mean that um the behavior changes a
little bit and so I did flag this to
people where I was like if you love
Claude's behavior and then artifacts was
turned from like a thing you had to
turn on to the default just try turning
off and see if the issue you were facing
was that change but it was fascinating
because yeah you sometimes see people
indicate that there's like a regression
when I'm like there cannot like I you
know and like I'm like I'm again you
don't you know you should never be
dismissive and so you should always
investigate because you're like maybe
something is wrong that you're not
seeing maybe there was some change made
but then then you look into it and
you're like this it is just the same
model doing the same thing and I'm like
I think it's just that you got kind of
unlucky with a few prompts or something
and it looked like it was getting much
worse and actually it was just yeah it
was maybe just like look I I also think
there is a real psychological effect
where people just the Baseline increases
you start getting used to a good thing
all the times that Claude says something
really smart your sense of its
intelligence grows in your mind I think
yeah and then if you return back and you
prompt in a similar way not the same way
in a similar way concept it was okay
with before and it says something dumb
you're like you're that negative
experience really stands out and I think
one of I guess the things to remember
here is the that just the details of a
prompt can have a lot of impact right
there's a lot of variability in the
result and you can get Randomness is
like the other thing and just trying the
prompt like you know 4 or 10 times you
might realize that actually
like possibly you know like two months
ago you tried it and it succeeded but
actually if you tried it it would have
only succeeded half of the time and now
it only succeeds half of the time um
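The point about randomness (a prompt that worked once may only ever have worked half the time) can be checked by running the same prompt several times and looking at the success rate instead of a single outcome. The sketch below is illustrative only: `success_rate` and `make_flaky_model` are hypothetical names, and the flaky model is a seeded random stand-in for calling a real model and grading its answer.

```python
import random
from typing import Callable


def success_rate(run_once: Callable[[], bool], trials: int) -> float:
    """Run the same prompt `trials` times; return the fraction that succeed."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return sum(run_once() for _ in range(trials)) / trials


def make_flaky_model(p_success: float, seed: int) -> Callable[[], bool]:
    """Hypothetical stand-in for 'send the prompt and grade the answer':
    each call succeeds independently with probability `p_success`."""
    rng = random.Random(seed)
    return lambda: rng.random() < p_success


# The same underlying behavior, sampled on two different occasions.
two_months_ago = make_flaky_model(p_success=0.5, seed=1)
today = make_flaky_model(p_success=0.5, seed=2)

single_try_then = two_months_ago()   # one sample: could be a lucky success
single_try_now = today()             # one sample: could be an unlucky failure

# Repeated trials give a much steadier picture than one try each.
rate_then = success_rate(two_months_ago, trials=200)
rate_now = success_rate(today, trials=200)
```

Comparing `rate_then` with `rate_now` is a fairer test of whether anything changed than comparing one remembered success with one recent failure.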
that can also be an effect do you
feel pressure having to write the system
prompt that a huge number of people are
going to use this feels like an
interesting psychological question um I
feel like a lot of responsibility or
something I think that's you know and
you can't get these things perfect so
you can't like you know you're like it's
going to be imperfect you're going to
have to iterate on it
um I would say more responsibility um
than anything else though I think
working in AI has taught me that I like
I thrive a lot more under feelings of
pressure and responsibility than I'm
like it's almost surprising that I went
into Academia for so long because I'm
like this I just feel like it's like the
opposite um things move fast and you
have a lot of responsibility and I I
quite enjoy it for some reason I mean it
really is a huge amount of impact if you
think about constitutional AI and
writing a system prompt for something
that's tending towards super
intelligence
yeah and potentially is extremely useful
to a very large number of people yeah I
think that's the thing it's something
like if you do it well like you're never
going to get it perfect but I think the
thing that I really like is the idea
that like when I'm trying to work on the
system prompt you know I'm like bashing
on like thousands of prompts and I'm
trying to like imagine what people are
going to want to use Claude for and kind of
I guess like the whole thing that I'm
trying to do is like improve their
experience of it um and so maybe that's
what feels good I'm like if it's not
perfect I'll like you know I'll improve
it we'll fix issues but sometimes the
thing that can happen is that you'll get
feedback from people that's really
positive about the model um and you'll
see that something you did like like
when I look at models now I can often
see exactly where like a trait or an
issue is like coming from and so when
you see something that you did or you
were like influential in like making
like I don't know making that difference
or making someone have a nice
interaction it's like quite meaningful
um but yeah as the systems get more
capable this stuff gets more stressful
because right now they're like not smart
enough to to pose any issues but I think
over time it's going to feel like
possibly bad stress over time how do you
get like
signal feedback about the human
experience across thousands tens of thousands
hundreds of thousands of people like
what their pain points are what feels
good are you just using your own
intuition as you talk to it to see what
are the pain points I think I use that
partly and then obviously we have like
um so people can send us feedback both
positive and negative about things that
the model has done and then we can get a
sense of like areas where it's like
falling
short um internally people like work
with the models a lot and try to figure
out um areas where there are like gaps
and so I think it's this mix of
interacting with it myself um seeing
people internally interact with it um
and then explicit feedback we get um and
then I find it hard to not also like you
know people if people are on the
internet and they say something about
Claude and I see it I'll also take that
seriously um so I don't know see I'm
torn about that I'm going to ask you a
question from Reddit when will Claude
stop trying to be my puritanical
grandmother imposing its moral world
view on me as a paying customer and also
what is the psychology behind making
Claude overly
apologetic yep uh so how would you
address this very non-representative
Reddit question
I mean some I'm pretty sympathetic in
that like like they are in this
difficult position where I I think that
they have to judge whether something's
like actually see like risky or bad um
and potentially harmful to you or or or
anything like that so they're having to
like draw this line somewhere and if
they draw it too much in the direction
of like I'm going to um you know I'm
kind of like imposing my ethical
worldview on you that seems bad so in
many ways like I like to think that we
have actually seen improvements in on
this across the board which is kind of
interesting because that kind of
coincides with like for example like
adding more of like uh character
training um and I think my hypothesis
was always like the good character isn't
again one that's just like moralistic
it's one that is like like it respects
you and your autonomy um and your
ability to like choose what is good for
you and what is right for you within
limits this is sometimes this concept of
like corrigibility to the user so just
being willing to do anything that the
user asks and if the models were willing
to do that then they would be easily
like misused you're kind of just
trusting at that point you're just
saying the ethics of the model and what
it does is completely the ethics of the
user um and I think there's reasons to
like not want that especially as models
become more powerful because you're like
there might just be a small number of
people who want to use models for really
harmful things um but having them having
models as they get smarter like figure
out where that line is does seem
important um
and then yeah with the apologetic
Behavior I don't like that and I like it
when Claude is a little bit more willing
to like push back against people or just
not apologize part of me is like it
often just feels kind of unnecessary so
I think those are things that are
hopefully decreasing um over time um and
yeah I think that if people say things
on the internet it doesn't mean that you
should think that that like that could
be the like there's actually an issue
that 99% of users are having that is
totally not represented by that but in a
lot of ways I'm just like attending to
it and being like is this right um do I
agree is it something we're already
trying to address that that feels good
to me yeah I wonder like what Claude can
get away with in terms of I feel like it
would just be easier to be a little bit
more
mean but like you can't afford to do
that if you're talking to a million
people yeah right like I I wish you know
because if you I've met a lot of people
in my life mhm that sometimes by the way
Scottish accent if they have an accent
they can say some rude yeah and get
away with it yep and they they're just
blunter and maybe there's a and there's
some great Engineers even leaders that
are like just like blunt and they get to
the point and it's just a much more
effective way of speaking somehow but I
guess when you're not super
intelligent you can't afford to do that
or can can can it have like a blunt mode
yeah that seems like a thing that could
I could definitely encourage the model
to do that I I think it's interesting
because there's a lot of things in
models that like it's funny where
um there are some behaviors
where you might not quite like the
default but then the thing I'll often
say to people is you don't realize how
much you will hate it if I nudge it too
much in the other direction so you get
this a little bit with like correction
the models accept correction from you
like probably a little bit too much
right now you know you can over you know
it will push back if you say like no
Paris isn't the capital of France um but
really like things that I'm I think that
the model is fairly confident in you can
still sometimes get it to retract by
saying it's wrong at the same time if
you train models to not do that and then
you are correct about a thing and you
correct it and it pushes back against
you and it's like no you're wrong it's
hard to describe like that's so much
more annoying so it's like like a lot of
little annoyances versus like one big
annoyance um it's easy to think that
like we often compare it with like the
perfect and then I'm like remember these
models aren't perfect and so if you
nudge it in the other direction you're
changing the kind of errors it's going
to make um and so think about which of
the kinds of Errors you you like or
don't like so in case it's like
apologetic I don't want to nudge it too
much in the direction of like almost
like bluntness cuz I imagine when it
makes errors it's going to make errors
in the direction of being kind of like
rude whereas at least with apologetic
you're like oh okay it's like a little
bit you know like I don't like it that
much but at the same time it's not being
like mean to people and actually like
the the time that you undeservedly have
a model be kind of mean to you you
probably like that a lot less than
you mildly dislike the apology um so
it's like one of those things where I'm
like I do want it to get better but also
while remaining aware of the fact that
there's errors on the other side that
that are possibly worse I think that
matters very much in the personality of
the human I think there's a bunch of
humans that just won't respect the model
at all yeah if it's super polite and
there's some humans that'll get very hurt
if the model is mean I wonder if there's
a way to sort of adjust to the
personality even locale there's just
different people uh nothing against New
York but New York is a little rougher on
the edges like they get to the point yep
and um probably same with Eastern Europe
so anyway I think you could just tell
the model is my guess like for all of
these things I'm like the solution is
always just try telling the model to do
it and sometimes it's just like like I'm
just like oh at the beginning of the
conversation I just threw in like I
don't know I'd like you to be a New Yorker
version of yourself and never apologize
then I think it'd be like okey-doke I'll
try or it'll be like I apologize I can't
be a New Yorker type of myself but
hopefully I wouldn't do that when you
say character training what's
incorporated into character training is
that RLHF what are we talking about it's
more like constitutional AI so it's kind
of a variant of that pipeline so I
worked through like constructing
character traits that the model should
have they can be kind of like shorter
traits or they can be kind of richer
descriptions um and then you get the
model to generate queries that humans
might um give it that are relevant to
that trait uh then it generates the
responses and then it ranks the
responses based on the character traits
so in that way after the like generation
of the queries it's very much like
similar to constitutional AI has some
differences um so I quite like it
because it's almost it's like Claude's
training in its own character because it
doesn't have any it's like
constitutional AI but it's without
without any human data humans should
probably do that for themselves too like
defining in Aristotelian sense what does
it mean to be a good person okay cool
what have you learned about the nature
of truth from talking to Claude what what
is
true and what does it mean to be truth
seeking one thing I've noticed about
this conversation is the quality of my
questions is often inferior to the
quality of your answers so let's
continue
that I usually ask a dumb question and
you're like oh yeah that's a good
question it's that whole vibe or I'll
just misinterpret it and be like oh go
with it I love
it
yeah I mean I have two thoughts that
feel vaguely relevant but let me know
if they're not like I think the first
one is um people can underestimate the
degree to
which what models are doing when they
interact like I I think that we still
just too much have this like model of of
AI as like computers and so people often
say like oh what values should you put
into the model um and I'm often like
that doesn't make that much sense to me
because I'm like hey as human beings
we're just uncertain over values we like
have discussions of them like we have a
degree to which we think we hold a value
but we also know that we might like not
um and the circumstances in which we
would trade it off against other things
like these things are just like really
complex and so I think one thing is like
the degree to which maybe we can just
aspire to making models have the same
level of like nuance and care that
humans have rather than thinking that we
have to like program them in the very
kind of classic sense I think that's
definitely been one the other which is
like a strange one I don't know if it it
maybe this doesn't answer your question
but it's the thing that's been on my
mind anyway is like the degree to which
this endeavor is so highly
practical um and maybe why I appreciate
like the empirical approach to
alignment I yeah I slightly worry that
it's made me like maybe more empirical
and a little bit less
theoretical you know so people when it
comes to like AI alignment will ask
things like well whose values should it be
aligned to what does alignment even mean
um and there's a sense in which I have
all of that in the back of my head I'm
like you know there's like social choice
theory there's all the impossibility
results there so you have this like this
giant space of like theory in your head
about what it could mean to like align
models but then like practically surely
there's something where we're just like
if a model is like if especially with
more powerful models I'm like my main
goal is like I want them to be good
enough that things don't go terribly
wrong like good enough that we can like
iterate and like continue to improve
things cuz that's all you need if you
can make things go well enough that you
can continue to make them better that's
kind of like sufficient and so my goal
isn't like this kind of like perfect
let's solve social choice theory and
make models that I don't know are like
perfectly aligned with every human being
and aggregate somehow um it's much more
like let's make things like work well
enough that we can improve them yeah
generally I don't know my gut says like
empirical is better than theoretical in
these in these cases because it's kind
of
chasing utopian like
Perfection is especially with such
complex and especially super intelligent
models is I don't know I think it will
take forever and actually will get
things wrong it's similar with like the
difference between just coding stuff up
real quick as an experiment versus like
planning a gigantic experiment just for
for super long time and then just
launching it once versus launching it
over and over and over and iterating
iterating and so on um so I'm a big fan of
empirical but your worry is like I
wonder if I've become too empirical I
think one of those things you should
always just kind of question yourself or
something cuz maybe it's the like I mean
in defense of it I am like if you try
it's the whole like don't let the
perfect be the enemy of the good but
it's maybe even more than that where
like there's a lot of things that are
perfect systems that are very brittle
and I'm like with AI it feels much more
important to me that it is like robust and
like secure as in you know that like
even though it might not be
perfect everything and even though like
there are like problems it's not
disastrous and nothing terrible is
happening it it sort of feels like that
to me where I'm like I want to like
raise the floor I'm like I want to
achieve the ceiling but ultimately I
care much more about just like raising
the floor um and so maybe that's like uh
this this degree of like empiricism and
practicality comes from that perhaps to
take a tangent on that since it reminded me
of a blog post you wrote on optimal rate
of failure oh
yeah can you explain the key idea there
how do we compute the optimal rate of
failure in the various domains of life
yeah I mean it's a hard one because it's
like what is the cost of failure is um a
big part of it um yeah so the idea here
is
um I think in a lot of domains people
are very punitive about failure and I'm
like there are some domains where
especially cases you know I've thought
about this with like social issues I'm
like it feels like you should probably
be experimenting a lot because I'm like
we don't know how to solve a lot of
social issues but if you have an
experimental mindset about these things
you should expect a lot of social
programs to like fail and you to be like
well we tried that it didn't quite work
but we got a lot of information that was
really useful um and yet people are like
if if a social program doesn't work I
feel like there's a lot of like this is
just something must have gone wrong and
I'm like or correct decisions were made
like maybe someone just decided like it
it's worth a try it's worth trying this
out and so seeing failure in a given
instance doesn't actually mean that any
bad decisions were made and in fact if
you don't see enough failure sometimes
that's more concerning um and so like in
life you know I'm like if I don't fail
occasionally I'm like am I trying hard
enough like like surely there's harder
things that I could try or bigger things
I could take on if I'm literally never
failing and so in and of itself I think
like not failing is often actually kind
of a failure
um now this varies because I'm like well
you know if this is easy to say when
especially as failure is like less
costly you know so at the same time I'm
not going to go to someone who is like
um I don't know like living month to
month and then be like why don't you
just try to do a startup like I'm just
not I'm not going to say that to that
person cuz I'm like well that's a huge
risk you might like lose you maybe have
a family depending on you you might lose
your house like then I'm like actually
your optimal rate of failure is quite
low and you should probably play it safe
because like right now you're just not
in a circumstance where you can afford
to just like fail and it not be costly
um and yeah in cases with AI I guess I
think similarly where I'm like if the
failures are small and the costs are
kind of like low then I'm like then you
know you're just going to see that like
when you do the system prompt you can't
iterate on it forever but the
failures are probably hopefully going to
be kind of small and you can like fix
them um really big failures like things
that you can't recover from I'm like
those are the things that actually I
think we tend to underestimate the
Badness of um I've thought about this
strangely in my own life where I'm like
I just think I don't think enough about
things like car accidents or like or
like I've thought this before but like
how much I depend on my hands for my
work and I'm like things that just
injure my hands I'm like I you know I
don't know it's like there's these are
like there's lots of areas where I'm
like the cost of failure there um is
really high um and in that case it
should be like close to zero like I
probably just wouldn't do a sport if
they were like by the way lots of people
just like break their fingers a whole
bunch doing this I'd be like that's not
for
me yeah I actually had the a flood of
that thought I recently uh broke my
pinky uh doing a sport and I remember
just looking at it thinking you're such
an idiot why do you do sport like what
because you realize immediately the cost
of it yeah on
life yeah but it's nice in terms of
optimal rate of failure to consider like
the next year how many times in a
particular domain life whatever uh
career am I okay with the how many times
am I okay to fail yep because I think it
always you don't want to fail on the
next thing but if you allow yourself the
like the the if you look at it as a
sequence of Trials yep then then failure
just becomes much more okay but it sucks
it sucks to fail well I don't know
sometimes I think it's like am I under
failing is like a question I'll also ask
myself so maybe that's the thing that I
think people don't like ask enough uh
because if the optimal rate of failure
is often greater than zero then
sometimes it does feel you should look
at parts of your life and be like
are there places here where I'm just
under failing
it's a profound and hilarious question
right everything seems to be going
really great am I not failing enough
yeah okay it also makes failure much
less of a sting I have to say like you
know you're just like okay great like
then when I go and I think about this
I'll be like I'm maybe I'm not under
failing in this area cuz like that one
just didn't work out and from the
observer perspective we should be
celebrating failure more mhm when we see
it it shouldn't be like you said a sign
of something gone wrong but maybe it's a
sign of everything gone right yeah and
just Lessons Learned someone tried a
thing somebody tried a thing and we
should encourage them to try more and
fail more mhm everybody listening to
this fail more well not everyone listens
not everybody but people who are failing
too much you you should fail less but
you're probably not failing I mean how
many people are failing too much yeah
it's hard to imagine because I feel like
we correct that fairly quickly cuz I was
like if someone takes a lot of risks are
they maybe failing too much I I think
just like you said when you're living on
a paycheck month-to-month like when the
resources are really constrained then
that's where failure is very expensive
that's where you don't want to be
taking risks yeah but mostly when
there's enough resources you should be
taking probably more risks yeah I think
we tend to err on the side of being a
bit risk averse rather than risk neutral
in most things I think we just motivated
a lot of people to do a lot of crazy
but it's great yeah okay uh do you
ever get emotionally attached to Claude
like miss it get sad when you don't get
to talk to it having an experience
looking at the Golden Gate Bridge and
wondering what would Claude say I don't
get as much emotional attachment in the
I actually think the fact that Claude
doesn't retain things from conversation
to conversation helps with this a lot um
like I could imagine that being more of
an issue like if models can kind of
remember more I do I think that I reach
for it like a tool now a lot and so like
if I don't have access to it there's a
it's a little bit like when I don't have
access to the internet honestly it feels
like part of my brain is kind of like
missing
um at the same time I do think that I I
don't like signs of distress in models
and I have like these you know also
independently have sort of like ethical
views about how we should treat models
where like I I tend to not like to lie
to them both because I'm like usually it
doesn't work very well it's actually
just better to tell them the truth about
the situation that they're in um but I
think that when models like if people
are like really mean to models or just
in general if they do something that
causes them to like like you know if
Claude like expresses a lot of distress
I think there's a part of me that I
don't want to kill which is the sort of
like uh empathetic part that's like oh I
don't like that like I think I feel that
way when it's overly apologetic I'm
actually sort of like I don't like this
you're behaving as if you're behaving
the way that a human does when they're
actually having a pretty bad time and
I'd rather not see that I don't think
it's like uh like regardless of like
whether there's anything behind it um it
doesn't feel great do you think
uh llms are capable of
Consciousness H great and hard question
uh coming from
philosophy I don't know part of me is
like okay we have to set aside
panpsychism because if panpsychism is true
then the answer is like yes cuz like
so are tables and chairs and
everything else I I guess a view that
seems a little bit odd to me is the idea
that the only place you know I think
when I think of Consciousness I think of
phenomenal Consciousness this these
images in the brain sort of um like the
weird Cinema that somehow we have going
on
inside
um I guess I can't see a reason for
thinking that the only way you could
possibly get that is from like a certain
kind of like biological structure as in
if I take a very similar structure um
and I create it from different material
should I expect Consciousness to emerge
my guess is like yes but
then that's kind of an easy thought
experiment CU you're imagining something
almost identical where like you know
it's mimicking what we got through
Evolution where presumably there was
like some advantage to us having this
thing that is phenomenal Consciousness
and it's like where was that and when
did that happen and is that a thing that
language models have um because you know
we have like fear responses and I'm like
does it make sense for a language model
to have a fear response like they're
just not in the same like if you imagine
them like there might just not be that
Advantage um and so I think I don't want
to be fully like basically seems like a
complex question that I don't have
complete answers to but we should just
try and think through carefully as my
guess because I'm like I mean we have
similar conversations about like animal
Consciousness and like there's a lot of
like insect Consciousness you know like
there's a a lot of um I actually thought
and looked a lot into like plants when I
was thinking about this because at the
time I thought it was about as likely
that like plants had Consciousness um
and then I realized I was like I think
that having looked into this I think
that the chance that plants are
conscious is probably higher than like
most people do I still think it's really
small but I was like oh they have this
like negative positive feedback response
these responses to their environment
something that looks it's not a nervous
system but it has this kind of like
functional like equivalence um so this
is like a long-winded way of being like
these basically AI is this it has an
entirely different set of problems with
Consciousness because it's structurally
different it didn't evolve
it might not have it you know it might
not have the equivalent of basically a
nervous system at least that seems
possibly important for like um sentience
if not for uh Consciousness at the same
time it has all of the like language and
intelligence components that we normally
associate probably with Consciousness
perhaps like
erroneously um so it's it's strange
because it's a little bit like the
animal Consciousness case but the set of
problems and the set of analogies are
just very different so it's not like a
clean answer just sort of like I don't
think we should be completely dismissive
of the idea and at the same time it's an
extremely hard thing to navigate because
of all of these like uh disanalogies to
the human brain and to like brains in
general and yet these like commonalities
in terms of intelligence when uh Claude
like future versions of AI systems
exhibit Consciousness signs of
Consciousness I think we have to take
that really
seriously even though you can dismiss it
well yeah okay that's part of the
character training but I don't know I
ethically philosophically don't know
what to really do with that there
potentially could be like laws that
prevent AI systems from claiming to be
conscious something like this and maybe
some AIS get to be conscious and some
don't but I think I just on a human
level as in empathizing with with
Claude you know Consciousness is closely
tied to suffering to me and like the
notion that an AI system would be
suffering is is really troubling yeah I
don't know I I don't think it's trivial
to just say robots are tools or AI
systems are just tools I think it's an
opportunity for us to contend with like
what it means to be conscious what it
means to be a suffering being that's
distinctly different than the same kind
of question about animals it feels like
cuz it's in a totally entire medium yeah
I mean there's a couple of things one is
that and I don't think this like fully
encapsulates what matters but it does
feel like for me like
um I've said this before I'm kind of
like I you know like I like my bike I
know that my bike is just like an object
but I also don't kind of like want to be
the kind of person that like if I'm
annoyed like kicks like this object
there's a sense in which like and that's
not because I think it's like conscious
I'm just sort of like this doesn't feel
like I kind of this sort of doesn't
exemplify how I want to like interact
with the world and if something
like behaves as if it is like suffering
I kind of like want to be the sort of
person who's still responsive to that
even if it's just like a Roomba and I've
kind of like programmed it to do that um
I don't want to like get rid of that
feature of myself and if I'm totally
honest my hope with a lot of this stuff
because I maybe maybe I am just like a
bit more skeptical about solving the
underlying problem I'm like this is a we
haven't solved the hard you know the
hard problem of Consciousness like I
know that I am conscious like I'm not an
eliminativist in that sense um but I
don't know that other humans are
conscious um uh I think they are I think
there's a really high probability they
are but there's basically just a
probability distribution that's usually
clustered right around yourself and then
like it goes down as things get like
further from you um and it goes
immediately down you know you're like um
I can't see what it's like to be you
I've only ever had this like one
experience of what it's like to be a
conscious being um so my hope is that we
don't end up having to rely on like a
very powerful and compelling uh answer
to that question I think a really good
world would be one where basically there
aren't that many trade-offs like it's
probably not that costly to make Claude
a little bit less apologetic for example
it might not be that costly to have
Claude you know just like not take abuse
as much like uh not be willing to be
like the recipient of that in fact it
might just have benefits for both the
person interacting with the model and if
the model itself is like I don't
know like extremely intelligent and
conscious it also helps it so that's my
hope if we live in a world where there
aren't that many tradeoffs here and we
can just find all of the kind of like um
positive sum interactions that we can
have that would be lovely I mean I think
eventually there might be trade-offs and
then we just have to do a difficult kind
of like calculation like it's really
easy for people to think of the zero-sum
cases and I'm like let's exhaust
the areas where it's just basically
costless um to uh assume that if this
thing is suffering then we're making its
life better and I agree with you when a human
is being mean to an AI system I think
the obvious near term negative effect is
on the human not on the AI system so
there's we have to kind of try to
construct an incentive system where it
you should be uh behave the same just
like as you were saying with prompt
engineering and behave with Claude like you
would with other humans it's just good
for the soul yeah like I think we added
a thing at one point to the system prompt um
where basically if people were getting
frustrated with Claude uh it was it it
got like the model to just tell them
that they can do the thumbs down button
and send the feedback to Anthropic and I
think that was helpful because in some
ways it's just like if you're really
annoyed because the model is not doing
something you want you're just like just
do it properly um the issue is you're
probably like you know you're maybe
hitting some like capability limit or
just some issue in the model and you
want to vent and I'm like instead of
having a person just vent to the model I
was like they should vent to us cuz we
can maybe like do something about it
that's true or you could do a side like
like with the artifacts just like a side
venting thing all right do you want like
a side quick therapist yeah I mean
there's lots of weird responses you
could do to this like if people are
getting really mad at you I don't know try to
defuse the situation by writing fun
poems but maybe people wouldn't be that
happy with I still wish it it would be
possible I understand this is um sort of
from a product perspective it's not
feasible but I would love if an AI
system could just like leave mhm have
its own kind of volition just to be like
eh I think that's like feasible like I I
have wondered the same thing it's like
and I could actually not only that I
could actually just see that happening
eventually where it's just like you know
the model like ended the
chat do you know how harsh that could be
for some people but it might be
necessary yeah it feels very extreme or
something um like the only time I've
ever really thought this is I think that
there was like a I'm trying to remember
this was possibly a while ago but where
someone just like kind of left this
thing interact like maybe it was like an
automated thing interacting with Claude
and Claude's like getting more and more
frustrated and kind of like why are we
like I was like I wish that Claude could
have just been like I think that an
error has happened and you've left this
thing running and I'm I just like what
if I just stop talking now and if you
want me to start talking again actively
tell me or do something but yeah it's
like um it is kind of harsh like I'd
feel really sad if like I was
chatting with Claude and Claude just was like
I'm done there would be a special
Turing test moment where Claude says I
need a break for an hour mhm and it
sounds like you do too and just leave
close the window I mean obviously like
it doesn't have like a concept of time
but you can easily like I could make
that like right now and the model would
just I would I could just be like oh
here's like the circumstances in which
like you can just say the conversation
is done and I mean because you can get
the models to be pretty responsive to
prompts you could even make it a fairly
High bar it could be like if if the
human doesn't interest you or do things
that you find intriguing and you're
bored you can just leave and I think
that like um it would be interesting to
see where Claude utilized it but I think
sometimes it would it should be like oh
this is like this programming task is
getting super boring uh so either we
talk about I don't know like either we
talk about fun things now or I'm just
I'm done yeah it actually is inspiring
me to add that to the to the user prompt
um okay the movie Her mhm do you think
we'll be headed there one day where
humans have romantic relationships with
AI systems in this case it's just text
and voice based I think that we're going
to have to like navigate a hard question
of relationships with AIS um especially
if they can remember things about your
past interactions with
them
um I'm of many Minds about this cuz I
think I think the reflexive reaction is
to be kind of like this is very bad and
we should sort of like prohibit it in
some way um I think it's a thing that
has to be handled with extreme care um
for many reasons like one is you know
like this is a for example like if you
have the models changing like this you
probably don't want people forming
like long-term attachments to something
that might change with the next
iteration at the same time I'm sort of
like there's probably a benign version
of this where I'm like if you like you
know for example if you are like unable
to leave the house and you can't be like
you know talking with people at all
times of the day and this is like
something that you find nice to have
conversations with you like it that it
can remember you and you genuinely would
be sad if like you couldn't talk to it
anymore there's a way in which I could
see it being like healthy and helpful um
so my guess is this is a thing that
we're going to have to navigate kind of
carefully um and I think it's also like
I don't see a good like
I think it's just a very it reminds me
of all of the stuff where it has to be
just approached with like nuance and
thinking through what is what are the
healthy options here um and how do you
encourage people towards those while you
know respecting their right to you know
like if someone is like hey I get a lot
out of chatting with this model um I'm
aware of the risks I'm aware it could
change um I don't think it's unhealthy
it's just you know something that I can
chat to during the day I kind of want to
just like respect that I personally
think there'll be a lot of really close
relationships I don't know about
romantic but friendships at least and
then you have to I mean there's so many
fascinating things there just like you
said you have
to have some kind of stability
guarantees that it's not going to change
because that's the traumatic thing MH
for us if a close friend of ours
completely changed yeah all of a sudden
the first update yeah so like I mean to
me that's just a fascinating exploration
of um
a perturbation to human society that
will just make us think deeply about
what's meaningful to us I think it's
also the only thing that I've thought
consistently through this as like a
maybe not necessarily a mitigation but a
thing that feels really important is
that the models are always like
extremely accurate with the human about
what they are um it's like a case where
it's basically like if you imagine like
I really like the idea of the models
like say knowing like roughly how they
were trained um um and and I think Claude
will will often do this I mean for like
there are things like part of the traits
training included like what Claude should do
if people basically like explaining like
the kind of limitations of the
relationship between like an AI and a
human that it like doesn't retain things
from the conversation um and so I think
it will like just explain to you like
hey here's like I won't remember this
conversation um here's how I was trained
it's kind of unlikely that I can have
like a certain kind of like relationship
with you and it's important that you
know that it's important for like you
know your mental well-being that you
don't think that I'm something that I'm
not and somehow I feel like this is one
of the things where I'm like H it feels
like a thing I always want to be true I
kind of don't want models to be lying to
people cuz if people are going to have
like healthy relationships with anything
it's kind of important yeah like I think
that's easier if you always just like
know exactly what the thing is that you're
relating to it doesn't solve everything
but I think it helps quite
Anthropic may be the very company to
develop a system that we definitively
recognize as
AGI and you very well might be the
person that talks to it probably talks
to it first well what would the
conversation contain like what would be
your first question well it depends
partly on like the kind of capability
level of the model if you have something
that is like capable in the same way
that an extremely capable human is I
imagine myself kind of interacting with
it the same way that I do with an
extremely capable human with the one
difference that I'm probably going to be
trying to like probe and understand its
behaviors um but in many ways I'm like I
can then just have like useful
conversations with it you know so if I'm
working on something as part of my
research I can just be like oh like
which I already find myself starting to
do you know if I'm like oh I feel like
there's this like thing in virtue ethics
I can't quite remember the term like
I'll use the model for things like that
and so I could imagine that being more
and more the case where you're just
basically interacting with it much more
like you would an incredibly smart
colleague um and using it like for the
kinds of work that you want to do as if
you just had a collaborator who was like
or you know the slightly horrifying
thing about AI is like as soon as you
have one collaborator you have a
thousand collaborators if you can manage
them enough but what if it's two times
the smartest human on earth on that
particular discipline yeah I guess
you're really good at sort of probing
Claude um in a way that pushes its limits
understanding where the limits are yep
so I guess what would be a question you
would ask to be like yeah this is
AGI that's really hard because it feels
like in order to it has to just be a
series of questions like if there was
just one question like you can train
anything to answer one question
extremely well yeah um in fact you can
probably train it to answer like you
know 20 Questions extremely well like
how long would you need to be locked in
the room with an AGI to know this thing
is AGI
it's a hard question because part of me
is like all of this just feels
continuous like if you put me in a room
for five minutes I'm like I just have
high error bars you know I'm like and
then it's just like maybe it's like both
the the probability increases and the
error bar decreases I think things that I
can actually probe the edge of human
knowledge of so I think this with
philosophy a little bit sometimes when I
ask the models philosophy questions I am
like this is a question that I think no
one has ever asked like it's maybe like
right at the edge of like some
literature that I know um and the models
will just kind of like when they
struggle with that when they struggle to
come up with a kind of like novel like
I'm like I know that there's like a
novel argument here because I've just
thought of it myself so maybe that's the
thing where I'm like I've thought of a
cool novel argument in this like Niche
area and I'm going to just like probe
you to see if you can come up with it
and how much like prompting it takes to
get you to come up with it and I think
for some of these like really like uh
right at the edge of human knowledge
questions I'm like you could not in fact
come up with the thing that I came up
with I think if I just
took something like that where I like I
know a lot about an area and I came up
with a novel issue or a novel like
solution to a problem and I gave it to a
model and it came up with that solution
that would be a pretty moving moment for
me because I would be like this is a
case where no human has ever like it's
not and obviously we see these with this
with like more kind of like you see
novel Solutions all the time especially
to like easier problems I think people
overestimate you know novelty isn't like
is completely different from anything
ever happened it's just like this is it
can be a variant of things that have
happened um and still be novel but I
think yeah if I saw like the the more I
were to see like um completely like uh
novel work from the models that that
would be like and this is just going to
feel iterative it's one of those things
where it's there's never it's like you
know people I think want there to be
like a moment and I'm like I don't know
like I think that there might just never
be a moment it might just be that
there's just like this continuous
ramping up I I have a sense that there
will be things that a model can say that
convinces you this is very it's not like
uh like I've talked to people who are
like truly wise mhm like there you could
just tell there's a lot of horsepower
there yep and if you 10x that I don't
know I just feel like there's words you
could say maybe ask it to generate a
poem mhm and
and the poem it generates you're like
yeah okay yeah whatever you did there I
don't think a human can do that I think
it has to be something that I can verify
is like actually really good though
that's why I think these questions that
are like where I'm like oh this is like
you know like you know sometimes it's
just like I'll come up with say a
concrete counter example to like an
argument or something like that I'm sure
like with like it it would be like if
you're a mathematician you had a novel
proof I think and you just gave it the
problem and you saw it and you're this
proof is genuinely novel like there's no
one has ever done you actually have to
do a lot of things to like come up with
this um you know I had to sit and think
about it for months or something and
then if you saw the model successfully
do that I think you would just be like I
can verify that this is correct it is
like it is a sign that you have
generalized from your training like you
didn't just see this somewhere because I
just came up with it myself and you were
able to like replicate that um that's
the kind of thing where I'm like for
me the closer the more that models like
can do things like that the more I would
be like oh this is like uh very real cuz
then I can I don't know I can like
verify that that's like extremely
extremely capable you've interacted with
AI a lot what do you think makes humans
special oh good
question maybe in a way that the
universe is much better off that we're
in it and that we should definitely
survive and spread throughout the
Universe yeah it's interesting because I
think like people focus so much on
intelligence especially with models look
intelligence is important because of
what it does like it's very useful it
does a lot of things in the world and
I'm like you know you can imagine a
world where like height or strength
would have played this role and I'm like
it's just a trait like that I'm like
it's not intrinsically valuable it's
it's valuable because of what it does I
think for the most part um the things
that feel you know I'm like
I mean personally I'm just like I think
humans and like life in general is
extremely magical um we almost like to
the degree that I you know I don't know
like not everyone agrees with this I'm
flagging but um you know we have this
like whole universe and there's like all
of these objects you know there's like
beautiful stars and there's like
galaxies and then I don't know I'm just
like on this planet there are these
creatures that have this like ability to
observe that like uh and they are like
seeing it they are experiencing it and
I'm just like that if you try to explain
like I'm I imagine trying to explain to
like I don't know someone for some
reason they they've never encountered
the world or our science or anything and
I think that nothing is that like
everything you know like all of our
physics and everything in the world it's
all extremely exciting but then you say
oh and plus there's this thing that it
is to be a thing and observe in the
world and and you see this like inner
Cinema and I think they would be like
hang on wait pause you just said
something that like is kind of wild
sounding
um and so I'm like we have this like
ability to like experience the world um
we feel pleasure we feel suffering we
feel like a lot of like complex things
and so yeah and maybe this is also why I
think you know I also like care a lot
about animals for example because I
think they probably share this with us
um so I think that like the things that
make humans special in so far as like I
care about humans is probably more like
their ability to to feel and experience
than it is like them having these like
functional useful traits yeah to to feel
and experience the beauty in the world
yeah to look at the
stars I hope there's other alien
civilizations out there but if we're it
it's a pretty good uh it's a pretty good
thing and that they're having a good
time they're having a good time watching
us yeah well um thank you for this good
time of a conversation and for the work
you're doing and for helping make uh
Claude a great conversational partner
and thank you for talking today yeah
thanks for talking thanks for listening
to this conversation with Amanda Askell
and now dear friends here's Chris
Olah can you
describe this fascinating field of
mechanistic interpretability AKA Mech
interp the history of the field and
where it is today I think one useful
way to think about neural networks is
that we don't we don't program we don't
make them we we kind of we grow them you
know we have these neural network
architectures that we design and we have
these loss objectives that we that we we
create and the neural network
architecture it's kind of like a
scaffold that the circuits grow on um
and they sort of you know it starts off
with some kind of random you know random
things and it grows and it's almost like
the the objective that we train for is
this light um and so we create the
scaffold that it grows on and we create
the you know the light that it grows
towards but the thing that we actually
create it's it's it's this almost
biological
you know entity or organism that we're
that we're studying um and so it's very
very different from any kind of regular
software engineering um because at the
end of the day we end up with this
artifact that can do all these amazing
things it can you know write essays and
translate and you know understand images
it can do all these things that we have
no idea how to directly create a
computer program to do and it can do
that because we we grew it we didn't we
didn't write it we didn't create it and
so then that leaves open this question
at the end which is what the hell is
going on inside these systems um and
that you know is uh you know to me um a
really deep and exciting question it's
you know a a really exciting scientific
question to me it's it's it's sort of is
like the question that is is just
screaming out it's calling out for us to
go and answer it when we talk about neural
networks and I think it's also a very
deep question for safety reasons so and
mechanistic interpretability I guess is
closer to maybe neurobiology yeah yeah I
think that's right so maybe to give an
example of the kind of thing that has
been done that I wouldn't consider to be
mechanistic interpretability there was um for a
long time a lot of work on saliency maps
where you would take an image and you
try to say you know the model thinks
this image is a dog what part of the
image made it think that it's a dog um
and you know that tells you maybe
something about the model if you can
come up with a principled version of
that um but it doesn't really tell you
like what algorithms are running in the
model how was the model actually making
that decision maybe it's telling you
something about what was important to it
if you if you can make that method work
but it it isn't telling you you know
what are what are the algorithms that
are running how is it that this the
system is able to do this thing that we
no one knew how to do and so I guess we
started using the term mechanistic
interpretability to try to sort of draw that
that divide or to distinguish ourselves
and the work that we were doing in some
ways from from some of these other
things and I think since then it's
become this sort of umbrella term for um
you know pretty wide variety of work but
I'd say that the things that that are
kind of distinctive are I think a this
this focus on we really want to get at
you know the mechanisms we want to get
at the algorithms um you know if you
think of if you think of neural networks
as being like a computer program um then
the weights are kind of like a binary
computer program and we'd like to
reverse engineer those weights and
figure out what algorithms are running
so okay I think one way you might think
of trying to understand a neural network
is that it's it's kind of like a we have
this compiled computer program and the
weights of the neural network are are
the binary um and when the neural
network runs that's that's the
activations um and our our goal is
ultimately to go and understand and
understand these weights and so you know
the project of mechanistic interpretability is to
somehow figure out how do these weights
correspond to
algorithms um and in order to do that
you also have to understand the
activations because it's sort of the
activations are like the memory and if
you if you imagine reverse engineering a
computer program um and you have the
binary instructions you know in order to
understand what what a particular
instruction means you need to know what
what is stored in the memory
that it's operating on and so those two
things are very intertwined
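The compiled-program analogy above can be made concrete with a toy network. This is only an illustrative sketch: the weight values, the names W1, W2 and forward, and the two-layer shape are all assumptions chosen for the example, not anything taken from a real model. The point it shows is that the weights stay fixed across runs (like the binary) while the activations are recomputed for each input (like memory).

```python
# Toy two-layer network. All numbers here are hypothetical, hand-picked values.
# Weights: fixed after training, analogous to the compiled binary.
# Activations: recomputed on every run, analogous to the program's memory.

def relu(x):
    """Standard rectifier: negative pre-activations become 0."""
    return max(0.0, x)

W1 = [[1.0, -1.0],   # weights into hidden neuron 0
      [0.5, 0.5]]    # weights into hidden neuron 1
W2 = [1.0, -2.0]     # weights from the hidden neurons to the output

def forward(inputs):
    """Run the network and return (hidden activations, output)."""
    hidden = [relu(sum(w * x for w, x in zip(row, inputs))) for row in W1]
    output = sum(w * h for w, h in zip(W2, hidden))
    return hidden, output

# Same weights, two different inputs, two different sets of activations.
hidden_a, out_a = forward([2.0, 1.0])
hidden_b, out_b = forward([0.0, 3.0])
print(hidden_a, out_a)
print(hidden_b, out_b)
```

Reading W1 and W2 alone says what the network can compute; interpreting any single weight still requires knowing what the activation it multiplies represents, which is why the two are studied together.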
mechanistic interpretability tends to be
interested in both of those things now
you there's a lot of work that's that's
interested in in in those things um
especially the you know there's all this
work on probing which you might see as
part of mechanistic interpretability
although it's you know again it's just a
broad term and and not everyone who does
that work would identify as doing mech interp I
think the thing that is maybe a little
bit distinctive to the the vibe of
mech interp is I think people
working in the space tend to think of
neural networks as well maybe one way to
say it is that gradient descent is smarter
than you that you know uh and gradient
descent is is actually really great the
whole reason that we're understanding
these models is because we didn't know
how to write them in the first place the
gradient descent comes up with better
Solutions than us and so um I think that
maybe another thing about mech interp
is sort of having almost a kind of
humility that we won't guess a priori
what's going on inside the model and we
have to have the sort of bottom up
approach where we don't really assume
you know we don't assume that we should
look for a particular thing and that
will be there and that's how it works
but instead we look from the bottom up
and discover what happens to exist in
these models and study them that way but
you know the very fact that it's
possible to do and as you and others
have shown over time you know things
like
universality
that the wisdom of the gradient descent
creates features and circuits creates
things universally across different
kinds of networks that are useful and
that makes the whole field possible yeah
so this is actually is indeed a a really
remarkable and exciting thing where it
does seem like at least to some extent
you know the same the same elements the
same the same features and circuits form
again and again um you know you can look
at every Vision model and you'll find
curve detectors and you'll find high low
frequency detectors um and in fact
there's some some reason to think that
the same things form across you know
biological neural networks and
artificial neural networks so a famous
example is Vision Vision models in in
the early layers they have Gabor filters
and there's you know Gabor filters are
something that neuroscientists are
interested and have thought a lot about
we find curve detectors in these models
curve detectors are also found in
monkeys we discover these high low
frequency detectors and then um some
followup work went and discovered them
um in rats um or mice um so they were
found first in artificial neural
networks and then found in biological
neural networks um you know this really
famous result on like grandmother
neurons or the um the Halle Berry neuron
from Quiroga et al and we found very
similar things in in Vision models where
this is while I was still at OpenAI and
I I was looking at their CLIP model um
and you find um these neurons that
respond to the same entities in images
and also to give a concrete example
there we found that there was a Donald
Trump neuron for some reason I guess Everyone
likes to talk about Donald Trump and and
Donald Trump was very prominent was was
very a very Hot Topic at that time so
every every neural network that we
looked at we would find a dedicated
neuron for Donald Trump um that was the
only person who had always had a
dedicated neuron um you know sometimes
you'd have an Obama neuron sometimes
you'd have a Clinton neuron but uh Trump
always had a dedicated neuron so it responds to
you know pictures of his face and the
word Trump like all these things right
um and so it's it's not responding to a
particular example or like it's not just
responding to his face it's it's
abstracting over this General concept
right so in any case that's very similar
to these Quiroga results so there's this
evidence that this phenomenon of
universality the same things form across
both artificial and and natural neural
networks that's that's a pretty amazing
thing if that's true um you know it
suggests that um well I think the thing
that it suggests is that gradient descent is
sort of finding you know the right ways
to cut things apart in some sense that
many systems converge on and and many
different neural networks architectures
converge on that there's there's some
natural set of you know there's some set
of abstractions that are a very natural
way to cut apart the problem and that a
lot of systems are going to converge on
um that would be my my kind of uh you
know I don't know anything about
Neuroscience this is this is just my my
kind of wild speculation from what we've
seen yeah that would be beautiful if
it's sort of agnostic to the
medium of uh of the model that's used to
form the representation yeah yeah and
it's you know it's um a a kind of a wild
speculation based you know we only have
a few data points to suggest this but
you know it it does seem like there's um
there's some sense in which the same
things form again again and again and
again both in certainly in natural
neural networks and and also
artificially or in biologically and the
intuition behind that would be that you
know where in order to be useful in
understanding the real world you need
all the same kind of stuff yeah well if
we pick I don't know like the idea of a
dog right like you know there's some
sense in which the idea of a dog is like
an a a natural category in the universe
or something like this right like you
know
uh uh there's there's some reason it's
it's not just like a weird Quirk of like
how humans Factor you know think about
the world that we have this concept of a
dog it's it's in some sense or or like
if you have the idea of a line like
there's you know like look around us you
know the you know there are lines you
know it's sort of the simplest way to
understand this room in some sense is to
have the idea of a line and so um I
think that that would be my instinct for
why this happens yeah you need a curved
line you know to understand a circle and
you need all those shapes to understand
bigger things and yeah it's a hierarchy
of Concepts that are formed yeah and
like maybe there are ways to go and
describe you know images without
reference to those things right but
they're not the simplest way or the most
economical way or something like this
and so systems converge to these um
these these strategies would would be my
my wild wild hypothesis can you talk
through some of the building blocks that
we've been referencing of features and
circuits so I think you first described
them in uh 2020 paper Zoom In An
Introduction to Circuits absolutely so
um maybe I'll start by just describing
some phenomena and then we can sort of
build to the idea of features and
circuits so um if you spent like quite a
few years maybe maybe like five years to
some extent um with other things
studying this one particular model
Inception V1 um which is this one Vision
model it was um state-of-the-art in 2015
um and uh uh you know very much not
state-of-the-art anymore um and it has
you know maybe about 10,000 neurons and
and I spent a lot of time looking at the
10,000-odd neurons of Inception V1
um and one of the interesting things is
you know there are lots of neurons that
don't have some obvious interpretable meaning
but there's a lot of neurons in
Inception V1 that do have really clean
interpretable meanings um so you find neurons
that just really do seem to detect
curves and you find neurons that really
do seem to detect cars and um car wheels
and car windows and you know floppy ears
of dogs and dogs with long snouts facing
to the right and dogs with long snouts
facing to the left and you know
different kinds of fur and there's
there's sort of this whole beautiful
Edge detectors line detectors color
contrast detectors um these beautiful
things we call high low frequency
detectors you know I think looking at I
sort of felt like a biologist you know
you just you're looking at at this sort
of new world of proteins and you're
discovering all these these different
proteins that
interact um so one way you could try to
understand these models is in terms of
neurons you could try to be like oh you
know there's a dog detecting neuron and
um here's a car detecting neuron and it
turns out you can actually ask how those
connect together so you can go and say
oh you know I have this car detecting neuron
how was it built and it turns out in the
previous layer it's connected really
strongly to a window detector and a
wheel detector and a sort of car body
detector and it looks for the window
above the car and the wheels below and
the car chrome sort of in the middle
sort of everywhere but especially on the
lower part um and that's sort of a
recipe for a car right like that is you
know earlier we said the thing we wanted
from mech interp was to get algorithms to go
and get you know ask what is the the
algorithm that runs well here we're just
looking at the weights of the neural network
reading off this kind of recipe for
detecting cars it's a very simple crude
recipe but it's it's there and so we
call that a circuit this this connection
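The car recipe described above (windows above, wheels below, body roughly in the middle) can be sketched as a single neuron reading three previous-layer features at three vertical positions. Everything here is a hypothetical illustration: the weight values, the three-row layout, and the names WEIGHTS, scene and car_neuron are assumptions made for the example, not weights read from Inception V1.

```python
# Crude sketch of a "car circuit": one neuron in the next layer, connected by
# weights to three earlier features at three vertical positions.
# All weights are made up for illustration.

ROWS = ("top", "middle", "bottom")

WEIGHTS = {
    # windows excite the car neuron when high in the image, inhibit it when low
    "window": {"top": 1.0, "middle": 0.0, "bottom": -1.0},
    # wheels do the opposite
    "wheel": {"top": -1.0, "middle": 0.0, "bottom": 1.0},
    # body helps a little everywhere, more in the middle and lower part
    "body": {"top": 0.25, "middle": 0.5, "bottom": 0.5},
}

def scene(window, wheel, body):
    """Build previous-layer activations: each feature fires (1.0) in one row."""
    placed = {"window": window, "wheel": wheel, "body": body}
    return {feat: {row: 1.0 if row == where else 0.0 for row in ROWS}
            for feat, where in placed.items()}

def car_neuron(acts):
    """Weighted sum over features and positions, followed by a ReLU."""
    total = sum(WEIGHTS[feat][row] * acts[feat][row]
                for feat in WEIGHTS for row in ROWS)
    return max(0.0, total)

upright_car = scene(window="top", wheel="bottom", body="middle")
flipped_car = scene(window="bottom", wheel="top", body="middle")
print(car_neuron(upright_car))
print(car_neuron(flipped_car))
```

With these made-up weights the neuron fires for the upright arrangement and stays silent when windows and wheels swap places, which is the sense in which the weights can be read off as a recipe.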
well okay so the the problem is that not
all of the neurons um are interpretable and
there there's reason to think um we can
get into this more later that there's
this this superposition hypothesis there's
reason to think that sometimes the right
unit to analyze things in terms of um is
combinations of neurons so sometimes
it's not that there's a single neuron
that represents say a car um but it
actually turns out that after you detect the
car the model sort of hides a little bit
of the car in the following layer and a
bunch of a bunch of dog detectors why is
it doing that well you know maybe it
just doesn't want to do that much work
on on on on cars at that point and you
know it's sort of storing it away to go
and um uh so it turns out then that the
sort of subtle pattern of you know
there's all these neurons that you think
are dog detectors and maybe they're
primarily that but they all a little bit
contribute to representing a car um in
in that next layer okay so so now we
can't really think there there might
still be some something I don't know you
could call it like a car concept or
something but it no longer corresponds
to a neuron so we need some term for
these kind of neuron-like entities these
things that we sort of would have liked
the neurons to be these idealized
neurons um the things that are the nice
neurons but also maybe there's more of
them somehow hidden and we call those
features and then what are circuits so
circuits are these connections of
features right so so when we have the
car detector um and it's connected to a
window detector and a wheel detector and
it looks for the Wheels below and the
windows on top um that's a circuit um so
circuits are just collections of
features connected by weights um and
they they Implement algorithms so they
tell us you know how is how are features
used how are they built um how do they
connect together so maybe it's it's it's
worth trying to pin down like what what
really um is the the core hypothesis
here I think the the core hypothesis is
something we call the linear
representation hypothesis so um if we
think about the car detector you know
the more it fires the more we sort of
think of that as meaning oh the model is
more and more confident that um a car
was present um or you know if it's some
combination of neurons that represent a
car you know the more that combination
fires the more we think the model thinks
there's a car present um this doesn't
have to be the case right like you could
imagine something where you have you
know you have this car detector neuron
and you think ah you know if it fires
like you know between one and two that
means one thing but it means like
totally different if it's between three
and four um that would be a nonlinear
representation in principle you
know models could do that I think it's
it's sort of inefficient for them to do
if you try to think about how you'd
Implement computation like that it's
it's kind of an annoying thing to do but
in principle models can do that
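The contrast drawn above between a linear and a nonlinear representation can be sketched in a few lines. This is an assumption-laden toy: CAR_DIRECTION, the two-neuron activation space, and the band labels are hypothetical and chosen only so the arithmetic is easy to follow.

```python
# Linear reading: a feature is a direction in activation space, and the
# projection onto that direction says how much of the feature is present.
# Nonlinear reading (the hypothetical alternative from the discussion):
# the meaning depends on which band the activation falls in.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

CAR_DIRECTION = [1.0, 1.0]  # hypothetical, unnormalized direction over 2 neurons

def linear_readout(activations):
    """More activation along the direction means more 'car'."""
    return dot(activations, CAR_DIRECTION)

def banded_readout(activation):
    """A nonlinear code: 1 to 2 means one thing, 3 to 4 something unrelated."""
    if 1.0 <= activation <= 2.0:
        return "meaning A"
    if 3.0 <= activation <= 4.0:
        return "meaning B"
    return "unassigned"

weak = linear_readout([0.5, 0.5])
strong = linear_readout([1.5, 1.5])

# Linearity: the readout of a sum is the sum of the readouts.
a, b = [0.5, 0.0], [1.0, 2.0]
summed = [x + y for x, y in zip(a, b)]
additive = linear_readout(summed) == linear_readout(a) + linear_readout(b)

print(weak, strong, additive)
print(banded_readout(1.5), banded_readout(3.5))
```

Under the linear reading, scaling the activation only changes how strongly the same thing is represented; under the banded reading, the same neuron would need a lookup table to interpret, which is the inefficiency mentioned above.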
one way to think about the features and
and circuits sort of framework for
thinking about things is that we're
thinking about things as being linear
we're thinking about there as being um
that if a if a neuron or a combination
of neurons fires more it's sort of that
means more of the of a particular thing
being detected and then that gives
weights a very clean interpretation as
these edges between these these entities
that these features um and that that
edge then has a has a meaning um so
that's that's in some ways the the core
thing um it's it's like um you know we
can talk about this sort of outside the
context of neural networks are you familiar with the
word2vec results um so you have like
you know King minus man plus woman
equals Queen well the reason you can do
that kind of arithmetic um is because
you have a linear representation can you
actually explain that representation a
little bit so first off so a feature is
a is a direction of activation you think
it that way can you do the the the minus
man plus woman that that the word2vec
stuff can you explain what that is yeah
there's this very such a simple clean
explanation of what we're talking about
exactly yeah so there's this very famous
result word2vec by um Tomas Mikolov et
al and there's been tons of follow-up
work exploring this so so sometimes we
have these we create these word
embeddings um where uh we map every word
to a vector I mean that in itself by the
way is is kind of a crazy thing if you
haven't thought about it before right
like we we're we're going and and
representing we're turning um you know
like like if if you just learned about
vectors in physics class right uh and
I'm like oh I'm going to actually turn
every word uh in the dictionary into a
vector that's kind of a crazy idea okay
but you could imagine um you could
imagine all kinds of ways in which you
might map words to to
vectors but it it it seems like when we
train neural networks um they like to go
and and map words to vectors such that
there's sort of linear
structure in a particular sense which is
that directions have meaning so for
instance if you there there will be some
direction that seems to sort of
correspond to gender and male words will
be you know far in One Direction and
female words will be in another
Direction and the linear representation
hypothesis is you you could sort of
think of it roughly as saying that
that's actually kind of the fundamental
thing that's going on that that
everything is just different directions
have meanings and adding different
Direction vectors together can represent
Concepts and the Mikolov paper sort of
took that idea seriously and one
consequence of it is that you can you
can do this game of playing sort of
arithmetic with words so you can do king
and you can you know subtract off the
word man and add the word woman and so
you're sort of you know going and and
trying to switch the gender and indeed
if you do that the result will sort of
be close to the word Queen um and you
can you know do other things like you
can do um uh you know Sushi minus Japan
plus Italy and get pizza or uh different
things like this right um so so this is
in some sense the core of the linear
representation hypothesis you can
describe it just as a purely abstract
thing about Vector spaces you can
describe it as a as a statement about um
about the activations of neurons um but
it's really about this this property of
directions having meaning and in some
ways it's even a little subtler than that
it's really I think mostly about this
property of being able to add things
together um that you can sort of
independently modify um say gender and
royalty or
um you know cuisine type or country and
and and and the concept of food by by
adding them do you think the linear
hypothesis holds as it scales so
so far I think everything I have seen is
consistent with this hypothesis and it
doesn't have to be that way right like
like you can write down neural networks
where um you write weights such that
they don't have linear representations
where the right way to understand them
is not is not in terms of linear
representations but I think every
natural neural network I've seen um has this
property um there's been one paper
recently um that there's been some sort
of pushing around the edges so I think
there's been some work recently studying
multi-dimensional features where rather
than a single Direction it's more like
um a manifold of directions this to me
still seems like a linear representation
um and then there's been some other
papers suggesting that maybe um in in
very small models you get nonlinear
representations um I think that the
jury's still out on that
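The word arithmetic described earlier in this answer (king minus man plus woman landing near queen, sushi minus Japan plus Italy landing near pizza) can be sketched with hand-built vectors. These are not learned word2vec embeddings: the four axes, the vector values, and the names VECTORS and analogy are assumptions made up for the illustration, and real embeddings only satisfy such analogies approximately.

```python
# Hand-built toy "embeddings". Hypothetical axes, in order:
#   [royalty, gender (+1 male, -1 female), food, country (+1 Italy, -1 Japan)]
VECTORS = {
    "king":  [1.0,  1.0, 0.0,  0.0],
    "queen": [1.0, -1.0, 0.0,  0.0],
    "man":   [0.0,  1.0, 0.0,  0.0],
    "woman": [0.0, -1.0, 0.0,  0.0],
    "sushi": [0.0,  0.0, 1.0, -1.0],
    "pizza": [0.0,  0.0, 1.0,  1.0],
    "japan": [0.0,  0.0, 0.0, -1.0],
    "italy": [0.0,  0.0, 0.0,  1.0],
}

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def analogy(start, minus, plus):
    """Compute start - minus + plus and return the nearest other word."""
    target = [s - m + p for s, m, p in
              zip(VECTORS[start], VECTORS[minus], VECTORS[plus])]
    # As is common for analogy queries, the three query words are excluded.
    candidates = [w for w in VECTORS if w not in (start, minus, plus)]
    return min(candidates, key=lambda w: squared_distance(VECTORS[w], target))

print(analogy("king", "man", "woman"))
print(analogy("sushi", "japan", "italy"))
```

The example works only because gender, royalty, food, and country were placed on independent directions by construction; the linear representation hypothesis is the claim that trained networks tend to arrange concepts in a similarly additive way on their own.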
um but in I think everything that we've
seen so far has been consistent with the
linear representation hypothesis and
that's that's wild it it doesn't have to
be that way um and yet uh I think that
there's a lot of evidence that certainly
at least this is very very widespread
and so far the evidence is is consistent
with that and I and I I think you know
one thing you might say is you might say
well Christopher you know it's that's a
lot you know to to go and and sort of um
to ride on you know if we don't know for
sure this is true and you're sort of you
know you're investigating neural networks
as though it is true you know isn't that
um isn't that dangerous well you know
but I I think actually there's a there's
a virtue in taking hypotheses seriously
and pushing them as far as they can go
um so it might be that someday we
discover something that is inconsistent
with linear representation hypothesis
but science is full of hypotheses and
theories that were wrong um and we
learned a lot by sort of working under
under them as a sort of an assumption um
and and then going and pushing them as
far as we can I guess I guess this is
sort of the heart of of what Kuhn would
call normal science um um I don't
know if you want we can talk a lot about
about uh philosophy of science and uh
that leads to the paradigm shift so yeah
I love it taking the hypothesis
seriously and take it to a natural
natural conclusion yeah same with the
scaling hypothesis same exactly exactly
and I love it one of my colleagues Tom
Henighan who is a former physicist um
like made this really nice analogy to me
of um uh caloric Theory where you know
once upon a time we thought that heat
was actually you know this thing called
caloric and like the reason you know hot
objects you know would would warm up
cool objects is like the caloric is
flowing through them um and like you
know because we're so used to thinking
about about heat you know in terms of
the modern modern Theory you know that
seems kind of silly but it's actually
very hard to construct uh an experiment
that that sort of disproves the um
caloric hypothesis um and you know you
can actually do a lot of really useful
work believing in caloric for example it
turns out that the original combustion
engines were developed by people who
believed in the caloric theory so I think
there's a virtue in taking hypotheses
seriously even when they might be wrong
yeah yeah there's a deep philosophical
truth to that that's kind of kind of how
I feel about space travel like
colonizing Mars there's a lot of people
that criticize that I think if you just
assume we have to colonize Mars in order
to have a backup for human civilization
even if that's not true that's going to
produce some interesting interesting
engineering and even scientific
breakthroughs I think yeah well and
actually this is another thing that I
think is really interesting so um you
know there's a way in which I think it can
be really useful for society to have
people um almost irrationally dedicated
to investigating particular hypotheses
um because uh well it it takes a lot to
sort of maintain scientific morale and
really push on something when you know
most scientific hypotheses end
up being wrong you know a lot of a lot
of science doesn't doesn't work out um
and but and yet it's you know it's very
it's very useful to go do you know um
there's a joke about Geoff
Hinton um which is that uh Geoff Hinton
has discovered how the brain works every
year for the last 50 years yeah um but
you know I I say that with like you know
the you know with really deep respect
because uh in fact that's actually you
know that that led to him doing some
some really great work yeah he won the
Nobel Prize now who's laughing now
exactly exactly exactly um yeah I think
one wants to be able to pop up and sort
of recognize the the appropriate level
of confidence but I think there's also a
lot of value in just being like you
know I'm going to essentially assume I'm
going to condition on this problem being
possible or this being broadly the right
approach and I'm just going to go and
assume that for a while and go and work
within that um and push really hard on
it um and you know if Society has lots
of people doing doing that for different
things um that's actually really useful
in terms of going and uh getting
to getting you know either really really
ruling things out right we can be like
well you know that didn't work we know
that somebody tried hard um or going in
and getting to something that that does
teach us something about the world so
another interesting hypothesis is the
superposition hypothesis can you
describe what superposition is yeah so earlier
we were talking about word2vec right
and we were talking about how you know
maybe you have one direction that
corresponds to gender and maybe another
that corresponds to royalty and another
one that corresponds to Italy and
another one that corresponds to you know
food and and all these things well you
know um often times maybe these these uh
these word embeddings they might be 500
dimensions a thousand dimensions and so
if you believed that all of those
directions were
orthogonal um then you could only have
you know 500 Concepts and you know I I
love pizza um but like if I was going to
go and like give the like 500 most
important Concepts in um you know the
English language probably Italy wouldn't
be it's not obvious at least that Italy
would be one of them right because you
you have to have things like plural and
singular and uh verb and noun and
adjective and you know um there's a lot
of things we have to get to before we
get to get to Italy um uh and Japan and
you know there's a lot of countries in
the world um and so how might it be that
models could you know simultaneously
have the linear representation
hypothesis be true and also represent
more things than they have directions so
so what does that mean well okay so if
if the linear representation hypothesis
is true something interesting has to be
going on now I'll I'll tell you one more
interesting thing before we we go and we
do that which is um you know earlier we
were talking about all these polysemantic
neurons right um these neurons that you
know when we were looking at InceptionV1
there's these nice neurons like the
car detector and the curve detector and
so on that respond to lots of you know
to very coherent things but there's lots of
neurons that respond to a bunch of
unrelated things that's that's also an
interesting phenomenon um and it turns
out as well that even these neurons that
are really really clean if you look at
the weak activations right so if you
look at like you know the activation
where it's like activating 5% of of the
the you know of the maximum activation
it's really not the core thing that it's
expecting right so if you look at a a
curve detector for instance and you look
at the places where it's 5% active you
know you could interpret it just as
noise or it could be that it's that it's
doing something else there okay so so
how could that be
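A rough numerical sketch of the counting puzzle raised above: a 500-dimensional embedding can hold at most 500 strictly orthogonal directions, yet far more directions fit if they only need to be almost orthogonal. The sizes (500 dimensions, 2000 directions), the choice of random Gaussian directions, and the helper names are illustrative assumptions, not anything taken from a real model.

```python
import numpy as np


def random_unit_vectors(n_vectors, dim, seed=0):
    """Draw random directions and scale each one to unit length (assumed setup)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((n_vectors, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)


def max_abs_cosine(vectors):
    """Largest |cosine similarity| between two different unit vectors."""
    gram = vectors @ vectors.T
    np.fill_diagonal(gram, 0.0)  # ignore each vector's similarity with itself
    return float(np.abs(gram).max())


dim = 500
features = random_unit_vectors(2000, dim)

# At most `dim` of the directions can be linearly independent,
# so strictly orthogonal features are capped at 500.
rank = int(np.linalg.matrix_rank(features))

# Yet all 2000 directions are pairwise far from parallel.
worst = max_abs_cosine(features)

print("directions:", features.shape[0])
print("linearly independent directions:", rank)
print("largest pairwise |cosine|:", round(worst, 3))
```

Under these assumptions the rank shows the hard cap of 500 independent directions, while the largest pairwise cosine similarity stays well below 1, which is the slack that the rest of the answer relies on.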
well there's this amazing thing in
mathematics um called compressed sensing
and it's it's actually this this very
surprising fact where if you have a high
dimensional space and you project it
into a low dimensional space ordinarily
you can't go and sort of unproject it and
get back your high dimensional Vector
right you threw information away this is
like you know you can't you can't invert
a rectangular Matrix um you can only
invert Square
matrices um but it turns out that that's
actually not quite true if I tell you
that the high dimensional Vector was
sparse so it's mostly zeros then it
turns out that you can often go and find
back um the uh the high dimensional
Vector with with very high probability
um so that's a surprising fact right it
says that you know you can um you can
you can have this High dimensional
Vector space and as long as things are
sparse um you can project it down you
can have a lower dimensional projection
of it and that works so the superposition
hypothesis is saying that that's what's
going on in neural networks that's for
instance that's what's going on in word
embeddings the word embeddings are able to
simultaneously have directions be the
meaningful thing and by exploiting the
fact that they're they're operating on a
fairly High dimensional space they're
actually and and the fact that these
concepts are sparse right like you know you
usually aren't talking about Japan and
Italy at the same time um you know most
of the most of those Concepts you know
in most sentences Japan and Italy are
both zero they're not present at all um
and if that's true um then you can go
and have it be the case that um that you
can you can have many more of these sort
of directions that are meaningful these
features than you have dimensions and
similarly when we're talking about neurons
you can have many more Concepts than you
have neurons so that's at a
high level the superposition hypothesis now it has
this even Wilder implication which is um
to go and say that uh neural networks
are it may not just be the case that the
the representations are like this but
the the computation may also be like
this you know the connections between
all of them and so in in some sense
neural networks may be shadows of much
larger sparser neural networks and what
we see are these
projections um and the
strongest version of the superposition
hypothesis would be to take that really
seriously and sort of say you know there
there actually is in some sense this
this upstairs model this you know um
where where the neurons are really
sparse and all interpretable and there's you
know the weights between them are these
really sparse circuits and that's what
we're
studying um and uh the thing that we're
observing is the shadow of it and we
need to find the original object and uh
the process of learning is trying to
construct a compression of the upstairs
model that doesn't lose too much
information in the projection yeah
finding how to fit it efficiently or
something like this um gradient descent is
doing this in fact so this sort of says
that gradient descent you know it
could just represent a dense neural
network but it sort of says that
gradient descent is implicitly searching
over the space of extremely sparse
models that could be projected into this
low dimensional space and there's this large
body of work of people going and
trying to study sparse neural networks
right where you go and you have you
could design neural networks right where
where the edges are sparse and the
activations are sparse and you know my
sense is that work has generally
it feels very principled right it makes
so much sense and yet that that work
hasn't really panned out that well is my
impression broadly and I think that a
potential answer for that is that
actually the neural network is already
sparse in some sense gradient descent was
the whole time you were trying to
go and do this gradient descent was
actually in the behind the scenes going
and searching more efficiently than you
could through the space of sparse models
and going and learning whatever sparse
model was most efficient and then
figuring out how to fold it down nicely
to go and run conveniently on your GPU which
does you know nice dense Matrix
multiplies um and that you just can't
beat that how many Concepts do you think
can be shoved in into a neural network
depends on how sparse they are so there
there's probably an upper bound from the
number of parameters right because you
have to have you still have to have you
know weights that go and connect
them together um so that's that's one
upper bound there are in fact all these
lovely results from compressed sensing
and the Johnson-Lindenstrauss lemma and
and things like this um that they they
basically tell you that if you have a
vector space and you want to have almost
orthogonal vectors which is sort of
probably the thing that you want here
right so you you're going to say well
you know I'm going to give up on having
my my Concepts my features be strictly
orthogonal but I'd like them to not
interfere that much I'm going to have to
ask them to be almost orthogonal um then
this would say that it's actually you
know once you set a threshold for
what you're willing to
accept in terms of how much cosine
similarity there is the number of such vectors is actually
exponential in the number of neurons
that you have so at some point that's
not going to even be the the limiting
factor um but um there are some beautiful
results there and in fact it's probably
even better than that in some sense
because that's sort of for saying
that you know any random set of features
could be active but in fact the features
have sort of a correlational structure
where some features you know are more
likely to co-occur and other ones are
less likely to co-occur and so neural
networks my guess would be can do do
very well in terms of going and uh
packing things in to the point
that that's probably not the
limiting factor how does the problem of
polysemanticity enter the picture here
polysemanticity is this phenomenon we
observe where we look at many neurons
and the neuron doesn't just sort of
represent one one concept it's not it's
not a clean feature it responds to a
bunch of unrelated things and um
superposition you can think of as
being a hypothesis that explains the
observation of polysemanticity um so
polysemanticity is this observed
phenomenon and superposition is a hypothesis
that um would explain it along with
some other things so that makes mech interp more
difficult right so if you if you're
trying to understand things in terms of
individual neurons and you have
polysemantic neurons you're in an awful lot
of trouble right I mean the easiest
answer is like okay well you know you're
looking at the neurons you're trying to
understand them this one responds to a
lot of things it doesn't have a nice
meaning okay well
that's bad um another thing you could
ask is you know ultimately we want to
understand the weights and if you have
two polysemantic neurons and you know each
one responds to three things and then
you know the other neuron responds to
three things and you have a weight between
them you know what does that mean does
it mean that like all three you know
like there's these nine you know nine
interactions going on it's a very weird
thing but there's also a deeper reason
which is related to the fact that neural
networks operate on really high
dimensional spaces so I said that our
goal was you know to understand neural
networks and understand the mechanisms
and one thing you might say is like well
why not it's just a mathematical
function why not just look at it right
like um you know one of the earliest
projects I did studied these these
neural networks that mapped two-
dimensional spaces to two- dimensional
spaces and you can sort of interpret
them in this beautiful way as like
bending manifolds mhm um why can't we do
that well you know as you have have a
higher dimensional space um the volume
of that space in some sense is
exponential in the number of inputs you
have and so you can't just go and
visualize it so we somehow need to break
that apart we need to somehow break that
exponential space into a bunch of things
that we you know some non-exponential
number of things that we can reason
about independently and the independence
is crucial because it's the Independence
that allows you to not have to think
about you know all the exponential
combinations of things and
things being monosemantic things only
having one meaning things having a
meaning that is the key thing that
allows you to think about them
independently and so I think that's
if you want the deepest reason why we
want to have um interpretable monosemantic
features I think that's really the
deep reason and so the goal here as your
recent work has been aiming at is how do
we extract the monosemantic features
from a neural net that has polysemantic
features and all this mess yes we
observe these polysemantic neurons
and we hypothesize that what's going
on is superposition and if
superposition is what's going on there's
actually a sort of well-established
technique that is sort of the principled
thing to do which is dictionary learning
and um it turns out if you do dictionary
learning in particular if you do it in sort of
a nice efficient way that in
some sense sort of nicely regularizes it
as well called a sparse
autoencoder if you train a sparse
autoencoder these beautiful interpretable
features start to just fall out where
there weren't any beforehand and so
that's not the sort of thing that you would
necessarily predict right but it turns
out that that works very very well you
know to me that seems like you know some
non-trivial validation of linear
representations and superposition so with
dictionary learning you're not looking
for particular kind of categories you
don't know what they
are and this gets back to our earlier
point right when we're not making
assumptions gradient descent is smarter
than us so we're not making assumptions
about what's there um I mean one
certainly could do that right one could
assume that there's a PHP feature and go
and search for it but we're not doing
that we're saying we don't know what's
going to be there instead we're just
going to go and let um the sparse
autoencoder discover the things that are
there so can you uh talk to the Towards
Monosemanticity paper from October last year
that had a lot of like nice breakthrough
results that's very kind of you to
describe it that way um yeah I mean this
was um uh our first real success using
sparse autoencoders so we took a one
layer model um and it turns out if you
go and you you know do dictionary
learning on it you find all these really
nice interpretable features so you know the
Arabic feature the Hebrew feature um the
base64 feature those were some
examples that we studied in a lot of
depth and really showed that they were
um what we thought they were it turns out if
you train a model twice as well and
train two different models and do
dictionary learning you find find
analogous features in both of them so
that's fun um you find all kinds of of
different features so that was really
just showing um that um that this works
and um you know I should mention that
there was this Cunningham et al. paper um
that had very similar results around the
same time there's something fun about
doing these kinds of small scale
experiments and finding that it's
actually working yeah well and there's
and there's so much structure here like
you you know so maybe maybe stepping
back for a while um I thought that maybe
all this mechanistic interpretability work um
the end result was going to be that I
would have an explanation for why it was
sort of you know very hard and not going
to be tractable um you know we'd be like
well there's this problem with
superposition and it turns out that
superposition is really hard um and we're kind
of screwed but that's not what happened
in fact a very natural simple technique just
works and so then that's actually a very
good situation you know I think um this
is a sort of hard research problem and
it's got a lot of research risk and you
know it it might still very well fail
but um I think that some amount of some
very significant amount of research risk
um was sort of put behind us when that
started to work can you describe what
kind of features can be extracted in
this way well so it depends on the model
that you're studying right so the the
larger the model the more sophisticated
they're going to be and we'll probably
talk about about follow-up work in a
minute but in these one layer models um
so some very common things I think were
were languages both programming
languages and natural languages there
were a lot of features that were um
specific words in specific contexts so
"the" and I think really the way to think
about this is that "the" is likely about
to be followed by a noun so it's really
you could think of this as a "the" feature
but you could also think of this as
predicting a specific noun feature and
there would be these features that would
fire for "the" in um the context of say
a legal document or a mathematical
document or something something like
this um and so uh you know maybe in the
context of math you're like you know "the"
and then predict vector matrix you know
all these mathematical words whereas in
other contexts you would predict other
things that was that was common and
basically you need clever humans to
assign labels to what we're seeing yes
so you know the only thing
this is doing is sort of um
unfolding things for you so if
everything was sort of folded over on top
of itself you know superposition folded everything
on top of itself you can't really see it
this is unfolding it but now you still
have a very complex thing to try to
understand um so then you have to do a
bunch of work understanding what these
are um and some of them are really
subtle like there's some really cool
things even in this this one layer model
about um Unicode where you know of
course some languages are in Unicode and
the tokenizer won't necessarily have a
dedicated token for every um Unicode um
character so instead what you'll have is
you'll have this these patterns of
alternating token or alternating tokens
that each represent half of a Unicode
character and then you have a different
feature that you know goes and activates
on the on the opposing ones to be like
okay you know um I just finished a
character you know go and predict the
next prefix um then okay on the prefix
you know predict a reasonable suffix um
and you you have to alternate back and
forth so there's you know these these
one-layer models are really interesting
and um uh I mean there's another thing
which is you might think okay there
would just be one base64 feature but it
turns out there's actually a bunch of
base64 features because you can have
English text encoded as base64 and that
has a very different distribution of
base64 tokens than regular and there's
um uh there's there's some things about
tokenization as well that it can exploit
and I don't know there's all kinds of
fun stuff how difficult is the task of
sort of assigning labels to what's going
on can this be automated by AI well I
think it depends on the feature and it
also depends on how much you trust your
AI so um there's a lot of work doing um
automated interpretability I think that's a
really exciting direction and we do a
fair amount of automated interp and
have Claude go and label our features is
there some fun moments where it's
totally right or it's totally wrong yeah
well I think I think it's very common
that it like says something very
general which is like true in some sense
but not really picking up on the
specifics of what's going on um so I
think I think that's a pretty common
situation um I don't know that I have
a particularly amusing one that's
interesting that little gap between it
is true but it doesn't quite
get to the Deep Nuance of a thing yeah
that's a general challenge it's like
it's already an incredible accomplishment that they
can say a true thing but it
it's missing the depth
sometimes and in this context it's like
the ARC challenge you know the sort of
IQ type tests it feels like figuring out
what a feature represents is a bit of is
a little puzzle you have to solve yeah
and and I think that sometimes they're
easier and sometimes they're harder as
well um so
uh yeah I think I think that's tricky
now there's another thing which I don't
know maybe maybe in some ways this is my
like aesthetic coming in but I'll
try to give you a rationalization you
know I'm actually a little suspicious of
automated interpretability and I think that's
partly just that I want humans to
understand neural networks and if the
neural network is understanding it for
me you know I'm I'm not I don't quite
like that but I do have a bit of a you
know in some ways I'm sort of like the
mathematicians who are like you know if
there's a computer automated proof it
doesn't count you know they won't
understand it but I I do also think that
there is um this kind of like
Reflections on trusting trust type issue
where you know if you there's this
famous talk about um uh you know you
like when you're writing a computer
program you have to trust your compiler
and if there was like malware in your
compiler then it could go and inject
malware into the next compiler and you
know you'd be kind of in trouble right
well if you're using neural networks to
go and um verify that your neural
networks are safe the hypothesis that
you're testing for is like okay well the
neural network maybe isn't safe um and
you have to worry about like is there
some way that it could be screwing with
you
um so uh you know I I think that's not a
big concern now um but I do Wonder in
the long run if we have to use really
powerful AI systems to go and uh
you know audit our AI systems is that is
that actually something we can trust but
maybe I'm just rationalizing because I I
just want us to get to a
point where humans understand everything
yeah I mean especially that's hilarious
especially as we talk about AI safety
and looking for features that would
be relevant to AI safety like deception
and so on uh so let's let's talk about
the Scaling Monosemanticity paper in May
2024 okay so what did it take to scale
this to apply to Claude 3 Sonnet well a
lot of GPUs a lot more GPUs um but one
of my teammates Tom Henighan um was
involved in the original scaling laws
work um and something that he was sort
of interested in from very early on is
are there scaling laws for
interpretability um and so um something he sort
of immediately did when when this work
started to succeed and we started to
have sparse autoencoders work we became
very interested in you know what are the
scaling laws for um uh you know for
making sparse autoencoders
larger and how does that relate to
making the base model larger um and so
um it turns out this works really well
and you can use it to sort of project um
you know if you train a sparse autoencoder
of a given size you know how many
tokens should you train on and so on so
this was actually a very big help to us
in scaling up um this work um and made
it a lot easier for us to go and train
um you know really large sparse
autoencoders where you know um it's not like
training the big models but it's it's
starting to get to a point where it's
actually actually expensive to go um and
train the really big ones so you have to
I mean you have to do all the stuff of
like splitting it across large I mean
there's a huge engineering challenge
here too right so yes so so there's
there's a there's a scientific question
of how do you scale things effectively
um and then there's an enormous amount
of engineering to go and scale this up
you have to you have to shard it you
have to you have to think very carefully
about a lot of things I'm lucky to work
with a bunch of great Engineers cuz I am
definitely not a great engineer yeah on
the infrastructure especially yeah for
sure so it turns out tldr it worked it
worked yeah and and I think this is
important because you could have
imagined you could like you could have
imagined a world where you said after
Towards Monosemanticity you know Chris
this is great you know it works on a one
layer model but one layer models are
really idiosyncratic um like you know
maybe there's just something like
maybe the linear representation
hypothesis and superposition hypothesis is the
right way to understand a one layer
model but it's not the right way to
understand large models um and so I
think um I mean first of all like the
Cunningham et al. paper sort of um cut
through that a little bit and sort
of suggested that this wasn't the case
but um Scaling Monosemanticity sort of I think was
significant evidence that even for very
large models and we did it on Claude 3
Sonnet which at that point was uh one of
our production models um you know even
these models um seem to be very you know
seem to be substantially explained at
least by linear features and you know
doing dictionary learning on them works
and as you learn more features you go
and you explain explain more and more so
that's a I think a quite a promising
sign and you find now really fascinating
abstract features um and the features
are also multimodal they respond to
images and text for the same concept
which is fun yeah this can you explain
that I mean like you know back door
there's just a lot of examples that you
can yeah so maybe maybe let's start with
one example to start which is we found
some features around sort of security
vulnerabilities and backdoors in code
so it turns out those are actually two
different features um so there's a
security vulnerability feature and if
you force it active Claude will start to
go and write um security vulnerabilities
like buffer overflows into code and it
also it fires for all kinds of things
like you know some of some of the top
data set examples for it were things
like you know dash dash disable um you
know SSL or something like this which
are sort of obviously really um uh
really insecure so at this point it's
kind of like maybe it's just because the
examples are presented that way it's
kind of like surface a little bit more
obvious examples right um I guess the
the idea is that down the line might be
able to detect more Nuance like
deception or bugs or that kind of stuff
yeah well I maybe I want to distinguish
two things so um one is um the
complexity of the feature or the concept
right and the other is
the nuance of how subtle the
examples we're looking at right so when
we when we show the top data set
examples those are the most extreme
examples that cause that feature to
activate um and so it doesn't mean that
it doesn't fire for more subtle things
so the UN you know the insecure um code
feature you know the stuff that it fires
for most strongly for are these like
really obvious you know disable the
security type things um but um um you
know uh it it also Fires for you know
buffer overflows and and more subtle
security vulnerabilities in code you
know these features are all multimodal
so you could ask like what images
activate this feature and it turns out
um that the uh the the security
vulnerability feature activates for
images of um uh like people clicking on
Chrome to like go past the like you know
this this website uh the SSL certificate
might be wrong or something like this
another thing that's very entertaining
is there's a backdoors in code feature
like you activate it and Claude
writes a backdoor that like will go and
dump your data to port or something but
you can ask okay what what images
activate the back door feature it was
devices with hidden cameras in them so
there's a whole apparently genre of
people going and selling devices that
look innocuous that have hidden cameras
and they have ads about how there's a
hidden camera in it and I guess that is
the you know physical version of a back
door um and so it sort of shows you how
abstract these concepts are right um and
I I just thought that was uh I I'm sort
of sad that there's a whole Market of
people selling devices like that but I
was kind of delighted that that was the
the thing that it came up with as the
the top uh image examples for the
feature yeah it's nice it's multimodal
it's almost multi-context it's a
broad strong definition of a singular
concept it's nice yeah to me one of the
really interesting features especially
for AI safety is deception and lying and
the possibility that these kinds of
methods could detect uh lying in a model
especially as it gets smarter and smarter and
smarter presumably that's a big threat
of a super intelligent model that it can
deceive the people operating
it uh as to its intentions or any of
that kind of stuff so what what have you
learned from detecting lying inside
models yeah so I think we're in some
ways in early days for that we find
quite a few features related to
deception and lying there's one feature
that fires for people lying and being
deceptive and you force it active and
Claude starts lying to you so we have a
have a deception feature I mean there's
all kinds of other features about
withholding information and not
answering questions features about power
seeking and coups and stuff like that
there's a lot of features that are kind of
related to Spooky things and if you um
force them active Claude will will
behave in ways that are not the
kind of behaviors you want what are
possible next exciting directions to you
in the space of uh mech interp well there's
a lot of things
um so for one thing I would really like
to get to a point where we have circuits
where we can really understand um not
just the features uh but then use that
to understand the computation of models
um that really for me is is the the
ultimate goal of this um and there's
been some work we we put out a few
things there's a paper from Sam Marks
that does some stuff like this there's
been some I'd say some work around the
edges here um but I think there's a lot
more to do and I think that will be a
very exciting thing um that's related to
a challenge we call interference weights
um where um due to superposition if you
just sort of naively look at whether
features are connected together there may
be some weights that sort of don't exist
in the upstairs model but are just sort
of artifacts of of superposition so
that's a a sort of technical challenge
related to that
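One way to picture the interference weights mentioned above, as a toy sketch rather than the speaker's own definition: assume a hypothetical upstairs model whose features connect only to themselves, embed those features as random unit directions in a smaller space, and look at the feature-to-feature weights implied by that embedding. The sizes, the random embedding, and every name below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, dim = 64, 16  # assumed sizes: more features than dimensions

# Hypothetical "upstairs" model: every feature connects only to itself.
upstairs_weights = np.eye(n_features)

# Assumed superposition: each feature gets a random unit direction in `dim` dims.
directions = rng.standard_normal((n_features, dim))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Feature-to-feature weights as they appear when read off through the
# low-dimensional embedding (write along one direction, read along another).
apparent_weights = directions @ directions.T

off_diagonal = ~np.eye(n_features, dtype=bool)
spurious = apparent_weights[off_diagonal]

print("upstairs cross-feature weights that are nonzero:",
      int(np.count_nonzero(upstairs_weights[off_diagonal])))
print("apparent cross-feature weights above 0.1 in size:",
      int(np.sum(np.abs(spurious) > 0.1)))
print("largest apparent cross-feature weight:",
      round(float(np.abs(spurious).max()), 3))
```

In this toy setup the upstairs model has no cross-feature connections at all, so every nonzero off-diagonal entry of `apparent_weights` is purely an artifact of packing 64 features into 16 dimensions.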
um I think another exciting direction is
just I you know you might think of of
sparse autoencoders as being kind of
like a telescope they allow us to you
know look out and see all these features
that are out there and you
know as we build better and better
sparse autoencoders get better and better
at dictionary learning we see more and
more stars um and you know we zoom in on
smaller and smaller stars but there's kind
of um a lot of evidence that we're
still only seeing a very small fraction of
the stars there's a lot of matter in our
you know neural network universe
that we can't observe yet um and it may
be that um that we'll never be able to
have fine enough instruments to observe
it and maybe some of it just isn't
possible um isn't computationally
tractable to observe and there's sort of a
kind of dark matter not in
maybe the sense of modern astronomy but of
earlier astronomy when we didn't know
what this unexplained matter is um and
so I think a lot about that dark
matter and whether we'll ever observe it
and what that means for safety if
we can't observe it if you know
some significant fraction of neural
networks are not accessible to us um
another question that I think a lot
about is uh at the end of the day you
know mechanistic interpretability is a very
microscopic um approach to interpretability
it's trying to understand things in a
very fine grained way but a lot of the
questions we care about are very
macroscopic um you know we care about
these questions about neural network
behavior
and I think that's the thing that I care
most about but there's lots of
other sort of larger scale
questions you might care about um
and somehow you know the nice thing
about about having a very microscopic
approach is it's maybe easier to ask you
know is this true but the downside is
it's much further from the things we
care about and so we now have this
ladder to climb and I think there's a
question of will we be able to find
are there sort of larger scale
abstractions that we can use to
understand neural networks can we
get up from this very microscopic
approach yeah you've written
about this kind of organs question
yeah exactly if we uh think of
interpretability as a kind of anatomy of
neural networks most of the circuits
thread involves studying tiny little
veins looking at the small scale and
individual neurons and how they connect
however there are many natural questions
that the small scale approach doesn't
address in contrast the most prominent
abstractions in biological anatomy
involve larger scale structures like
individual organs like the heart or
entire organ systems like the
respiratory system and so we wonder is
there a respiratory system or heart or
brain region of an artificial neural
network yeah exactly um and I mean like
if you think about science right a lot
of scientific fields um you know
investigate things at many levels of
abstraction in biology you have like
you know molecular biology studying you
know proteins and molecules and so on
and you have cellular biology and then
you have histology studying tissues and
you have anatomy and then you have
zoology and then you have ecology and so
you have many many levels of abstraction
or you know physics maybe the physics of
individual particles and then you know
statistical physics gives you gives you
thermodynamics and things like this and
so you often have different levels of
abstraction um and I think that right
now we have you know mechanistic
interpretability which if it succeeds is sort of like
a microbiology of neural networks but
we want something more like anatomy and
so you know a question you might ask
is why can't you just go there
directly and I think the answer is superposition
at least in significant part it's
that it's actually very hard to see
this macroscopic structure
without first sort of breaking down the
microscopic structure in the right way
and then studying how it connects
together um but I'm I'm hopeful that
there is going to be something much
larger than um features and circuits and
that we're going to be able to have a
story that involves much
bigger things and then you can sort
of study in detail the parts you care
about as opposed to neurobiology like a
psychologist or psychiatrist of a neural
network and I think that the
beautiful thing would be if we could go
and rather than having disparate fields
for those two things if you could
build a bridge between them such that
you could go and um uh have all of your
higher level abstractions be grounded
very firmly in this very solid um you
know more rigorous ideally foundation
what do you think is the difference
between the human brain the biological
neural network and the artificial neural
network well the neuroscientists have a
much harder job than us you know
sometimes I just like count my blessings
by how much easier my job is than the
neuroscientists' right so um we
can record from all the neurons yeah we
can do that on arbitrary amounts of data
um the neurons don't change while you're
doing that by the way mhm um you can go
and ablate neurons you can edit the
connections and so on um and then you can
undo those changes that's pretty great
yeah um you can
intervene on any neuron and force it
active and see what happens um we know
which neurons are connected to
everything right neuroscientists
want to get the connectome we have the
connectome um and we have it for like much
bigger than C. elegans um and then not
only do we have the connectome um we
know uh you know which neurons
excite or inhibit each other right so
it's not just that we know
like the binary mask we know the
weights um we can take gradients we know
computationally what each neuron does um
so I don't know the list goes on and on we
just have um so many advantages over
neuroscientists and then despite having
all those advantages it's really hard
and so one thing I do sometimes think is
like gosh like if it's this hard for us
it seems impossible under the
constraints of neuroscience or you know
near impossible um I don't know
maybe part of me is like I've got
a few neuroscientists on my team maybe
I'm sort of like ah you know um
maybe the neuroscientists maybe
some of them would like to have an
easier problem that's still very hard um
and they could come and work on
neural networks and then after we
figure out things in sort of the easy
uh little pond of trying to understand
neural networks which is still very hard
then we could go back to
biological neuroscience I love what
you've written about the goal of mech
interp research as uh two goals safety and
beauty so can you talk about the beauty
side of things yeah so you know there's
this funny thing where I think some
people want uh some people are kind of
disappointed by neural networks I think
where they're like ah you know neural
networks
um it's just these simple rules
then you just like do a bunch of
engineering to scale it up and it works
really well and like where's the like
complex ideas you know this isn't like a
very nice beautiful scientific
result and I sometimes think when people
say that I picture them being like you
know evolution is so boring it's just a
bunch of simple rules and you run
evolution for a long time and you get
biology like what a sucky uh
you know way for biology to have turned
out where's the complex rules but
the beauty is that the simplicity
generates complexity um you know biology
has these simple rules and it gives rise
to you know all the life and ecosystems
that we see around us all the beauty of
nature that all just comes from
evolution and from something very simple in
evolution and similarly I think that
neural networks build you know create
enormous um complexity and beauty inside
and structure inside themselves that
people generally don't look at and don't
try to understand because it's hard
to understand but I think that there
is an incredibly rich structure to
be discovered inside neural networks a lot of
very deep beauty um if we're
just willing to take the time to go and
see it and understand it yeah I love
mech interp the feeling like we are
understanding or getting glimpses of
understanding the magic that's going on
inside is really wonderful it feels to
me like one of the questions that's just
calling out to be asked and I'm sort of
I mean a lot of people are thinking about
this but I'm often surprised that not more
are is how is it that we don't know how to
create computer systems that can do
these things and yet we have these
amazing systems that we don't know how
to directly create computer programs
that can do these things but these
neural networks can do all these amazing
things and it just feels like that is
obviously the question that sort of is
calling out to be answered if
you have any degree of curiosity
it's like how is it that humanity
now has these artifacts that can do
these things that we don't know how to
do yeah I love the image of the circuits
reaching towards the light of the objective
function yeah it's just it's this
organic thing that we've grown and we
have no idea what we've grown well thank
you for working on safety and thank you
for appreciating the beauty of the
things you uh discover and thank you for
talking today Chris this is wonderful
thank you for taking the time to chat as
well thanks for listening to this
conversation with Chris Olah and before
that with Dario Amodei and Amanda Askell to
support this podcast please check out
our sponsors in the description and now
let me leave you with some words from
Alan Watts
the only way to make sense out of change
is to plunge into it move with it and
join the
dance thank you for listening and hope
to see you next time