Transcript
_1f-o0nqpEI • DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters | Lex Fridman Podcast #459
Kind: captions
Language: en
the following is a conversation with
Dylan Patel and Nathan Lambert Dylan
runs SemiAnalysis, a well respected
research and analysis company that
specializes in semiconductors, GPUs, CPUs,
and AI Hardware in general Nathan is a
research scientist at the Allen
Institute for AI and is the author of
the amazing blog on AI called
Interconnects they are both highly
respected and listened to by the
experts researchers and engineers in the
field of AI and personally I'm just a
fan of the two of them so I used the
DeepSeek moment that shook the AI world
a bit as an opportunity to sit down with
them and lay it all out from DeepSeek,
OpenAI, Google, xAI, Meta, Anthropic to
NVIDIA and TSMC and to US China Taiwan
relations and everything else that is
happening at the cutting edge of AI this
conversation is a deep dive into many
critical aspects of the AI industry
while it does get super technical we try
to make sure that it's still accessible
to folks outside of the AI field by
defining terms stating important
Concepts explicitly spelling out
acronyms and in general always moving
across the several layers of abstraction
and levels of detail there is a lot of
hype in the media about what AI is and
isn't the purpose of this podcast in
part is to cut through the hype and
the low resolution
analysis and to discuss in detail how
stuff works and what the implications
are let me also if I may comment on the
new OpenAI o3-mini reasoning model the
release of which we were anticipating
during the conversation and it did
indeed come out right after its
capabilities and costs are on par with
our expectations as we
stated OpenAI o3-mini is indeed a great
model but it should be stated that
DeepSeek R1 has similar performance
on benchmarks is still cheaper and it
reveals its chain of thought reasoning
which o3-mini does not it only shows a
summary of the reasoning plus R1 is open
weight and o3-mini is
not by the way I got a chance to play
with o3-mini and anecdotal vibe
check-wise I felt that o3-mini
specifically o3-mini-high is better
than R1 still for me personally I find
that Claude Sonnet 3.5 is the best model
for programming except for tricky cases
where I will use o1 pro to
brainstorm either way many more better
AI models will come including reasoning
models both from American and Chinese
companies
they will continue to shift the cost
curve but the quote deep seek moment is
indeed real I think it will still be
remembered 5 years from now as a pivotal
event in Tech History due in part to the
geopolitical implications but for other
reasons too as we discuss in detail from
many perspectives in this
conversation this is the Lex Fridman
podcast to support it please check out
our sponsors in the description and now
dear friends here's Dylan Patel and
Nathan
Lambert a lot of people are curious to
understand China's DeepSeek models so
let's lay it out Nathan can you describe
what deep seek V3 and deep seek R1 are
how they work how they're trained Let's
uh look at the big picture and then
we'll zoom in on the details yeah so
DeepSeek V3 is a new mixture of experts
transformer language model from DeepSeek,
which is based in China they have
some new specifics in the model that
we'll get into largely this is an open
weight model and it's an instruction
model like what you would use in chat
GPT um they also release what is called
the base model which is before these
techniques of posttraining most people
use instruction models today and those
are what's served in all sorts of
applications this was released on I
believe December 26th or that week and
then weeks later on January 20th
DeepSeek released DeepSeek R1 which is a
reasoning model which really accelerated
a lot of this discussion this reasoning
model has a lot of overlapping training
steps to deep seek V3 and it's confusing
that you have a base model called V3
that you do some things to get a chat model
and then you do some different things to
get a reasoning model I think a lot of
the AI industry is going through this
challenge of communications right now
where OpenAI makes fun of their own
naming scheme they have GPT-4 they have
OpenAI
o1 and there's a lot of types of models
so we're going to break down what each
of them are there's a lot of technical
specifics on training and go from high
level to specific and kind of go through
each of them there's so many places we
can go here but maybe let's go to open
weights first what does it mean for a
model to be open weights and what are
the different flavors of Open Source in
general yeah so this discussion has been
going on for a long time in AI it became
more important since ChatGPT, or more
in focus since ChatGPT at the end of 2022
open weights is the accepted term for
when model weights of a language model
are available on the internet for people
to download those weights can have
different licenses which are
effectively the terms by which you can
use the model there are licenses that
come from history and open source
software there are licenses that are
designed by companies specifically um
all of Llama, DeepSeek, Qwen, Mistral, these
popular names in open weight models have
some of their own licenses it's
complicated because not all the same
models have the same
terms the big debate is on what makes a
model open weight it's like why are we
saying this term it's kind of a mouthful
it sounds close to open source but it's
not the same there's still a lot of
debate on the definition and soul of
open- source AI open source software has
a rich history on freedom to modify
freedom to take on your own freedom for
many restrictions on how you would use
the software and what that means for AI
is still being defined
so uh for what I do I work at the Allen
Institute for AI we're a nonprofit We
want to make AI open for everybody and
we try to lead on what we think is truly
open source there's not full agreement
in the community but for us that means
releasing the training data releasing
the training code and then also having
open weights like this and we'll get
into the details of the models and again
and again as we try to get deep into how
the models were trained we
will say things like the data processing
data filtering data quality is the
number one determinant of the model
quality and then a lot of the training
code is the determinant on how long it
takes to train and how fast your
experimentation is so without fully
open- Source models where you have
access to this data it is hard to know
or it's harder to replicate so we'll get
into cost numbers for DeepSeek V3 on mostly
GPU hours and how much you could pay to
rent those yourselves but without the
data the replication cost is going to be
far far higher and same goes for the
code we should also say that this is
probably one of the more open models out
of the frontier models so like in this
full spectrum where probably the fullest
open source like you said open code open
data open weights this is not open code
this is probably not open data
and this is open weights and the
licensing is uh MIT license or it's uh I
mean there's some nuance and the
different models but it's towards the
free in terms of the open source
movement these are the kind of the good
guys yeah deep seek is doing fantastic
work for disseminating understanding of
AI their papers are extremely detailed
in what they do and for other teams
around the world they're very actionable
in terms of improving your own training
techniques
uh and we'll talk about licenses more
the Deep seek R1 model has a very
permissive license it's called the MIT
license that effectively means there's
no Downstream restrictions on commercial
use there's no use case restrictions you
can use the outputs from the models to
create synthetic data and this is all
fantastic I think the closest peer is
something like llama where you have the
weights and you have a technical report
and the technical report is very good
for Llama one of the most read PDFs of
the year last year is the Llama 3 paper
but in some ways it's slightly less
actionable it has less details on the
training specifics like less plots um
and so on and the Llama 3 license is
more restrictive than MIT and then
between the DeepSeek custom license and
the Llama license we could get into this
whole rabbit hole I think we'll
make sure we go down the license
rabbit hole before we do specifics yeah
and I mean so it should be stated that
one of the implications of DeepSeek is it
puts pressure on Llama and everybody
else on OpenAI to push towards open
source and that's the other side of Open
Source that uh you mentioned is how much
is published in detail about it so how
open are you with the sort of the
insights behind the code so like how
good are the technical reports are they
hand wavy or are there actual details
in there and that's one of the things
that deep seek did well is they publish
a lot of the details yeah especially in
the DeepSeek V3 paper which is their pre-training
paper they were very clear that they are
doing interventions on the technical
stack that go at many different levels
for example to get highly
efficient training they're making
modifications at or below the CUDA layer
for NVIDIA
chips I have never worked there myself
and there are few people in the world
that do that very well and these types of
people are at DeepSeek and leading
American frontier labs but there are not
many places to help people understand
the other implication of open weights
just you know there's uh a topic we'll
return to often here
so there's a uh fear that China the
nation might have interest in um
stealing American data violating privacy
of American citizens what can we say
about open weights to help us understand
what the weights are able to do
yeah in terms of stealing people's data
yeah so these weights that you can
download from hugging face or other
platforms are very big matrices of
numbers you can download them to a
computer in your own house that has no
internet and you can run this model and
you're totally in control of your data that
is something that is different than how
a lot of language model usage is
actually done today which is mostly
through apis where you send your prompt
to gpus run by certain companies and
these companies will have different
distributions and policies on how your
data is stored if it is used to train
future models where it is stored if it
is encrypted and so on so with open
weights you have the fate of your data in
your own hands and that is something
that is deeply connected to the soul of
Open Source so it's not the model that
steals your data it's whoever's hosting
the model which could be China if you're
using the Deep seek app or it could be
perplexity uh you know you're trusting
them with your data or open AI you're
trusting them with your data and some of
these are American companies some of
these are Chinese companies but the
model itself is not doing the stealing
it's the host all right
so uh back to the basics what's the
difference between deep seek V3 and deep
seek R1 can we try to like lay out the
potential confusion yes so for one I'm
very understanding of many people
being confused by these two model names
so I would say the best way to think
about this is that when training a
language model you have what is called
pre-training which is when you're
predicting on large amounts of mostly
internet text you're trying to predict
the next token and what to know about
these new deep seek models is that they
do this internet large scale
pre-training once to get what is called
Deep seek V3 base this is a base model
it's just going to finish your sentences
for you it's going to be harder to work
with than chat GPT and then what deep
seek did is they've done two different
posttraining regimes to make the models
have specific desirable behaviors so
what is the more normal model in terms
of the last few years of AI an instruct
model a chat model a quote unquote
aligned model a helpful model there are
many ways to describe this is more
standard post training so this is things
like instruction tuning reinforce
learning from Human feedback we'll get
into some of these words and this is
what they did to create the DeepSeek V3
model this was the first model to be
released and it is very performant
it's competitive with GPT-4, Llama 405B
on and then when this
release was happening we don't know
their exact timeline or soon after they
were finishing the training of a
different training process from the same
next token prediction base model that I
talked about which is when this new
reasoning training that people have
heard about comes in in order to create
the model that is called Deep seek R1
the R through this conversation is good
grounding for reasoning and the name
is also similar to OpenAI o1 which is the
other reasoning model that people have
heard about and we have to break down
the training for R1 in more detail
because for one we have a paper
detailing it but also it is a far newer
set of techniques for the AI community
so it is a much more rapidly evolving area
of research maybe we should also say the
big two categories of training of
pre-training and posttraining these
umbrella terms that people use so what
is pre-training and what is posttraining
and what are the different flavors of
things underneath posttraining umbrella
yeah so pre-training I'm using some of
the same words to really get the message
across is you're doing what is called
autoregressive prediction to predict
the next token in a series of documents
this is done over standard practice is
trillions of tokens so this is a ton of
data that is mostly scraped from the web
in some of DeepSeek's earlier papers they
talk about their training data being
distilled for math and I shouldn't use
this word yet but taken from common
crawl and that's a public access that
anyone listening to this could go
download data from the common crawl
website this is a crawler that is
maintained publicly yes other tech
companies eventually shift to their own
crawler and DeepSeek likely has done this
as well as most Frontier Labs do but
this sort of data is something that
people can get started with and you're
just predicting text in a series of
documents
this can be scaled to be very efficient
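That single loss, next-token cross-entropy, can be sketched with a toy stand-in model (bigram counts instead of a transformer; the corpus and numbers here are made up purely for illustration):

```python
import math

# Toy illustration of the one pre-training loss: next-token cross-entropy.
# The "model" here is just bigram counts over a tiny corpus, a stand-in for
# a transformer that outputs a probability for each possible next token.
corpus = "the cat sat on the mat the cat ran".split()

# Count how often each token follows each other token.
counts = {}
for prev, nxt in zip(corpus, corpus[1:]):
    counts.setdefault(prev, {}).setdefault(nxt, 0)
    counts[prev][nxt] += 1

def prob(prev, nxt):
    """P(next token | previous token) from the bigram counts."""
    following = counts.get(prev, {})
    total = sum(following.values())
    return following.get(nxt, 0) / total if total else 0.0

# Average negative log-likelihood over the corpus, the quantity that
# pre-training drives down across trillions of tokens.
nll = [-math.log(prob(p, n)) for p, n in zip(corpus, corpus[1:]) if prob(p, n) > 0]
loss = sum(nll) / len(nll)
print(round(loss, 3))  # -> 0.412
```

A real run replaces the count table with a neural network and minimizes this same average negative log-likelihood with gradient descent.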
and there's a lot of numbers that are
thrown around in AI training like how
many floating Point operations or flops
are used and then you can also look at
how many hours of these gpus that are
used and it's largely one loss function
taken to a
very large amount of compute usage
you just set up really efficient
systems and then at the end of that you
have this base model and post-training
is where there is a lot more
complexity in terms of how the process
is emerging or evolving and the
different types of training losses we'll
use I think this is a lot of techniques
grounded in the natural language
processing literature the oldest
technique which is still used today is
something called instruction tuning or
also known as supervised fine tuning
these acronyms will be IFT or SFT
people really go back and forth
between them and I will probably do
the same which is where you add this
formatting to the model where it knows
to take a question that is like explain
the history of the Roman Empire to
me or something a sort of
question you'll see on Reddit or stack
Overflow and then the model will respond
in an information-dense but presentable
manner the core of that formatting is in
this instruction tuning phase and then
there's two other categories of loss
functions that are being used today one
I will classify as preference fine
tuning preference fine tuning is a
generalized term for what came out of
reinforcement learning from Human
feedback which is RLHF this reinforcement
learning from human feedback is credited
as the technique that helped ChatGPT
break through it is a technique to make
the responses that are nicely formatted
like these Reddit answers more in tune
with what a human would like to read
this is done by collecting pairwise
preferences from actual humans out in
the world to start and now AIS are also
labeling this data and we'll get into
those trade-offs and you have this kind
of contrastive loss function between a
good answer and a bad answer and the
model learns to pick up these trends
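That contrastive idea is often implemented as a pairwise Bradley-Terry style objective for a reward model; here is a minimal sketch (the scores are illustrative, not any lab's actual recipe):

```python
import math

# Sketch of a pairwise preference loss of the kind used in RLHF reward
# models: a reward model scores two answers to the same prompt, and the
# loss pushes the chosen (human-preferred) answer's score above the
# rejected one.
def preference_loss(score_chosen, score_rejected):
    # -log sigmoid(chosen - rejected): near 0 when chosen >> rejected,
    # large when the model prefers the rejected answer.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

good = preference_loss(2.0, -1.0)   # reward model agrees with the human label
bad = preference_loss(-1.0, 2.0)    # reward model disagrees
print(good < bad)  # -> True
```

Minimizing this over many human-labeled pairs is what teaches the reward model (and, through it, the policy) which answers people prefer.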
there's different implementation ways
you have things called reward models you
could have direct alignment algorithms
there's a lot of really specific things
you can do but all of this is about
fine-tuning to human
preferences and the final stage is much
newer and we'll link to what is done
in R1 and these reasoning models I
think this is OpenAI's name for it they had
this new API in the fall which they
called the reinforcement fine-tuning
API this is the idea that you use the
techniques of reinforcement learning
which is a whole framework of AI there's
a deep literature here to summarize it's
often known as trial and error learning
or the subfield of AI where you're
trying to make sequential decisions in a
certain potentially noisy
environment there's a lot of ways we
could go down that but fine-tuning
language models where they can generate
an answer and then you check to see if
the answer matches the true solution for
math or code you have an exactly correct
answer for math you can have unit tests
for code and what we are doing is we are
checking the language models work and
we're giving it multiple opportunities
on the same questions to see if it is right
and if you keep doing this the models
can learn to improve in verifiable
domains uh to a great extent it works
really well it's a newer technique in
the academic literature it's been used
at Frontier labs in the US that don't
share every detail uh for multiple years
so this is the idea of using
reinforcement learning with language
models and it has been taking off
especially in this deep seek moment and
we should say that there's a lot of
exciting stuff going on again
across the stack but in post-training
probably this year there's going to be a
lot of interesting developments in
post-training we'll talk about it
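The verifiable-rewards loop described above (sample answers, check them against a known solution, reinforce the correct ones) can be sketched with a toy policy; this is an illustration of the idea only, not DeepSeek's or OpenAI's actual training stack:

```python
import random
random.seed(1)

# Toy sketch of reinforcement learning with verifiable rewards: sample
# several answers to a math question, grade each against the known
# solution (no human labeler needed), and shift the policy toward
# answers that were checked as correct.
question, truth = "7 * 8", 56
# "Policy": a distribution over candidate answers (a stand-in for an LLM).
policy = {54: 0.3, 55: 0.3, 56: 0.2, 58: 0.2}

for _ in range(50):
    # Sample a batch of attempts at the same question.
    answers = random.choices(list(policy), weights=policy.values(), k=8)
    for a in answers:
        reward = 1.0 if a == truth else 0.0   # verifiable check
        # Crude policy-gradient-flavored update: upweight rewarded answers.
        policy[a] *= (1.0 + 0.5 * reward)
    total = sum(policy.values())
    policy = {a: p / total for a, p in policy.items()}

best = max(policy, key=policy.get)
print(best)  # -> 56: the checkable answer comes to dominate
```

Real systems update neural network weights with proper policy-gradient methods over generated text; this toy only captures the reward signal, which is why it works only in verifiable domains like math and code.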
I almost forgot to talk about the
difference between DeepSeek V3 and
R1 on the user experience side so forget
the technical stuff forget all that just
people that don't know anything about AI
they show up like what's the actual
experience what's the use case for each
one when they actually like type and
talk to it what is it good at and
that kind of thing so let's start with
deep seek V3 again it's what more people
would have tried something like it you
ask it a question it'll start generating
tokens very fast and those tokens will
look like a very human legible answer
it'll be some sort of markdown list it
might have formatting to help you draw
to the core details in the answer and
it'll generate tens to hundreds of
tokens a token is normally a word for
common words or a subword part in a
longer word and it'll look like a very
high quality Reddit or stack Overflow
answer these models are really getting
good at doing these across a wide
variety of domains I think even things
that if you're an expert things that are
close to The Fringe of knowledge they
will still be fairly good at I think
Cutting Edge AI topics that I do
research on these models are capable for
study aid and they're regularly updated
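Since tokens keep coming up, a token is normally a word for common words or a subword piece of a longer word. A toy greedy longest-match tokenizer makes the split concrete (real models learn a BPE-style vocabulary; this vocabulary is made up for the example):

```python
# Toy subword tokenization: greedy longest-match over a tiny, made-up
# vocabulary, illustrating why a common word is one token while a rarer
# word becomes several subword tokens.
vocab = {"the", "cat", "un", "believ", "able", "token", "iz", "ation", "s", "a"}

def tokenize(word):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking until a match.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character falls back to itself
            i += 1
    return tokens

print(tokenize("cat"))            # -> ['cat'] : common word, one token
print(tokenize("unbelievable"))   # -> ['un', 'believ', 'able'] : subwords
```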
this changes with DeepSeek R1
what is called these reasoning models is
when you see tokens coming from these
models to start it will be a
large chain of thought process we'll get
back to Chain of Thought in a second
which looks like a lot of tokens where
the model is explaining the problem the
model will often break down the problem
be like okay they asked me for this
let's break down the problem I'm going
to need to do this and you'll see all of
this generating from the model it'll
come very fast in most user experiences
these APIs are very fast so you'll see a
lot of tokens a lot of words show up
really fast it'll keep flowing on the
screen and this is all the reasoning
process and then eventually the model
will change its tone in R1 and it'll
write the answer where it summarizes its
reasoning process and writes a
similar answer to the first types of
model but in DeepSeek's case which is
part of why this was so
popular even outside the AI Community is
that you can see how the language model
is breaking down problems and then you
get this answer on a technical side
they train the model to do this
specifically where they have a section
which is reasoning and then it generates
a special token which is probably hidden
from the user most of the time which
says okay I'm starting the answer so the
model is trained to do this two-stage
process on its own if you use a similar
model in say OpenAI, OpenAI's user
interface is trying to summarize this
process for you nicely by kind of
showing the sections that the model is
doing and it'll kind of click through
it'll say breaking down the problem
making
calculations cleaning the result and then
the answer will come for something like
open AI maybe it's useful here to go
through like an example of a DeepSeek
R1 reasoning chain yeah so if you're
looking at the screen here what you'll
see is a screenshot of the deep seek
chat app and at the top is thought for
1517 seconds with the drop-down arrow
underneath that if we were in an app
that we were running the drop-down
arrow would have the reasoning so in
this case uh the question the specific
question which you know
I'm philosophically slash pothead inclined
so this is asking DeepSeek
R1 for one truly novel insight about
humans and it reveals the reasoning and
basically the truly novel aspect
was pushing the reasoning to constantly
sort of the model asking itself is this
truly novel so it's actually challenging
itself to be more novel more
counterintuitive uh more uh less cringe
I suppose so some of the reasoning says
uh this is just snapshots alternatively
humans have a unique meta emotion where
they feel emotions about their own
emotions like feeling guilty about being
angry this recursive emotional layering
creates complex motivational drives that
don't exist in other animals the insight
is that human emotions are nested so
it's like it's reasoning through how
humans feel emotions it's reasoning
about meta emotions it's going to have pages
and pages of this it's almost too much to
actually read but it's nice to skim as
it's coming it's a James
Joyce stream of consciousness and then
it goes wait the user wants something
that's not seen anywhere else let me dig
deeper and consider the human ability to
hold contradictory beliefs
simultaneously cognitive dissonance is
known but perhaps the function is to
allow flexible adaptation so on and so
forth I mean that really captures the
public imagination that holy this
is
I mean intelligent slash almost like
an inkling of sentience because
you're thinking through you're
self-reflecting you're deliberating and
the final result of that after 157
seconds is humans instinctively convert
selfish desires into Cooperative systems
by collectively pretending abstract
rules money laws rights are real these
shared hallucinations act as quote games
where competition is secretly redirected
to benefit the group turning conflict
into society's fuel
pretty profound I mean you know this is
a digression but a lot of people have
found that these reasoning models can
sometimes produce much more eloquent
text that's at least an interesting example
I think depending on how open minded you
are you find language models interesting
or not and there's a spectrum there well
I mean we'll talk about
different benchmarks and so on but some of it
is just a vibe like that in itself is
let's say a quote fire tweet yeah if
I'm trying to produce something
where people are like oh
okay so that's chain of thought we'll probably
return to it
more how are they able to achieve such
low cost on the training and the
inference maybe you could talk about the
training first yeah so there's
two main techniques that they
implemented that are probably the
majority of their efficiency and then
there's a lot of implementation details
that maybe we'll gloss over or get into
later that sort of contribute to it but
those two main things are one is they
went to a mixture of experts model
which we'll define in a second and
then the other thing is that they
invented this new technique called MLA
latent attention both of these are
big deals mixture of experts is
something that's been in the literature
for a handful of years and OpenAI with
GPT-4 was the first one to productize a
mixture of experts model and what this
means is when you look at the common
models around that most people have
been able to interact with that are open
right think Llama Llama is a dense model
I.E every single parameter or neuron is
activated as you're going through the
model for every single token you
generate right now with a mixture of
experts model you don't do that right
how does a human actually work right
it's like oh well my visual cortex is
active when I'm thinking about a
vision task and other
things right my amygdala is active when I'm
scared right these different aspects of
your brain are focused on different
things a mixture of experts model
attempts to approximate this to some
extent it's nowhere close to what a
brain architecture is but different
portions of the model activate right
you'll have a set number of experts in
the model and a set number that are
activated each time and this
dramatically reduces both your training
and inference cost because now you're
you know if you think about the
parameter count as the sort of total
embedding space for all of this
knowledge that you're compressing down
during
training when you're embedding this data
in instead of having to activate every
single parameter every single time
you're training or running inference now
you can just activate a subset and the
model will learn which expert to route
to for different tasks and so this is a
humongous innovation in terms of hey I
can continue to grow the total embedding
space of parameters and so DeepSeek's
model is you know 600 something billion
parameters right relative to Llama
405B it's 405 billion parameters right
and relative to Llama 70B it's 70
billion parameters right so this model
technically has more embedding space for
information right to compress all of the
world's knowledge that's on the internet
down but at the same time it is only
activating around 37 billion of the
parameters so only 37 billion of these
parameters actually need to be computed
every single time you're training on data
or inferencing data out of it and so
versus again the Llama model 70
billion parameters must be activated or
405 billion parameters must be activated
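A minimal sketch of that routing (top-k gating over a handful of experts; the toy weights and sizes are made up, not DeepSeek's architecture, which uses many more experts plus other refinements):

```python
import math, random
random.seed(0)

# Toy mixture-of-experts forward pass: a router scores each expert for a
# token and only the top-k experts run, so most parameters stay idle,
# analogous to activating ~37B of the 600B+ total parameters per token.
NUM_EXPERTS, TOP_K, DIM = 8, 2, 4

# Each "expert" is a tiny map; here just a random weight vector per expert.
experts = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
router = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def moe_forward(token):
    # Router produces a score per expert for this token.
    scores = [sum(w * x for w, x in zip(router[e], token)) for e in range(NUM_EXPERTS)]
    # Keep only the top-k experts (2 of 8 here, so 75% of expert params idle).
    top = sorted(range(NUM_EXPERTS), key=lambda e: -scores[e])[:TOP_K]
    # Softmax over just the selected experts' scores gives mixing weights.
    exp = [math.exp(scores[e]) for e in top]
    gates = [v / sum(exp) for v in exp]
    # Output is the gated sum of only the active experts' outputs.
    out = sum(g * sum(w * x for w, x in zip(experts[e], token))
              for g, e in zip(gates, top))
    return out, top

output, active = moe_forward([0.5, -0.2, 0.1, 0.9])
print(len(active))  # -> 2: only TOP_K of the 8 experts were computed
```

The router itself is learned during training, which is how the model figures out which expert to send each token to.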
so you've dramatically reduced your
compute cost when you're doing training
and inference with this mixture of
experts architecture so should we break down
where it actually applies and go into
the transformer is that useful
let's go into the transformer the
Transformer is a thing that is talked
about a lot and we will not cover every
detail uh essentially the Transformer is
built on repeated blocks of this
attention mechanism and then a
traditional dense fully connected
multi-layer perceptron whatever word you
want to use for your normal neural
network and you alternate these blocks
there's other details and where mixture
of experts is applied is that this dense
model this dense part holds most of the
weights if you count them in a
transformer model so you can get really
big gains from mixture of experts
on parameter efficiency at training and
inference because you get this
efficiency by not activating all of
these parameters we should also say that
a Transformer is a giant neural network
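Since the dense MLP part holds most of a transformer block's weights, a rough parameter count shows why the MoE savings are so large (generic textbook shapes, not a specific model's dimensions):

```python
# Rough parameter count for one transformer block, illustrating the claim
# that the dense MLP holds most of the weights. The dimensions follow the
# standard architecture and are illustrative only.
d = 4096               # hidden (model) dimension
attn = 4 * d * d       # Q, K, V and output projections: four d x d matrices
mlp = 2 * d * (4 * d)  # up- and down-projection with a 4d intermediate size
total = attn + mlp
print(mlp / total)  # -> 0.666...: the MLP is about 2/3 of block parameters
```

Because mixture of experts replaces exactly this MLP with routed experts of which only a few run per token, most of each block's parameters can sit idle on any given token.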
yeah and then for 15 years now
there's what's called the deep learning
revolution networks have gotten larger and
larger and at a certain point the
scaling laws appeared where people
realized this is a scaling laws shirt by
the way representing scaling laws where
it became more and more formalized that
bigger is better across multiple
dimensions of what bigger means so
but these are all sorts of neural
networks we're talking about and we're
talking about different architectures of
how to construct these neural
networks such that the training and the
inference on them is super efficient
yeah every different type of model has a
different scaling law for it which
is effectively for how much compute you
put in the architecture will get to
different levels of performance at test
tasks a mixture of experts is one of the
ones at training time even if you don't
consider the inference benefits which
are also big at training time your
efficiency with your gpus is
dramatically improved by using this
architecture if it is well implemented
so you can get effectively the same
performance model in evaluation scores
with numbers like 30% less compute I
think there's going to be a wide
variation depending on your
implementation details and stuff but it
is just important to realize that this
type of technical Innovation is
something that gives huge gains and I
expect most companies that are serving
their models to move to this mixture of
experts implementation historically the
reason why not everyone might do it is
because of implementation complexity
especially when doing these big models
so this is one of the things
DeepSeek gets credit for is they do this
extremely well they do mixture of
experts extremely well this
architecture for what is called
DeepSeekMoE, MoE is the shortened version of
mixture of experts, is multiple papers
old this part of their training
infrastructure is not new to these
models alone and same goes for what
Dylan mentioned with multi-head latent
attention this is all about reducing
memory usage during inference and same
things during training by using some
fancy low rank approximation math if you
get into the details with this latent
attention it's one of those things I
look at and it's like okay they're
doing really complex implementations
because there's other parts of the language
model such as embeddings that are
used to extend the context length the
common one that DeepSeek used is rotary
positional embeddings which is called
RoPE and if you want to use RoPE with a
normal MoE it's kind of a sequential
thing you take two of the
attention matrices and you rotate them
by a complex-valued rotation which is a
matrix multiplication with DeepSeek MLA
with this new attention architecture
they need to do some clever things
because they're not set up the same and
it just makes the implementation
complexity much higher so they're
managing all of these things and these
are probably the sort of things that
OpenAI these closed labs are doing we
don't know if they're doing the exact
same techniques but they actually shared
them with the world which is really nice
like this is the cutting edge of
efficient language model training and
some of this requires low-level
engineering it's just a giant mess and
trickery so as I understand they went
below CUDA so they do super low-level
programming of GPUs effectively NVIDIA
builds this library called NCCL right
in which you know when you're
training a model you have all these
communications between every single
layer of the model and you may have over
100 layers what does NCCL stand for
it's NCCL, NVIDIA Collective
Communications Library nice
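What those collectives do can be shown in pure Python (a semantic sketch only; NCCL's actual value is performing these operations across GPUs over NVLink and InfiniBand with optimized ring or tree algorithms):

```python
# Toy sketch of the all-reduce and all-gather collectives NCCL provides:
# every worker contributes a value and every worker ends up with the
# combined result.
def all_reduce(per_gpu_grads):
    """Each GPU holds local gradients; after all-reduce, every GPU holds
    the identical elementwise global sum."""
    total = [sum(vals) for vals in zip(*per_gpu_grads)]
    return [list(total) for _ in per_gpu_grads]  # same answer everywhere

def all_gather(per_gpu_chunks):
    """Every GPU ends up with the concatenation of all GPUs' chunks."""
    gathered = [x for chunk in per_gpu_chunks for x in chunk]
    return [list(gathered) for _ in per_gpu_chunks]

# 4 "GPUs", each with its own local gradients for 3 parameters.
grads = [[1, 2, 3], [1, 1, 1], [0, 2, 0], [2, 0, 1]]
print(all_reduce(grads)[0])  # -> [4, 5, 5] on every rank

chunks = [[10], [11], [12], [13]]  # each rank holds one activation shard
print(all_gather(chunks)[0])       # -> [10, 11, 12, 13] on every rank
```

In real training these calls happen between layers, every step, which is why scheduling them well matters so much for GPU utilization.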
D when when you're training a model
right you're going to have all these
all-reduces and all-gathers right uh between
each layer between the multi-layer
perceptron or feed-forward network and
the attention mechanism you'll have
basically the model
synchronized right um or you'll have
all-reduce and all-gather um and this is a
communication between all the gpus in
the network whether it's in
training or inference so Nvidia has a
standard Library this is one of the
reasons why it's really difficult to use
anyone else's Hardware uh for training
is because no one's really built a
standard Communications Library um and
and nvidia's done this at a sort of
higher level right deep seek because
they have certain limitations around the
gpus that they have access to the
interconnects are limited to some extent
um by the restrictions of the gpus that
were shipped into China legally not the
ones that are smuggled but legally
shipped in uh that they used to train
this model they had to figure out how to
get efficiencies right and one of those
things is that instead of just calling
the Nvidia library NCCL right they
instead scheduled
their own communications which
some of the labs do right um
meta talked about in llama 3 how they
made their own custom version of NCCL
they didn't talk
about the implementation details but this is
some of what they did probably not as
well as deep seek
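The all-reduce pattern being discussed can be sketched as a toy ring all-reduce, the bandwidth-efficient scheme these collectives libraries implement. This is an illustrative simulation, not how NCCL or any lab's library is actually written:

```python
import numpy as np

def ring_all_reduce(shards):
    """Toy ring all-reduce: each of n workers starts with its own gradient
    buffer and ends with the elementwise sum of all buffers, exchanging
    only one chunk with its ring neighbor per step."""
    n = len(shards)
    # each worker splits its buffer into n chunks (same boundaries everywhere)
    chunks = [np.array_split(s.astype(float), n) for s in shards]
    # phase 1, reduce-scatter: after n-1 steps, worker w holds the fully
    # summed chunk (w + 1) % n
    for step in range(n - 1):
        sends = [(w, (w - step) % n, chunks[w][(w - step) % n].copy())
                 for w in range(n)]            # snapshot: all sends happen at once
        for w, idx, data in sends:
            chunks[(w + 1) % n][idx] += data   # ring neighbor accumulates
    # phase 2, all-gather: circulate the finished chunks around the ring
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n, chunks[w][(w + 1 - step) % n].copy())
                 for w in range(n)]
        for w, idx, data in sends:
            chunks[(w + 1) % n][idx] = data    # neighbor copies the result
    return [np.concatenate(c) for c in chunks]

grads = [np.ones(12) * (w + 1) for w in range(4)]  # 4 workers with grads 1..4
out = ring_all_reduce(grads)                       # every worker ends with 10s
```

Each worker only ever talks to its ring neighbor, which is why interconnect bandwidth, the thing cut on export-restricted GPUs, is the binding constraint.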
Because deep seek you know necessity is
the mother of innovation and they had to
do this whereas in the case of you know
open AI they have people that do this sort of
stuff anthropic Etc uh but you know deep
seek certainly did it publicly and they
may have done it even better because
they were gimped on a certain aspect of
the chips that they have access to and
so they
scheduled
communications um you know by scheduling
specific SMs SMs you could think of as
like the cores on a GPU right there's
a bit over 100 SMs on a GPU and
they were specifically scheduling hey
which ones are running the model which
ones are doing all reduce which one are
doing all gather right and they would
flip back and forth between them and
this requires extremely low-level
programming this is what nickel does
automatically or other Nvidia libraries
handle this automatically usually yeah
exactly and so technically they're
using you know PTX which you could
think of as sort of like an
assembly-type language or instruction
set right like coding directly to
assembly it's not exactly that
but it's still technically part of
Cuda but it's like do I want to write in
Python you know pytorch equivalent and
call Nvidia libraries do I want to go
down to the Cuda level right or uh you
know and code even lower level or do I
want to go all the way down to the
assembly or ISA level and there are
cases where you go all the way down
there at the very big Labs but most
companies just do not do that right
because it's a waste of time and the
efficiency gains you get are not worth
it but deep seeks implementation is so
complex right especially with their
mixture of experts right um people have
done mixture of experts but they're
generally 8 or 16 experts right and they
activate two so one of the words
we like to use is sparsity
factor right or usage right so you
might have one fourth of
your model activate right and
that's what Mistral's Mixtral model did
right uh their model that really
catapulted them to like oh my God
they're really really good um OpenAI has
also had models that are MoE and so
have all the other labs that are major
closed but what deep seek did that maybe
only the leading labs have only just
started recently doing is have such a
high sparsity factor right it's not 1/4
of the model right two out of eight
experts activating every time you go
through the model it's eight out of 256
and there's different implementations
for mixture of experts where you can
have some of these experts that are
always activated which just looks
like a small neural network and all the
tokens go through that and then they
also go through some that are selected
by this routing mechanism and one of the
Innovations in deep seeks architecture
is that they change the routing
mechanism in mixture of expert models
there's something called an auxiliary
loss which effectively means during
training you want to make sure that all
of these experts are used across the
tasks that the model sees why there can
be failure in mixture of experts is that
when you're doing this training the one
objective is token prediction accuracy
and if you just let training go with a
mixture of experts model on its own it
can be that the model learns to only use
a subset of the experts and in the MoE
literature there's something called the
auxiliary loss which helps balance them
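A minimal sketch of such a balancing term, in the Switch-Transformer style; exact formulations vary across the MoE literature and this one is illustrative, not necessarily what any particular lab uses:

```python
import numpy as np

def load_balancing_aux_loss(gates, topk):
    """Minimal auxiliary load-balancing loss sketch: penalize routing
    distributions where some experts get most of the tokens.

    gates: (tokens, n_experts) softmax router probabilities
    topk:  (tokens, k) chosen expert ids
    Loss is n_experts * sum_e f_e * p_e, minimized when usage is uniform.
    """
    n_experts = gates.shape[1]
    # f_e: fraction of token-slots dispatched to expert e
    f = np.bincount(topk.ravel(), minlength=n_experts) / topk.size
    # p_e: mean router probability assigned to expert e
    p = gates.mean(axis=0)
    return n_experts * float(f @ p)

# balanced routing scores lower than collapsed routing
even = np.full((8, 4), 0.25)                          # uniform gates, 4 experts
collapsed = np.tile([0.97, 0.01, 0.01, 0.01], (8, 1))
even_ids = np.tile([[0], [1], [2], [3]], (2, 1))      # each expert used equally
collapsed_ids = np.zeros((8, 1), dtype=int)           # everything to expert 0
```

Adding this term to the training objective pulls usage toward uniform, which is exactly the tension with pure token-prediction accuracy discussed next.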
but if you think about the loss
functions of deep learning this even
connects to the bitter lesson is that
you want to have the minimum inductive
bias in your model to let the model
learn maximally and this auxiliary loss
this balancing across experts could be
seen as in tension with the prediction
accuracy of the tokens so we don't know
the exact extent of the deep seek
change which is instead of doing an
auxiliary loss they have an extra
parameter in their routing which after
each batch they update
to make sure that the next batches all
have a similar use of experts and this
type of change can be big it can be
small but they add up over time and this
is the sort of thing that just points to
them innovating and I'm sure all the
labs that are training big models are looking
at this sort of thing which is getting
away from the auxiliary loss some of
them might already use it but you just
keep you keep accumulating gains and
we'll talk about the philosophy of
training and how you organize these
organizations and a lot of it is just
compounding small improvements over time
in your data in your architecture and
your post-training and how they
integrate with each other deep seek does
the same thing and some of them are
shared or a lot of them we have to take on
face value that they share their most
important details I mean the
architecture and the weights are out
there so we're seeing what they're doing
and it adds up going back to sort of the
like efficiency and complexity point
right it's 32 versus 4 right for
like Mixtral and other MoE models that
have been publicly released so this
ratio is extremely high and sort of what
Nathan was getting at there was when you
have such a different level of sparsity
um you can't just have every GPU have
the entire model right the model's too
big there's too much complexity there so
you have to split up the model um with
different types of parallelism right and
so you might have different experts on
different GPU nodes but now what
happens when this set of
data that you get hey all of it looks
one way and all of it should
route to one part of my model
right so when all of it
routes to one part of the model then you
can have this
overloading of a certain set of the
GPU resources or a certain set of the gpus
and then the rest of the training
network sits idle because all of the
tokens are just routing to that so this
is the biggest complexity one of the big
complexities with running a very you
know sparse mixture of experts model uh
I.E you know this 32 ratio versus this
four ratio is that you end up with so
many of the experts just sitting there
idle so how do I load balance between
them how do I schedule the
communications between them this is a
lot of the extremely low-level
detailed work that they figured out in
the public first and potentially
second or third in the world and maybe
even first in some cases what uh lesson
do you uh in the direction of the bitter
lesson take from all of this
where is this going to be the direction
where a lot of the gain is going to be
which is this kind of low-level
optimization or is this a short-term
thing where the biggest gains will be
more on the algorithmic high-level side
of like post-training is this like a
short-term leap because they figured out
a hack because constraint
necessity is the mother of invention or
is there still a lot of gains I think
we should summarize what the bitter
lesson actually is about the bitter
lesson essentially if you paraphrase it
is that the types of training that will
win out in deep learning as we go are
those methods which are scalable in
learning and search is what it calls out
and the scale word gets a lot of
attention in this the interpretation
that I use is effectively
to avoid adding the human priors to your
learning process and if you read the
original essay this is what it talks
about is how researchers will try to
come up with clever solutions to their
specific problem that might get them
small gains in the short term while
simply enabling these deep Learning
Systems to work efficiently and for
these bigger problems in the long term
might be more likely to scale and
continue to drive success
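The routing change described a moment ago, an extra bias updated between batches instead of an auxiliary loss, can be sketched roughly like this. The exact update rule below is a simplified assumption, not DeepSeek's published one:

```python
import numpy as np

def route_with_bias(logits, bias, k=2, lr=0.01):
    """Auxiliary-loss-free balancing sketch: a per-expert bias added to
    the router logits is nudged after every batch so over-used experts
    get less traffic on the next batch. The bias affects expert
    *selection* only, not the loss being optimized."""
    n_experts = logits.shape[1]
    chosen = np.argsort(logits + bias, axis=-1)[:, -k:]      # top-k with bias
    load = np.bincount(chosen.ravel(), minlength=n_experts)  # tokens per expert
    target = chosen.size / n_experts                         # perfectly even load
    new_bias = bias - lr * np.sign(load - target)            # push toward even
    return chosen, new_bias

rng = np.random.default_rng(0)
bias = np.zeros(8)
for _ in range(200):  # simulate batches with a router skewed toward expert 0
    logits = rng.normal(size=(64, 8))
    logits[:, 0] += 2.0
    bias = route_with_bias(logits, bias)[1]
```

After the batches, the over-favored expert carries a negative bias, so usage evens out without adding any term to the training objective, the kind of minimal inductive bias the bitter-lesson framing favors.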
and so we were talking about
relatively small implementation changes
to the mixture of experts model and
it's like okay we will
need a few more years to know if one of
these was actually really crucial to
the bitter lesson but the bitter lesson
is really this long-term Arc of how
Simplicity can often win and there's a
lot of sayings in the industry like the
models just want to learn you have to
give them the simple loss landscape
where you put compute through the model
and they will learn and get barriers
out of the way that's where the
power of something like NCCL comes in
where standardized code could be
used by a lot of people to create
simple innovations that can scale
which is why the hacks the I imagine the
code base for deep seek is probably a
giant mess I'm sure deep seek
definitely has code bases that are
extremely messy where they're testing
these new ideas multi-head latent
attention probably could start in
something like a Jupyter notebook or
somebody tries something on a few gpus
and that is really messy but the stuff
that trains deep seek V3 and deep seek
R1 those libraries if you were to
present them to us I would guess are
extremely high quality code high quality
readable code I think there is one
aspect to note though right is that
there is the generalizability for
that to transfer across different types
of runs right you may make really
high quality code for one specific model
architecture at one size
and then that is not transferable to hey
when I make this architecture tweak
everything's broken again right
that's something that could be
you know their specific
low-level coding of like scheduling SMs is
specific to this model architecture and
size right whereas nvidia's
collectives library is more like hey
it'll work for anything right you want
to do an all reduce great I don't care
what your model architecture is it'll
work uh and you're giving up a lot of
performance when you do that uh in many
cases but it's worth it for them to do
the specific uh optimization for the
specific run given the constraints that
they have regarding compute I wonder how
stressful it is to like initiate
training on these frontier models
to have the
code to push the button that you're
now spending a large amount of money and
time to train this there must
be a lot of innovation
on the debugging stage of making
sure there's no issues that you're
monitoring and visualizing every aspect
of the training all that kind of stuff
when when people are training they have
all these various dashboards but like
the most simple one is your loss right
uh and it continues to go down but in
reality especially with more complicated
stuff like MoE the biggest problem with it
or FP8 training which is another
innovation you know going to a lower
precision number format I.E less
accurate is that you end up with loss
spikes right and no one knows why the
loss spikes happen
some of
them are data I'll give AI2's example of
what blew up our earlier models it's a
subreddit called microwave gang we love
to shout this out it's a real thing you
can pull up microwave gang essentially
it's a subreddit where everybody makes
posts that are just the letter M so
there's extremely long sequences
of the letter M and then the comments
are like beep beep because that's when
the microwave ends but if you pass this
into a model that's trained to
produce normal text it's extremely
you don't predict M's for a long time so
like this is something that causes a l
spikes for us but when you have much
like this is this is old this is not
recent and when you have more mature
Data Systems that's not the thing that
causes the LW Spike and what Dylan is
saying is true but it's like it's it's
levels to this sort of idea with regards
to the stress right these people are
like you know you'll go out to dinner
with like a friend that works at one of
these labs and they'll
just be looking at their phone
every 10 minutes and
it's one thing if they're
texting but they're just like is
the loss okay
tokens per second loss not blown
up they're just watching this
and the heart rate goes up if there's a
spike and some level of spikes is normal
right it'll it'll recover and be back
sometimes a lot of the old strategy was
like you just stop the run restart from
the old version and then like change the
data mix and then it keeps going there
are even different types of spikes so
Dirk Groeneveld has a theory at AI2 that's like
fast spikes and slow spikes where there
are sometimes where you're looking at
the loss and other parameters you
can see it start to creep up and then
blow up and that's really hard to
recover from so you have to go back much
further so you have the stressful period
where it's like flat or might start
going up and you're like what do I do
whereas there are also loss spikes where
it looks good and then there's one spiky
data point and what you can do is you
just skip those you see that there's
a spike you're like okay I can ignore
this data don't update the model and do
the next one and it'll recover quickly
but these are trickier
implementations so as you get more
complex in your architecture and you
scale up to more gpus you have more
potential for your loss blowing up
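The skip-the-bad-batch strategy just described can be sketched in a few lines. The threshold and window here are illustrative choices, not any lab's actual values:

```python
import numpy as np

def spike_guard(loss, history, factor=3.0):
    """Loss-spike guard sketch: if a batch's loss spikes far beyond the
    recent running mean, skip the update for that batch instead of
    letting it wreck the weights; otherwise record it and update."""
    if len(history) >= 10 and loss > factor * np.mean(history[-10:]):
        return False               # spike: drop the batch, no weight update
    history.append(loss)
    return True                    # normal batch: apply the update

history = []
losses = [1.0] * 20 + [9.0] + [1.0] * 5   # one spiky data point at step 20
applied = [spike_guard(l, history) for l in losses]
```

Here the lone 9.0 batch is skipped and training continues on the next batch, which handles the fast, recoverable spikes; the slow, creeping kind requires rolling back much further, as described above.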
it's like there there's and there's a
distribution the whole idea of grocking
also comes in right it's like just
because it slowed down from improving
and loss doesn't mean it's not learning
because all of a sudden it could be like
this and it could just Spike down and
loss again because it learned truly
learned something right uh and it took
some time for it to learn that it's not
like a gradual process right and that's
that's what humans are like that's what
models are like so it's it's really a
stressful task as you mentioned and the
whole time the the the dollar count is
going up every company has failed runs
you need failed runs to push the
envelope on your infrastructure so a lot
of news Cycles are made of X company had
y failed run every company that's trying
to push the frontier of AI has these so
is yes it's noteworthy because it's a
lot of money and it can be a week-to-a-month
setback but it is part of the process
but how do you get if you're deep seek
how do you get to a place where holy
there's a successful combination of
hyperparameters a lot of small failed
runs and so rapid iteration
through failed runs until successful
ones and then you build up
intuition like this mixture of experts
works and then this implementation of
MLA works key hyperparameters like
learning rate and
regularization and things like this and
you find the regime that works for your
code base from talking to people at
Frontier Labs there's a story that you
can tell where training language models
is kind of a path that you need to
follow so you need to like unlock the
ability to train a certain type of model
or a certain scale and then your code
base and your internal knoow what type
of parameters work for it is kind of
known and you look at the Deep seek
papers and models they' they've scaled
up they've added complexity and it's
just continuing to build the
capabilities that they have there
there's the concept of a YOLO run um so
YOLO you only live once um and and what
it is is like you know there's there's
there's all this experimentation you do
at the small scale right uh research
ablations right like you have your
jupyter notebook whether you're
experimenting with MLA on like three
gpus or whatever um and you're doing all
these different things like hey do I
do four active experts 128
experts do I arrange the experts this
way you know all these different uh
model architecture things you're testing
at a very small scale right couple
researchers few gpus tens of gpus
hundreds of gpus whatever it is and then
all of a sudden you're like okay guys no
more no more around right uh no
more screwing around everyone take all
the resources we have let's pick what we
think will work and just go for it right
YOLO and this is where that sort of
stress comes in is like well I know it
works here but some things that work
here don't work here and some things
that work here don't work down here
right in terms of scale right so
it's it's it's really truly a YOLO run
and and sort of like there is this like
like discussion of like certain
researchers just have like this
methodical nature like they can find the
whole search space and like figure out
all the ablations of different research
and really see what is best and there's
certain researchers who just kind of
like you know have that innate gut
instinct of like this is the YOLO run
like you know looking at the data this
is it this is why you want to work in
post training because the GPU cost for
training is lower so you can make a
higher percentage of your training runs
YOLO runs yeah for now yeah for
now for for now so some of this is
fundamentally luck still luck is skill
right in many cases yeah I mean it looks
lucky right but the hill to
climb if you're in one of these labs and
you have an evaluation you're not
crushing there's a repeated Playbook of
how you improve things there are
localized improvements which might be
data improvements and these add up into
the whole model just being much better
and when you zoom in really close it can
be really obvious that this model is
just really bad at this thing and we can
fix it and you just add these up so
some of it feels like luck but on the
ground especially with these new
reasoning models there's
just so many ways that we can poke
around and normally it's that some of
them give big improvements the search
space is near infinite right and and yet
the amount of computing time you have is
is very low
and you're you're you have to hit
release schedules you have to not get
blown past by everyone otherwise you
know what happened with deep seek you
know crushing meta and Mistral and Cohere
and all these guys they moved too slow
right they maybe were too
methodical I don't know they didn't hit
the YOLO run whatever the reason was
maybe they weren't as skilled uh
whatever you know you can call it
luck if you want but at the end of the
day it's skill so 2025 is the year of
the YOLO run it seems like all the labs
are like going in I I I think it's even
more impressive what OpenAI did in 2022
right at the time no one believed in
mixture of experts models right at
Google uh who had all the researchers
OpenAI had such little compute and
they devoted all of their compute for
many months right all of it 100% for
many months to gp4 with a brand new
architecture with no belief that hey let
me spend a couple hundred million
dollars which is all of the money I have
on this model right that is truly YOLO
yeah right now now you know people like
all these like training run failures
that are in the media right it's like
okay great but actually a huge
chunk of my gpus are doing inference I
still have a bunch doing research
constantly and yes my biggest cluster is
training but like on on this YOLO run
but like that YOLO run is much less
risky than what OpenAI did in
2022 or maybe what deep seek did now or
you know like sort of like hey we're
just going to throw everything at it the
big winners throughout human history are
the ones who are willing to do YOLO at
some point okay uh what do we understand
about the hardware it's been trained on
deep seek deep seek is very interesting
take a second to zoom out into
who they are first of all right
highflyer is a hedge fund that has
historically done quantitative trading
in China as well as elsewhere and they
have always had a significant number of
gpus right in the past a lot of these
high frequency trading algorithmic Quant
Traders used fpgas uh but it shifted to
gpus definitely and there's both right
but gpus especially and
highflyer which is the hedge fund that
owns deep seek and everyone who works
for deep seek is part of highflyer to
some extent right uh it's same same
parent company same owner same CEO they
had all these resources and
infrastructure for trading and then they
devoted a humongous portion of them to
training models uh both language models
and otherwise right because these
techniques were heavily AI
influenced um you know more recently
people have you know realized hey
trading with um you know like even even
when you you go back to like Renaissance
and all these all these like
quantitative firms natural language
processing is the key to like trading
really fast right understanding a press
release uh and making the right trade
right and so deep seek has always been
really good at this and even as far back
as 2021 they they have press releases
and papers saying like Hey we're the
first company in China with an a100
cluster this large 10,000 a100
gpus right this is in 2021 now
this wasn't all for training you know
large language models this was mostly
for training models for their
quantitative trading as well as you know
a lot of that was natural language
processing to be clear right um and so
this is the sort of History right so
verifiable fact is that in 2021 they
built the largest Chinese cluster at
least they claim it was the largest
cluster in China 10,000 gpus before
export controls started yeah it's like
they've had a huge cluster before any
conversation of export controls so then
you step it forward to like what have
they done over the last four years since
then right um obviously they've
continued to operate the hedge fund
probably make tons of money and the
other thing is that they've leaned more
and more and more into AI the CEO
Liang Wenfeng uh you're now putting me on the spot
on this we discussed this Liang Wenfeng right
the CEO he
owns maybe a little bit more than half
the company allegedly right um is an
extremely like Elon Jensen kind of
figure where he's just like involved in
everything right um and so over that
time period he's gotten really in-depth
into AI he actually has
if you see some of his
statements a bit of an e/acc vibe almost
right total AGI vibes like we need to do
this we need to make a new ecosystem of
open AI we need China to lead on this
sort of ecosystem because historically
the Western countries have led on on
software ecosystems and he straight up
acknowledges like in order to do this we
need to do something different deep seek
is his way of doing this some of the
translated interviews with him are so he
has done interviews yeah you think he
would do a western interview or no or are
there controls on that there hasn't been one
yet but okay I would try it well I just
got a Chinese translator so it's great
this is all push um so
fascinating figure engineer pushing full
on into AI leveraging the success from
The High Frequency trading very direct
quotes like we will not switch to closed
Source when ask about this stuff very
long-term motivated in how the ecosystem
of AI should work and I think from a
Chinese perspective he wants the Chinese
company a Chinese company to build this
vision and so this is sort of like the
quote unquote Visionary behind the
company right this hedge fund still
exists right this this quantitative firm
and so
slowly he turned to this
full view of like AI everything about
this right at some point it slowly
maneuvered and he made deep seek um and
deep seek has done multiple models since
then they've acquired more and more gpus
they share infrastructure with the fund
right um and so you know there is no
exact number of public GPU resources
that they have but besides this 10,000
gpus that they bought in 2021 right and
they were fantastically profitable right
and then this paper claims they used only
2,000 h800 gpus which are a restricted GPU
that was previously allowed in China but
no longer allowed and there's a new
version but it's basically nvidia's h100
for China right um and there's some
restrictions on it specifically around
the communication speed
the interconnect speed right which
is why they had to do this crazy SM you
know scheduling stuff right so going
back to that right this is
obviously not true in terms of their
total GPU count of available gpus
but for this training run you think
2,000 is the correct number or no so
this is where it takes um you know
significant amount of sort of like
zoning in right like what do you call
your training run right do you count all
of the research and ablations that you
ran right picking all the stuff because
yes you can do a YOLO run but at some
level you have to do the test at the
small scale and then you have to do some
test at medium scale before you go to a
large scale accepted practice is that
for any given model that is a notable
advancement you're going to do 2 to 4X
compute of the full training run in
experiments alone so a lot of this compute
that's being scaled up is probably used
in large part at this time for research
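That rule of thumb makes the accounting concrete: the headline pre-training number understates total compute by several times. The figures below just restate the multiplier from the discussion, they are not DeepSeek's actual numbers:

```python
# Rule of thumb stated above: a notable model takes roughly 2-4x the final
# training run's compute in research and ablations before the real run.
experiment_multiplier = (2, 4)
total_over_headline = [1 + m for m in experiment_multiplier]
print(total_over_headline)   # [3, 5]: true cost is 3-5x the headline run
```

So a cost quoted for "only the pre-training" can hide most of the real spend, which is exactly the caveat raised about the 2,000-GPU figure.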
yeah and research you know research
begets the new ideas that let you get
huge efficiency gains research gets you o1
research gets you breakthroughs
then you need to bet on it so some of
the pricing strategy they will discuss
has the research baked into the price so
the numbers that deep seek specifically
said publicly right are just the 10,000
gpus in 2021 and then 2,000 gpus for
only the pre-training for V3 they did
not discuss cost on R1 they did not
discuss cost on all the other RL right
for the instruct model that they made
right they only discussed the
pre-training for the base model and they
did not discuss anything on research and
ablations and they do not talk about any
of the resources that are shared in
terms of hey the fund is using all these
gpus right and we know that they're
very profitable and had that 10,000 gpus
in 2021 so some of the
research that we've found is that we
actually believe they have closer to
50,000 gpus the we is SemiAnalysis so we should say
that you're uh sort of one of the world
experts in figuring out what everybody's
doing in terms of the Semiconductor in
terms of cluster build outs in terms of
like who's doing what in terms of
training runs so yeah so that's the we
okay go ahead yeah sorry um we
believe they actually have something
closer to 50,000 gpus right now this is
this is split across many tasks right
again the fund um research and ablations
for ballpark how much would open AI or
anthropic have I think the clearest
example we have because meta is also
open they talk about like order of 60k
to 100K h100-equivalent gpus in their
training clusters right so like llama
3 they said they trained on 16,000 h100s
right but the company of meta last year
publicly disclosed they bought like 400
something thousand gpus yeah right so
of course a tiny percentage is on the
training again most of it is like
serving me the best Instagram reels
right um or whatever right I mean we
could get into a cost of like what is
the cost of ownership for a 2,000 GPU
cluster 10,000 like the there's just
different sizes of companies that can
afford these things and deep seek is
reasonably big their compute allocation
is one of the top few in the
world it's not open AI anthropic Etc but
they have a lot of compute can you in
general actually just zoom out and also
talk about the the hopper architecture
the Nvidia Hopper GPU architecture and
the difference between h100 and h800
like you mentioned the interconnects
yeah so there's you know Ampere was the
a100 and then h100 Hopper right people
use them synonymously in the US because
really there's just h100 and now there's
h200 right but same thing uh mostly in
China there have been
different salvos of export restrictions
so initially the US government limited
on a two-factor scale right which is
chip interconnect versus flops right
so any chip that had interconnect
bandwidth above a certain level and
floating point operations above
a certain level was restricted uh later
the government realized that this was a
flaw in the restriction and they cut it
down to just floating Point operations
and so um the h800 had high flops low
communication exactly so the h800 was
the same performance as h100 on flops
right but it just had the
interconnect bandwidth cut deep seek
knew how to utilize this you know hey
even though we were cut back on the
interconnect we can do all this fancy
stuff to figure out how to use the GPU
fully anyways right and and so that was
back in October 2022 but uh later in
2023 end of 2023 implemented in 2024 the
US government banned the h800 right um
and so by the way this h800 cluster
these 2,000 gpus was not even purchased
in 2024 right it was purchased in late 2023
um and they're just getting the model
out now right because it takes a lot of
research Etc um h800 was banned and now
there's a new chip called the H20 uh the
H20 is uh cut back on only flops but the
interconnect bandwidth is the same and
in fact in some ways it's better than
the h100 because it has better memory
bandwidth and memory capacity so
you know Nvidia is working within
the constraints of what the government
sets and then builds the best
possible GPU for China can we take this
actual tangent and we'll return back to
the hardware is the the philosophy the
the motivation the case for export
controls what is it uh Dario Amodei just
published a blog post about export
controls the case he makes is that if AI
becomes super powerful and he says by
2026 we'll have AGI or super powerful Ai
and whoever builds that will have a
significant military advantage and so
because the United States is is a
democracy and as he says China is uh
authoritarian or has authoritarian
elements you want a unipolar world where
the super powerful military because of
the AI is one that's a democracy it's a
much more complicated world
geopolitically when you have two
superpowers with super powerful Ai and
one is authoritarian so that's the case
he makes and so we want to uh the United
States wants to use export controls to
slow down to make sure that China can't
do these gigantic uh training
runs that would be presumably required
to build AGI this is very abstract I think this is the goal of how some people describe export controls this super powerful AI and you touched on the training run idea there's not many worlds where China
cannot train AI models I think export
controls are capping the amount of
compute or the density of compute that
China can have and if you think about
the AI ecosystem right now as all of
these AI companies Revenue numbers are
up and to the right the AI usage is just
continuing to grow more gpus are going
to inference a large part of export
controls if they work is just that the
amount of AI that can be run in China is
going to be much lower so on the
training side deep seek V3 is a great example where you have a very focused team that can still get to the frontier of AI and these 2,000 gpus are not that hard to get all things considered in the world
they're still going to have those gpus
they're still going to be able to train
models but if there's going to be a huge
market for AI if you have strong export
controls and you want to have 100,000 gpus just serving the equivalent of chat GPT clusters with good export controls it also just makes it so that AI can be used much less and I think that is a
much easier goal to achieve than trying
to debate on what AGI is and if you have
these extremely intelligent autonomous
AIS and data centers like those are the
things that could be running in these
GPU clusters in the United States but
not in China to some extent training a
model does effectively nothing right you just have a model the thing that Dario is sort of speaking to is the
implementation of that model once
trained to then create huge economic
growth huge increases in military
capabilities huge increases in
productivity of people uh betterment of
lives whatever whatever you want to
direct super powerful AI towards you can
but that requires a significant amount of compute right and so the US government has effectively said um and forever right like training will
always be a portion of the total compute
um you know we mentioned meta 400,000
gpus only 16,000 made llama right so the
the percentage that meta is dedicating
to inference now this might be for
recommendation systems that are trying
to hack our mind into spending more time
and watching more ads or if it's if it's
or if it's for a super powerful AI
that's doing productive things it doesn't matter about the exact use our economic system decides that compute can be delivered in whatever way we want whereas with China
right you know export
restrictions great you're never going to
be able to cut everything off right uh
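as a quick back-of-envelope on the training versus inference split mentioned above, using the round numbers from the conversation (400,000 GPUs total at Meta, 16,000 for Llama training); this is illustrative arithmetic, not Meta's actual accounting:

```python
# Rough split of Meta's fleet implied by the figures above
# (400,000 total GPUs, ~16,000 used to train Llama) -- illustrative only.
total_gpus = 400_000
llama_training_gpus = 16_000

training_share = llama_training_gpus / total_gpus
print(f"training share: {training_share:.0%}")        # -> 4%
print(f"everything else: {1 - training_share:.0%}")   # -> 96%
```

the point being that even for a lab training frontier models, the overwhelming majority of compute goes to inference and other workloads.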
and that's that's like I think that's
quite well understood by the US
government uh is that you can't cut
everything off um you know and they'll
make their own chips and and they're
trying to make their own chips they'll
be worse than ours but you know this is
the whole point is to just keep a gap
right um and therefore at some point you know in a world of 2 to 3% economic growth this is really dumb by the way right to cut off high-tech trade rather than make money off of it but
in a world where super powerful AI comes
about and then starts creating
significant changes in society which is
what all the AI leaders and big tech
companies believe I think super powerful
AI is going to change society massively
and therefore this compounding effect of
the difference in compute is really
important there's some sci-fi out there where like AI progress is measured in how much power is delivered to compute right that's sort of a way of thinking about what the economic output is just how much power you direct towards that AI should we
talk about reasoning models with this as
a way that this might be actionable as
something that people can actually see
so the reasoning models that are coming
out with R1 and o1 they're designed to
use more compute there's a lot of Buzzy
words in the AI Community about this
test time compute inference time compute
whatever but um Dylan has good research
on this you can get to the specific
numbers on the ratio of when you train a
model you can look at things about the
amount of compute used at training and
amount of compute used at inference
these reasoning models are making
inference way more important to doing
complex tasks in the fall in December open AI announced this o3 model there's another thing in AI when things
move fast we get both announcements and
releases announcements are essentially
blog posts where you pat yourself on the
back and you say you did things and
releases are where the models are out there the papers are out there etc so open AI has
announced o3 we can check if o3 mini
is out as of recording potentially but
that doesn't really change the point
which is that the breakthrough result was on something called the ARC-AGI task which is the abstraction and reasoning corpus a task for artificial general intelligence um François Chollet is the guy behind it it's a multi-year old paper it's a
brilliant Benchmark and the number for
openai o3 to solve this was that it used some large number of samples in the API the API has like thinking effort and number of samples they used a thousand
samples to solve this task and it comes
out to be
like five to $20 per question which
you're putting in effectively a
math puzzle and then it takes orders of
dollars to answer one question and this
is a lot of compute if this is going to
take off in the US Open AI needs a ton
of gpus on inference to capture this
they have this um open AI chat GPT Pro
subscription which is $200 a month which
Sam said they're losing money on which
means that people are burning a lot of
gpus on inference and I've signed up
with it I've played with it I don't
think I'm a power user but I use it
and it's like that is the thing that a Chinese company with medium strong export controls there will always be loopholes might not be able to do at all and the main result for o3 is
also a spectacular coding performance
and if that feeds back into AI companies
being able to experiment better so
presumably the idea is for an AGI a much larger fraction of the compute would be used for this test time compute for the reasoning the AGI goes into a room and thinks about how to take over the world and you know comes back in 2.7 hours that is what's going to take a lot of compute this is
what people like CEOs or leaders of open
Ai and anthropic talk about is like
autonomous AI models which is you give
them a task and they work on it in the
background I think my personal
definition of AGI is much simpler like I
I think language models are a form of
AGI and all this super powerful stuff is
a next step that's great if we get these
tools but a language model has so much
value in so many domains it is a general
intelligence to me but this next step of
agentic things where they're independent
and they can do tasks that aren't in the
training data is the future outlook
that these AI companies are driving for
I think the terminology here that Dario uses is super powerful AI so I
agree with you on the AGI I think we
already have something like that that's exceptionally impressive that Alan Turing would for sure say is AGI but
he's referring more to something that once in possession of it you would have a
significant military and geopolitical
advantage over other nations so it's not
just like
you can ask it how to cook an omelet and
he has a much more positive view in his
essay Machines of Loving Grace I've
read into this I don't have enough
background in physical sciences to gauge
exactly how confident I am and if AI can
revolutionize biology but I am safe
saying that AI is going to accelerate
the progress of any computational
science so we're doing a depth first search here on topics uh taking a tangent
of a tangent so let's continue uh on
that depth first search
uh you said that you're both feeling the
AGI so what's your
timeline Dario 2026 for the super
powerful AI That's you know that's
basically agentic to a degree where it's
a real security threat that level of AGI
what's your timeline I don't
like to attribute specific abilities
because predicting specific abilities
and when is very hard I think mostly if
you're going to say that I'm feeling the
AGI is that I expect continued rapid
surprising progress over the next few
years so something like R1 is less
surprising to me from Deep seek because
I I expect there to be new paradigms
where substantial progress can be made
and deep seek R1 is so unsettling
because we're kind of on this path with chat GPT it's like it's getting
better it's getting better it's getting
better and then we have a new direction
for for changing the models and we took
one step like this and we like took a
step up so it looks like a really steep slope and then we're going to just
take more steps so like it's just really
unsettling when you have these big steps
and I expect that to keep happening I
see I've tried open AI operator I've tried Claude computer use they're not there
yet I understand the idea but it's just
so hard to predict what is the
Breakthrough that will make something
like that work and I think it's more
likely that we have breakthroughs that
work and things that we don't know what
they're going to do so like everyone
wants agents Dario has very eloquent way
of describing this and I just think that
it's like there's going to be more than
that so like just expect these things to
come I'm going to have to try to pin you
down to a date on the AGI timeline uh
like the nuclear weapons moment the moment where on the geopolitical stage there's a real shift you know because we're talking about export controls when
do you think just even to throw out a
date when do you think that would be
like for me it's probably after 2030 that's what I would say so define that right because to me it kind
of almost has already happened right you
look at elections in India and Pakistan
people get AI voice calls and think
they're talking to the politician right
the AI diffusion rules which were enacted
in the last couple weeks of the Biden
admin and looks like the Trump admin
will keep and potentially even
strengthen limit cloud computing and GPU
sales to countries that are not even
related to China it's like this is Portugal and all these like normal countries are on the you-need-approval-from-the-US list like yeah Portugal and
like you know like like all these
countries that are allies right
Singapore right like they they freaking
have F-35s and we don't let them buy gpus
like this is this to me is already to
the scale of like you know well that
just means that uh the US military is
really nervous about this new technology
that doesn't mean the technology is
already there so like they might be just
very cautious about this thing that they
don't quite understand but that's a
really good point sort of the robocalls swarms of semi-intelligent bots
could be a weapon could be doing a lot
of social engineering I mean there's
tons of talk about you know from the
2016 elections like Cambridge analytica
and all this stuff Russian influence I
mean every country in the world is
pushing stuff onto the internet and has
narrative they want right like that's
every every like technically competent
whether it's Russia China us Israel Etc
right you know people are pushing
viewpoints onto the internet en masse and language models crash the cost of very intelligent sounding language
there's some research that shows that
the distribution is actually the
limiting factor so language models haven't yet made misinformation particularly worse or changed the equation
there the internet is still ongoing I
think there's a blog AI snake oil and some of my friends at Princeton that write on this stuff so there is research
it's like it's a default that everyone
assumes and I would have thought the
same thing is that misinformation gets far worse with language models but I think in terms of internet
posts and things that people have been
measuring it hasn't been an exponential
increase or something extremely
measurable in things you're talking
about with like voice calls and stuff
like that it could be in modalities that
are harder to measure so it's it's
something that it's too soon to tell in
terms of I think that like political instability via the web is monitored by a lot of researchers
to see what's happening I think you're asking about like the AGI thing if you make me give a year I would be like okay I have AI CEOs saying this
they've been saying two years for a
while I think that there are people like Dario at anthropic that have thought about this so deeply I need to take their words seriously but also understand that they have different incentives so
I would be like add a few years to that
which is how you get something similar
to 2030 or a little after 2030 I think
to some extent we have capabilities that
hit a certain point where any one person
could say oh okay if I can leverage
those capabilities for x amount of time
this is AGI right call it '27 '28 but then
the cost of actually operating that
capability yeah this is going to be my
point is so extreme that no one can actually deploy it at scale en masse to
actually completely revolutionize the
economy on a click on a snap of a finger
so I don't think it will be like a snap of the finger moment because of physical constraints rather it'll be a you know oh the
capabilities are here but I can't deploy
it everywhere right and so one one
simple example going back sort of to
2023 was when uh you know Bing with GPT-4 came out and everyone was freaking out
about search right perplexity came out
if you did the cost on like hey
implementing GPT-3 into every Google
search was like oh okay this is just
like physically impossible to implement
right and as we step forward going back to the test time compute thing right a query you know you ask chat GPT a question it costs cents right for their most capable model of chat to get a query back but to solve an arc AGI problem though it costs 5 to 20 bucks right and it's only going up from there this is a 1,000 to 10,000x factor difference in cost to respond to a query versus do a task
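the factor quoted here pencils out if a regular chat query costs a fraction of a cent to a cent; both per-query chat costs below are assumptions for illustration, while the $5 to $20 per task is the range from the conversation:

```python
# Back-of-envelope for the cost gap between a chat query and an ARC-AGI task.
# The chat-query costs are assumed; $5-$20 per task is the figure quoted above.
chat_query_cost_usd = (0.002, 0.01)   # assumed: fraction of a cent to a cent
arc_task_cost_usd = (5.0, 20.0)       # quoted range per ARC-AGI question

low = arc_task_cost_usd[0] / chat_query_cost_usd[1]    # cheapest task / priciest query
high = arc_task_cost_usd[1] / chat_query_cost_usd[0]   # priciest task / cheapest query
print(f"{low:,.0f}x to {high:,.0f}x")  # -> 500x to 10,000x
```

different assumed per-query costs move the endpoints, but the three-to-four orders of magnitude gap is the point.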
and the task is simple to some extent um you know but it's also like what are the tasks that we want an AGI to do okay quote unquote AGI what we have today can do arc
AGI three years from now it can do much
more complicated problems but the cost
is going to be measured in thousands and hundreds of thousands of
dollars of GPU time and there just won't
be enough power gpus infrastructure to
operate this and therefore shift everything in the world at the snap of a finger but at that moment who gets to control and point the AGI at a
task and so this was in Dario's post
that he's like hey China can effectively
and more quickly than us Point their AGI
at military tasks right and they have
been in many ways faster at adopting
certain new technologies into into their
military right especially with regards
to drones right uh the US maybe has a long-standing you know large air force sort of you know fighter jet type of thing bombers but when it comes to asymmetric
arms such as drones they've
completely leapfrogged the US and the
west and the the fear that Dario is sort
of pointing out there I think is that
yeah great we'll have AGI in the
commercial sector uh the US military
won't be able to implement it super fast
Chinese military could and they could
direct all their resources to
implementing it in the military and
therefore solving you know military
logistics or solving some other aspect like disinformation targeted at a certain set of people so they
can flip a country's politics or
something like that that is actually
like catastrophic versus you know the US
just wants to you know because it'll be
more capitalistically allocated just
towards whatever is the highest return on investment which might be like building
you know factories better or whatever so
everything I've seen uh people's
intuition seems to fail on robotics so
you have this kind of General optimism
I've seen this on self-driving cars
people think it's much easier problem
than it is similar with drones here I
understand it a little bit less but I've
just seen the reality of the war in
Ukraine and the usage of drones on both
sides and it seems that humans still far
outperform any any fully autonomous
systems AI is an assistant but humans
drive FPV drones where the human controlling most of it just far far far outperforms the AI system so I think it's
not obvious to me that we're going to
have swarms of autonomous robots anytime
soon in the military context maybe the
the fastest I can imagine is 2030 which
is why I said 2030 for the super
powerful AI whenever you have large
scale swarms of robots doing military
actions that's when the world just
starts to look different to me so that's
the thing I'm really worried about but
there could be cyber war type of technologies that uh from social engineering to actually just swarms of bots that find attack vectors in our code bases and shut down power grids that kind of stuff and it could
be one of those things like on any given
weekend or something power goes out
nobody knows why and the world changes
forever just power going out for two
days in all of the United States that
will lead to murder to
chaos but going back to export controls
do you see that as a useful
way to
uh control the balance of power
geopolitically in the context of AI and
I think going back to my viewpoint
is if you believe we're in the sort of
uh stage of economic growth and change
that we've been in for the last 20 years
the export controls are absolutely
guaranteeing that China will win long
term right if you do not believe AI is
going to make significant changes to
society in the next 10 years or five
years right five-year timelines are sort of what the executives of AI companies and even big tech companies believe but even 10-year timelines you know are reasonable but
once you get to hey these timelines are uh below that time period
then the only way to sort of like create
a sizable advantage or disadvantage for
America versus China is if you constrain
compute because Talent is not really
something that's constraining right
China arguably has more Talent right
more stem graduates more programmers the
US can draw upon the world's people which it does there's tons of you know foreigners in the AI industry so many of
these AI teams are all people without a
US passport yeah yeah I mean many of
them are are are Chinese people who are
moving to America right and that's
that's great that's exactly what we want
right um but there's that Talent is one
aspect but I don't think that's one that
is a measurable advantage for the US or not it truly is just about compute right now even on the compute
side uh when we look at chips versus
data centers right China has the
unprecedented ability to build ridiculous sums of power like clockwork right they're always building more and more power they've got steel mills that individually are the size of the entire US industry right and they've got
aluminum Mills that consume gigawatts
and gigawatts of power right and when we
talk about what's the biggest data
center right open AI made this huge thing about Stargate their announcement there and once it's fully built out in a few years it'll be 2 gigawatts of power right and this is still smaller than the largest you know
industrial facilities in China right
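to give a sense of what 2 gigawatts means in accelerators: the per-GPU all-in power draw below is an assumption for illustration, not a published Stargate spec:

```python
# How many accelerators might a 2 GW site power? All numbers are assumptions.
site_power_w = 2e9            # 2 gigawatts, the fully-built-out figure above
per_gpu_all_in_w = 1400       # assumed: GPU + host + networking + cooling overhead

print(f"~{site_power_w / per_gpu_all_in_w:,.0f} accelerators")  # -> ~1,428,571
```

a lower or higher all-in wattage assumption shifts the count, but the order of magnitude is roughly a million-plus accelerators.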
China if they wanted to build the
largest data center in the world if they
had access to the chips could so it's just a question of when not if right so their industrial
capacity far exceeds the United States
exactly as to the manufacturing stuff so long term they're going to be manufacturing chips there chips are a little bit more specialized I'm
specifically referring to the data
centers right chip fabs take huge
amounts of power don't get me wrong
uh that's not necessarily the gating factor there the gating factor on how fast people can build the largest clusters today in the US is power right whether it's power generation power transmission uh
substations and uh you know uh all these
sorts of Transformers and all these
things uh building the data center these
are all constraints on the US industry's ability to build larger and larger training systems as well as deploying more and more inference compute I think
we need to make the point clear on why the time is now for people that don't think about this because essentially with
export controls you're making it so
China cannot make or get um Cutting Edge
chips and the idea is that if you time
this wrong China is pouring a ton of
money into their chip production and if
you time it wrong they are going to have
more capacity for production more
capacity for energy and figure out how
to make the chips and have more capacity
than the rest of the world to make the
chips because everybody can buy they're
going to sell their Chinese Chips to
everybody they might subsidize them and
therefore if AI takes a long time to
become differentiated we've kneecapped the financial performance of American companies Nvidia can sell less tsmc cannot sell to China so therefore we have less demand to keep driving the production cycle so
that's the assumption behind the timing being less than 10 or five years if it's above that right China will win
because of these restrictions long term
unless AI does something in the short
term which I believe AI will do you know
make massive changes to society in the
medium short term right um and so that's
that's the big unlocker there um and
even even today right if xingping
decided to get you know quote unquote
scale pilled right uh I.E decide that
scaling laws are what matters right just
like the US executives like Satya Nadella and Mark Zuckerberg and Sundar and all these US executives of the biggest most powerful tech companies have decided they're scale-pilled and
they're building multi-gigawatt data
centers right whether it's in Texas or
Louisiana or Wisconsin whatever wherever
it is they're building these massive
things that cost as much as their entire
budget for spending on data centers
globally in one spot right this is what they've committed to for next year the year after etc and so they're so
convinced that this is the way that this
is what they're doing but if China
decided to they could do it faster than
us but this is this is where the
restrictions come in it is not clear
that China as a whole has decided you
know from the highest levels that this
is a priority the US sort of has right
uh you know you see Trump talking about
deep seek and uh Stargate within the
same week right and the Biden admin as well had a lot of discussions about AI and such uh it's clear
that they think about it only just last
week did deep seek meet the second in
command of China right like they have
not even met the top right they haven't met Xi Xi hasn't sat down with them and they only just released a subsidy of a trillion RMB uh you know roughly $160
billion um which is close to the
spending of like Microsoft and meta and
Google combined right for this year so
it's like they're realizing it just now but that's where
the export restrictions come in and say
hey you can't ship the most powerful US chips to China uh you can ship a cut-down version and you can't ship the most powerful chips to all
these countries who we know are just
going to rent it to China uh you have to
limit the numbers right and the same with tools manufacturing equipment tools all these different aspects but it all stems from AI and then whatever downstream can slow them down in AI and
so the the entire semiconductor
restrictions you read them they are very
clear it's about AI and military-civil fusion of technology right there it's
very clear and then from there it goes
oh well we're Banning them from buying
like lithography tools and etch tools
and deposition tools and oh this random
like you know subsystem from a random
company that's like tiny right like why
are we Banning this because all of it
the US government has decided is
critical to AI systems I think the key point is like the transition from 7 nanometer to 5 nanometer chips where I think it was Huawei that had the 7 nanometer chip a few years ago which caused another political brouhaha almost like this moment and then it's like asml EUV what is that like extreme ultraviolet lithography to set context
on the chips right what Nathan's
referring to is in 2020 Huawei released their Ascend 910 chip uh which was an AI chip the first one on 7 nanometer before Google did before Nvidia did and they submitted it to the MLPerf benchmark which is sort of an industry standard for machine learning performance benchmarks
um and and it did quite well and it was
the best chip at the submission right
this was this was a huge deal um the
Trump admin of course um it was 2019 right banned Huawei from getting 7 nanometer chips from tsmc and so then they had to switch to using internally domestically produced chips
which was a multi-year setback many
companies have done seven nanometer
chips and the question is like we don't
know how much
Huawei was subsidizing production of
that chip like Intel has made Seven
nanometer chips that are not profitable
and things like this so this is how it
all feeds back into the economic engine
of export controls well so you're saying
that for now Xi Jinping has not felt the
AGI but it feels like the Deep seek
moment yeah might like there might be
meetings going on now where he's going
to start wearing the same t-shirt and
things are going to escalate I mean like this he may have woken up last week right Liang Wenfeng met the vice chair the second in command guy um and they had a meeting and then the next day they announced the AI subsidies which are a trillion RMB right so it's possible
that this deep seek moment is truly the
beginning of a cold war that's what a
lot of people are worried about people
in AI have been worried that this is
going towards a cold war or already is
but it's not deep seek's fault a bunch of factors came together where there was an explosion I mean it all has to do with the stock going down probably but it's just some like mass hysteria that happened that eventually led to Xi Jinping having meetings and
waking up to this idea and the US
government realized this on October 7th 2022 before chat GPT released that restriction on October 7th which dropped and shocked everyone and it was very clearly aimed at AI everyone was like what the heck are you doing stable diffusion was out then but not chat GPT yeah not chat GPT so there were like starting to be rumblings of what gen AI can do to society but it was very clear I think to
at least like National Security Council
and those sorts of folks that this
was where the world is headed this cold
war that's happening so are there any concerns that the uh export
controls push
China to uh take military action on
Taiwan this is this is the big risk
right the further you push China away from having access to cutting edge American and global technologies the more likely they are to say well because I can't access it I might as well make it so no one should access it right um
and there's a few like interesting
aspects of that right like you know
China has an urban rural divide like no
other um they have a male female birth
ratio like no other to the point where
you know if you look in most of China
it's like the ratio is not that bad but
when you look at single dudes in rural
China it's like a 30 to 1 ratio um and
those are disenfranchised dudes right
like uh quote unquote the US has an incel problem like China does too it's just theirs are sort of crushed down in some way what do you do with these
people and at the same time you're not
allowed to access the most important
technology at least the US thinks so
China is maybe starting to think this is
the most important technology uh by
starting to dump subsidies in it right
they thought EVs and Renewables were the
most important technology they dominate
that now right uh they started thinking about semiconductors in you know the late 2010s and early 2020s and now
they've been dumping money and they're
catching up rapidly um and they're going
to do the same with AI right because
they're very talented right so uh the question is like when does this hit a breaking point right um and if China sees this as hey they're not getting access anyway and starting a true hot war right taking over Taiwan or trying
to subvert its democracy in some way or
blockading it um hurts the rest of the
world far more than it hurts them this
is something they could potentially do
right and so is this pushing them towards that uh potentially right I'm not quite a geopolitical person but you know it's obvious that the world regime of peace and trade is like super awesome for economics uh but at some point it could break right I
think we should comment on like why the Chinese economy would be hurt by that it's that they're export heavy I think the United States buys so much like if that goes away that hurts their economy well also they just like
would not be able to import raw
materials from like all over the world
right the US would just shut down the strait of Malacca and like you know at the same time in the US like you
could argue almost all the GDP growth in
America since you know the 70s has been
either population growth or Tech right
um because you know your life today is not that much better than someone's from the 80s outside of tech right you still you know cars
they all have semiconductors in them
everywhere fridges semiconductors
everywhere there's these funny stories
about how Russians were taking apart
laundry machines because they had
certain like Texas Instruments chips that they could then repurpose and put into like their um their anti-missile systems right like their S-400 or
whatever you would know more about this
but uh there's all sorts of like
everything about semiconductors is so
integral to every part of our lives so
can you explain the role of tsmc in the
story of semiconductors and uh maybe
also how the United States can break the
Reliance on tsmc I don't think it's
necessarily breaking the Reliance I
think it's uh getting tsmc to you know
build in the US uh but so taking a step back right tsmc produces most of
the world's chips right especially on
The Foundry side um you know there's a
lot of companies that build their own
chips uh Samsung Intel um you know STMicro Texas Instruments you know Analog Devices NXP all these kinds of companies build their own chips but more and
more of these companies are Outsourcing
to tsmc and have been for multiple
decades can you explain the the supply
chain there and where most of tsmc is in
terms of manufacturing sure so
historically the supply chain was companies would build their own chips you know a company would be started uh they'd design the chip build the chip and sell it um over time this
became really difficult because the cost
of building a Fab continues to compound
every single generation of course figuring out the technology for it is incredibly difficult
regardless but just the dollars and
cents that are required ignoring you
know saying hey yes I have all the
technical capability which it's really
hard to get that by the way right
Intel's failing Samsung's failing Etc um but
if you look at just the dollars to spend
to build that next Generation Fab it
keeps growing right sort of like you know Moore's law halving the cost of chips every two years there's a separate law that's sort of like doubling the cost of Fabs every handful of years and
so you look at a Leading Edge Fab that
is going to be profitable today that's
building you know three nanometer chips
or two nanometer chips in the future
that's going to cost north of 30 to 40 billion dollars right um and that's just for
like a token amount that's for like
that's like the base building block and
you probably need to build multiple
right and so when you look at the
industry uh over the last you know if I
go back 20 30 years ago there were 20 30
companies that could build the most
advanced chips and then they would
design them themselves and sell them
right so companies like AMD would build
their own chips uh Intel of course still
builds their own chips they're very
famous for but IBM would build their own
chips and you know you could keep going
down the list all these companies built
their own chips slowly they kept falling
like flies and that's because of what
tsmc did right they created The Foundry business model which is I'm not going to design any chips I'm just going to contract manufacture chips for other people um and one of their early customers is NVIDIA right Nvidia is the only Semiconductor Company uh
that's worth you know that's doing more
than a billion dollars of Revenue that
was started in the era of Foundry right
every other company started before then
and at some point had Fabs which is
actually incredible right um you know
like AMD and Intel and
broadcom it's like everyone had Fabs at some point or you know some companies like broadcom it was like a merger or amalgamation of various companies that rolled up but even today
broadcom has Fabs right they build
iPhone uh RF radio chips sort of in
Colorado for for you know for Apple
right like there's there all these
companies had Fabs and for most of the
Fabs they threw them away or sold them
off or they got rolled into something
else uh and now everyone relies on tsmc
right including Intel their latest PC
chip uses tsmc chips right it also uses
some Intel chips but it uses tsmc
process can you explain why the foundry
model is so successful for these
companies why are they going with economies of scale yeah so so
I mean like like I mentioned right the
cost of building a Fab is so high the
R&D is so difficult um and uh when you
look at like these like companies that
had their own vertical stack there was
an Antiquated process of like okay like
I'm so hyper customized to each specific
chip right but as we've gone through the
history of sort of like the last 50
years of of electronics and
semiconductors you need more and more specialization right because Moore's law has died um Dennard scaling has died IE
chips are not getting better just for
free right you know from manufacturing
you have to make real architectural
Innovations right Google is not just
running on Intel CPUs for web serving
they have a YouTube chip they have tpus
they have pixel chips they have a wide
diversity of chips that uh you know
generate all the economic value of
Google right running you know it's
running all the services and stuff and
so and this is just Google and you could
go across any company in the industry
and it's like this right cars contain
5,000 chips you know 200 different
varieties of them right all these random
things a Tesla door handle has two chips
right like it's like ridiculous um and
it's a cool door handle right you know you don't think about it but it has two really cheap like penny chips in there right anyway
so so as you have more diversity of
chips as you have more specialization
required and the cost of Fabs continues
to grow you need someone who is laser
focused on building the best process
technology and making it as flexible as
possible I I think you can say it simply
which is the cost for Fab goes up and if
you are a small player that makes a few types of chips you're not going to have the demand to pay back the cost of the Fab whereas tsmc can have many different customers and aggregate all this demand into one place and then they're the only person that makes enough money building chips to build the next Fab so this is kind of why the companies slowly get killed because they have a chip 10 years ago that is profitable
and is good enough but the cost to build
the next one goes up they may try to do
this fail because they don't have the
money to make it work and then they
don't have any chips or they build it
and it's too expensive and they just have you know there's more failure points
right you know you could have one little
process related to like some sort of
like uh chemical etch or some sort of
like plasma etch or you know some little
process that screws up you didn't
engineer it right and now the whole
company falls apart you can't make chips
right and so super super powerful
companies like Intel they had like the
weathering storm to like hey they still
exist today even though they really
screwed up their manufacturing six seven
years ago but in the case of like AMD
they almost went bankrupt they had to sell their Fabs to Mubadala uh UAE right um and like that became a separate company called GlobalFoundries which
is a Foundry firm um and and then AMD
was able to then focus on like on the
return back up was like hey let's focus
on making chiplets and a bunch of
different chips for different markets um
and focusing on specific workloads
rather than you know all of these
different things and so you get more
diversity of chips you have more
companies than ever designing chips but
you have fewer companies than ever
manufacturing them right and this is
this is where tsmc comes in is they've
they've just been the best right they
are so good at it right they're customer
focused they make it easy for you to
fabricate your chips they take all of
that complexity and like kind of try and
Abstract a lot of it away from you um
they make good money they don't make
insane money but they make good money um
and and they're able to aggregate all
this demand and continue to build the
next Fab the next Fab the next Fab so
why is Taiwan so special for tsmc why is
it happening there can it be replicated
inside the United States yeah so there's
there's aspects of it that I would say
yes and aspects that I'd say no right um
tsmc is way ahead because uh former you
know executive Morris Chang of Texas
Instruments uh wasn't promoted to CEO
and he's like screw this I'm going to go
make a my own chip company right and he
went to Taiwan and made tsmc right and
there's there's a whole lot more story
there um so it could have been Texas Instruments it could have been you know tsmc but as the Texas semiconductor manufacturing company right instead of the Taiwan one right but you know so
there is that whole story there but the
sitting here in Texas I mean and that
sounds like a human story like he didn't get promoted just the Brilliance of Morris Chang you know which I wouldn't
underplay but there's also like a
different level of like how how this
works right so in
Taiwan you know like the top percent of graduates of students that go to the best school which is NTU the top percent of those all go work at tsmc right and guess what their pay is
their starting pay is like $80,000
$70,000 right which is like that's like
starting pay for like a good graduate in
the US right not not the top the top
graduates are making hundreds of
thousands of dollars at the Googles and
the Amazon and now I guess the open AI
of the world right um so so there is
there is a large dichotomy of like what
is the top 1% of the society doing and
where are they headed because of
economic reasons right Intel never paid
that well right um and it
didn't make sense to them right that's
that's one aspect right where is the
best going second is the work ethic
right like you know we we like to work
you know you work a lot we work a lot
but at the end of the day um when
there's a you know what is the time and amount of work that
you're doing and what does a Fab require
right Fabs are not work from home jobs
they are you go into the Fab and it's grueling work right um there's there's
hey if there is any amount of vibration
right an earthquake happens vibrates the
machines you know they're either broken you've scrapped some of your production and then in many
cases they're like not calibrated
properly so so when tsmc when there's an
earthquake right recently there's been
an earthquake tsmc doesn't call their
employees they just they just go to the
Fab and like they just show up the
parking lot gets slammed and people just
go into the Fab and fix it right like
it's like ants right
like it's like you know a hive of ants
doesn't get told by the queen what to do
the ants just know it's like one person
just specializes on this one task and
it's like you're going to take this one
tool and you're the best person in the
world and this is what you're going to
do for your whole life is this one task
in the Fab which is like some special
chemistry plus Nano manufacturing on one
line of tools that continues to get
iterated and yeah it's just like it's
like specific plasma etch for removing
silicon dioxide right that's all you
focus on your whole career and it's like
such a specialized thing and and so it's
not like the tasks are transferable AI
today is awesome because like people can
pick it up like that uh semiconductor
manufacturing is is very Antiquated and
difficult none of the materials are
online for people to read easily uh and
learn right the papers are very dense
and like it takes it takes a lot of
experience to learn and so it makes the
barrier to entry much higher too so so
when you talk about hey you have all
these people that are super specialized
they will work you know 80 hours a week
in a factory right in a Fab and if
anything goes wrong they'll go show up
in the middle of the night because some
earthquake their wife's like there was
an earthquake he's like great I'm going to go to the Fab see would you like as an American do that right it's
like these sorts of things are like what
you know I guess are the exemplifying
like why tsmc is so amazing now can you
replicate it in the US uh let's not
ignore intel was the leader in manufacturing for over 20 years they brought every technology to Market first besides EUV strained silicon high-k metal gates finfet you know the list goes on and on of technologies that Intel brought to Market first made the most money from and manufactured at scale first best highest profit margins right so it's not that Intel can't do this right it's that the culture uh has broken
right um you've invested in the wrong
things they said no to the iPhone they
they had all these different things
regarding like you know mismanagement of
Fabs mismanagement of designs this
lockup right and at the same time all
these brilliant people right these like
50,000 phds uh you know or or Masters
that have been working on specific
chemical or physical processes or nanom
manufacturing processes for decades in
Oregon they're still there they're still
producing amazing work it's just like
getting it to the last mile of
production at high yield where you can
design where you can manufacture dozens
and hundreds of different kinds of chips
you know and it's good the customer experience has broken right you know
it's that customer experience it's like
the like part of it is like people will
say intel was too pompous in the 2000s
2010s right they just thought they were
better than everyone the tool guys were
like oh I don't think that this this is
mature enough and they're like ah you
just don't know we know right this sort
of stuff would happen um and so can the US bring Leading Edge semiconductor manufacturing to the US emphatic yes right and we are
right it's happening like Arizona is
getting better and better as time goes
on tsmc has built you know roughly 20 %
of their capacity for 5 nanometer in the
US right um now this is nowhere near
enough right uh you know 20% of capacity
in the US is like nothing right um and
furthermore this is still dependent on
Taiwan existing right all there's sort
of important way to separate it out
there's R&D and there is high volume
manufacturing there are there
effectively there are three places in
the world that are doing Leading Edge
R&D there's Hsinchu Taiwan there's Hillsboro Oregon and there is Pyeongtaek uh South Korea right these
three places are doing the Leading Edge
R&D for the rest of the world's leading
Edge semiconductors right um now
manufacturing can be distributed more
globally right um and this is sort of
where this dichotomy exists of like
who's actually modifying the process
who's actually developing the next
generation one who's improving them it is Hsinchu it is Hillsboro it is Pyeongtaek right it is not the rest of these uh you know Fabs like Arizona right Arizona is a paperweight if Hsinchu disappeared off the face of the planet um you know within a year couple years Arizona would stop producing too right it's actually like pretty critical
one of the things I like to say is if I
had like a few missiles I know exactly
where I could cause the most economic
damage right it's not targeting the
White House right it's R&D centers it's
the R&D centers for tsmc Intel Samsung
and then some of the memory guys Micron
and Hynix because they Define the future evolution of semiconductors and
everything's moving so rapidly that it
really is fundamentally about R&D
and it is all about
tsmc huh and so tsmc you know you cannot
purchase a vehicle without tsmc chips
right you cannot purchase a fridge
without tsmc chips you cannot you you
like I think one of the few things you
can purchase ironically is a Texas
Instruments like graphing calculator
right because they actually manufacture
in Texas but like outside of that like a
laptop a anything you servers right gpus
none of this stuff can exist and this is
without without tsmc and in many cases
it's not even like the Leading Edge you
know sexy 5 nmet chip 3 nmet chip 2
neter chip oftentimes it's just like
some stupid power IC that's like
converting from like you know some
voltage to another right and it's made
at tsmc right this is what China is
investing in as well it's like they can
build out this longtail Fab where the
techniques are much more known you don't
have to figure out these problems with
euv they're investing in this and then
they have large supply for things like
the car door handles and the random
stuff and that trickles down into this
whole economic discussion as well which is
they have far more than we do and having
supply for things like this is crucial
to normal life so they're doing the
they're starting to invest in high
volume manufacturing but they're not
doing R&D so they they do R&D on their
own they're just way behind right um so
I would say like in 2015 uh China Had A
Five-Year Plan where they defined certain goals for 2020 and 2025 including
like 80% domestic production of
semiconductors uh they're not they're
not going to hit that right to be clear
but they are there are in certain areas
really really close right like byd is
probably going to be the first company
in the world to not have to use tsmc because they have their own Fabs right uh for making chips now they still have to buy some chips from foreign suppliers uh for example like around self-driving ADAS capabilities cuz
those are really high-end but at least
like you know an internal combustion engine has 40 chips you know just for like controlling flow rates and all these things and EVs are even more complicated so all these different power ICs and Battery
management controllers and all these
things they're they're insourcing right
um and this is this is something that
like China is been doing since 2015 now
as far as like the trailing Edge they're
getting so much capacity there as far as
the Leading Edge right I.E this 5
nanometer and so on so forth right where
gpus they are still behind and this is
the US restrictions are trying to stop
them in the ladder but you know all
that's happened you know is yes they've
slowed down their 5 nanometer 3 nanometer Etc but they've accelerated their hey 45 nanometer 90 nanometer power IC or analog IC or you know
random chip in my keyboard right that
kind of stuff so there is an angle of like the US's actions from the angle of the export controls have been so inflammatory at slowing down China's
progress on the Leading Edge that
they've turned around and have
accelerated their progress elsewhere
because they know they this is so
important right if the us is going to
lock them out here what if they lock us
out here as well uh in the trailing Edge
and so going back can the US build it
here um yes but it's going to take a ton
of money I truly think like to to
revolutionize and completely insource
semiconductors would take a decade and a
trillion dollars is some of it also
culture like you said extreme competence
extreme work ethic in Taiwan I think if
you have the demand and the money is on
the line the American companies figure
it out it's going to take handholding
with the government but I I think that
the culture helps tsmc break through and
it's easier for them you know tsmc has something like 90,000 employees right it's not actually that insane an amount um
the Arizona Fab has 3,000 from Taiwan
and these people like their wives were like yeah we're not going to have kids unless you sign up for the Arizona Fab and we go to Arizona and we have
our kids there there's also a Japan Fab
where the same thing happened right and
so like these wives drove like these
like these dudes to like go to Japan or
America to have the kids there and it's
like it's an element of culture yeah
sure uh Taiwan works that hard but also
like the US has done it in the past they
could do it now right um you know we can
just import I say import the best people
in the world if we want to that's where
the immigration conversation is a tricky
one and there's been a lot of debate
over that but yeah it it seems absurdly
controversial to import the best people
in the world I don't understand why it's
controversial that's one of the ways of winning sure we agree with you
and and and like even if you can't
import those people I still think you
could do a lot to manufacture most of it in
the US if the money's there right and so
like just way more expensive it's not
profitable for a long time and that's
the context of like the chips Act is
only like $50 billion
relative to you know some of the
renewable um you know initiatives that
were passed in the inflation reduction
Act and the infrastructure act which
total in the hundreds of billions of
dollars right and so like the amount of
money that the US is spending on the
semiconductor industry is is nothing
right um whereas all these other
countries have uh structural advantages
in terms of like you know work ethic and
amount of work and like things like that
but also a number of stem graduates the
the percentile of their best going to
that right um but they also have like
differences in terms of like hey there's
just tax benefits in the law and have
been in the law for 20 years right um
and so and then and then some countries
have massive subsidies right China has
something like $200 billion of
semiconductor subsidies a year we're
talking about $50 billion in the US over
like six years right so the girth or difference in like the subsidy amounts is also huge right and
and so I think um you know Trump has
been talking about tariffing Taiwan
recently um you know that's sort of like
one of these things that's like oh okay
well like you know maybe he doesn't want
to subsidize the US semiconductor
industry obviously tariffing Taiwan is
going to cost a lot of things to go get
much more expensive but does it change
the equation for tsmc building more Fabs
in the US that's what he's sort of
positing right so can you lay out the so
we laid out the importance by the way
it's incredible how much you know about
so much we told you Dylan knows all the
stuff yeah so but okay you laid out why
tsmc is really important if we look out
into the future 10 20 years
out us China relationship seems like it
can go to a dark
place of Cold War escalated cold war or
even hot war or to a good place of uh
anything from Frenemies to cooperation
to working together so in this game
theory complicated
game uh what are the different
trajectories what should us be doing
like what do you see as the different
possible trajectories of us China
relations as uh both leaders start to
feel the AGI more and more and see the
importance of chips and the importance
of AI I mean ultimately the export
controls are pointing towards a separate
future economy I think the US has made
it clear to Chinese leaders that we
intend to control this technology at
whatever cost to global economic integration so it's hard to unwind that like the card has been played to the same extent they've also limited US companies from entering China
right so it is it is you know it's been
a long time coming you know at some
point you know there was there was a
convergence right uh but but over at
least the last decade it's been
branching further and further out right
like us companies can't enter China
Chinese companies can't enter the US the
US is saying hey China you can't get
access to our Technologies in certain
areas and China's rebuttal with the same
thing or around like you know they've
done some sort of specific materials in
you know gallium and things like that
that they've tried to limit the US on um
one of the there's a US drone company
that's not allowed to buy batteries and
they have like military customers and
this drone company just tells the
military customers like hey hey just get
it from Amazon because I can't actually
physically get them right like there's
all these things that are happening that
point to further and further Divergence
I have zero idea and I would love if we could all hold hands and sing Kumbaya but like I have zero idea how
that could possibly happen is the
Divergence
good or bad for avoiding war is it
possible that the the Divergence in
terms of manufacturer chips of training
AI systems is actually good for avoiding military conflict it's an objective fact that the world has been the most peaceful it has ever been when there are Global hegemons right or Regional hegemons right in historical context right um the Mediterranean was the most peaceful ever when the Romans were there right China had very peaceful and warring times
and the peaceful times were when
dynasties had lock hold over not just
themselves but all their tributaries
around them right um and likewise uh the
most peaceful time in human history has
been when the US was the global hegemon right the last you know handful of decades now
we we've sort of seen things start to
slide right with Russia Ukraine with
what's going on in the Middle East and
you know Taiwan risk all these different
things are starting to Bubble Up still
objectively extremely peaceful now what happens when it's not one Global hegemon but it's two obviously you know China will be you know competitive or even overtake the US like it's possible right and so this change in global hegemony it's I
don't think it ever happens like super
peacefully right when Empires fall right
which is a possible trajectory for
America they don't fall gracefully right like they don't just slide out of irrelevance usually
there's a lot of shaking um and so you
know what the US is trying to do is
maintain its top position and what China
is trying to do is become the top
position right and and obviously there
there's butting of heads here um in in
in the most simple terms and that could
take shape in all kinds of ways
including proxy
wars seems like it's already happening
like as much as I want there to be
centuries of prolonged peace it does not
it looks like further instability
internationally is ahead and and and the
US's like sort of like current task is
like hey if we control AI if we're the leader in AI and AI could significantly accelerate progress then we can maintain the global hegemony
position and therefore I I hope that
works and and and as an American like
you know kind of like okay I guess
that's going to lead to peace peace for
us uh now obviously other people around
the world get affected negatively um you
know obviously the Chinese people are
not going to be in as advantageous of a
position um if that happens but uh you
know this is sort of the reality of like
what's being done and the actions that
are being carried out so can we go back
to the specific detail of the different
Hardware there's this nice graphic in
the export
controls uh of which
uh which gpus are allowed to be exported
and which are not can you kind of
explain the difference like is there um
from a technical perspective are the
h20s
promising yeah so this goes uh and I
think we'd have to like we need to dive
really deep into the reasoning aspect
and what's going on there but the H20
you know the US has gone through
multiple iterations of the export
controls right this h800 was at one
point allowed uh back in 23 but then it got canceled and by then uh you know DeepSeek had already built their cluster of they claimed 2K I think they actually have like many more like something like 10K of those um and now this H20 is the
legally allowed chip right Nvidia
shipped a million of these last year to
China right for context they shipped like four or five million gpus total right so the percentage of gpus that were this China specific H20 is quite high right um you know roughly 20% 25% or so um
and so this H20 has been neutered in one
way but it's actually upgraded in other
ways right and you know you could think
of chips along three axes for AI right
um you know ignoring ignoring software
stack and like exact architecture just
raw specifications there's floating
Point operations right flops there is memory bandwidth and memory capacity right IE memory IO and then there is interconnect right chip-to-chip interconnections all three of these
are incredibly important for making AI
systems right because AI systems involve a
lot of compute they involve a lot of
moving memory around uh whether it be to
memory or two other chips right and so
these three vectors um the US initially had two of these vectors controlled and one of them not controlled flops and interconnect bandwidth were initially controlled um and then they said no no we're going to remove the interconnect bandwidth limit and just make it very simple only flops but now Nvidia
can now make a chip that has uh okay it's cut down on flops it's like one-third that of the h100 right on spec sheet paper performance for flops um you know in real world it's closer to like half uh or maybe even like 60% of it right but
then on the other two vectors it's just
as good uh for interconnect bandwidth
and then for memory bandwidth and memory
capacity the H20 has more memory
bandwidth and more memory capacity than
the h100 right now recently you know we
we at our research cut our estimate of nvidia's production of H20 for this year down drastically they were going to make another 2 million of those this year but they just canceled all the orders a couple
weeks ago um in our view that's because
we think that they think they're going
to get restricted right um because why
would they cancel all these orders for
H20 um because they shipped a million of
them last year they had orders in for a
couple million this year and just gone
right for H20 B20 right a successor to
H20 um and now they're all gone now why
would they do this right um I think it's
it's very clear right the the H20 is
actually better for certain tasks and
that certain task is reasoning right um
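a rough roofline sketch of why a flops-neutered but memory-upgraded chip can still win at reasoning-style autoregressive decode the spec-sheet figures below are approximate public numbers and the model size is a made-up illustration not anything stated in the conversation

```python
# in autoregressive decode every weight is read from memory once per
# generated token, so a single-stream token rate is roughly bounded by
# memory_bandwidth / bytes_of_weights (ignoring KV cache traffic,
# batching, and multi-GPU effects)

# approximate public spec-sheet figures (treat these as assumptions)
H100 = {"tflops_bf16": 989.0, "mem_bw_TBs": 3.35, "mem_GB": 80}
H20  = {"tflops_bf16": 148.0, "mem_bw_TBs": 4.0,  "mem_GB": 96}

def decode_tokens_per_s(chip, model_bytes_GB):
    """Bandwidth-bound upper limit on single-stream decode speed."""
    return chip["mem_bw_TBs"] * 1e12 / (model_bytes_GB * 1e9)

MODEL_GB = 140  # hypothetical 70B-parameter model stored in 16-bit weights

print(decode_tokens_per_s(H100, MODEL_GB))  # ceiling around 24 tokens/s
print(decode_tokens_per_s(H20, MODEL_GB))   # higher ceiling despite ~1/6 the flops
```

so on this simple memory-bound model the cut-down chip actually serves tokens faster which is the point being made about reasoning workloads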
reasoning is incredibly like different
than you know when you look at the
different regimes of models right
pre-training is all about flops right
it's all about flops there's things you
do like mixture of experts that we
talked about to trade off interconnect
or to trade off you know other aspects
and lower the flops uh and and rely more
on interconnect and memory but at the
end of the day it's flops is everything right we talk about models in terms of like how many flops they are right so like you know we talk about oh GPT-4 is 2e25 right 2 times 10 to the 25th you know 25 zeros flops floating Point operations for training right
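the arithmetic here can be sketched with the common 6 times parameters times tokens back-of-envelope for training compute the parameter and token counts below are purely illustrative assumptions not disclosed figures

```python
# back-of-envelope training compute: ~6 FLOPs per parameter per token
# (a widely used approximation, not an exact accounting)
def training_flops(params: float, tokens: float) -> float:
    return 6.0 * params * tokens

# purely illustrative model and dataset sizes, chosen to land near the
# 2e25 GPT-4 scale mentioned above; neither figure is disclosed
illustrative = training_flops(params=2.5e11, tokens=1.3e13)

# the (since rescinded) executive order's reporting threshold
REPORTING_THRESHOLD = 1e26
print(illustrative < REPORTING_THRESHOLD)  # True: such a run sits below the 1e26 bar
```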
we're talking about the restrictions for
the uh 224 right 25 what uh the US has
an executive order that Trump recently
unsigned but um which was hey 1 e26 once
you hit that number of floating Point
operations you must notify the
government and we you must share your
results with us right like there's a
level of model where the US government
must be told right and that's 1e26 and so
as we move forward this is incredibly important flops is the vector that the government has cared about historically but the other two
vectors are arguably just as important
right um and especially when we come to
this new paradigm which the world is
only just learning about over the last
six months right reasoning and do we
understand firmly which of the three
dimensions is best for reasoning so
interconnect the flops don't matter as
much is it memory memory right context length we're going to get into technical
stuff real fast say there's two articles in this one that I could show maybe Graphics that might be interesting for you to pull up oh for the listeners we're looking at the section of o1 inference architecture tokenomics how do you want to explain KV cache before we talk about this I think
like it's better to okay yeah we should
get we need to go through a lot of
specific technical things of
Transformers to make this easy for
people because it's it's incredibly
important because this changes how
models work but I think I think
resetting right why is memory so important it's because so far we've
talked about parameter counts right and
mixture of experts you can change how
many active parameters versus total
parameters to embed more data but have
less flops but more important you know
another aspect of you know what's part
of this humongous revolution in the last
handful of years is the Transformer
right and the attention mechanism
attention mechanism is that the model
understands the relationships between
all the words in its context right and
that is that is separate from the
parameters themselves right and that is
that is uh something that you must
calculate right how each token right
each word in the context length is uh
relatively uh connected to each other
right and I think Nathan you should explain KV cache better KV cache is one of the optimizations yeah so the
attention operator has three core things
it's queries keys and values qkv is the
thing that goes into this you'll look at
the equation you see that these matrices
are multiplied together these words
query key and value come from
information retrieval backgrounds where
the query is the thing you're trying to
get the values for and you access the keys and the values is what you're retrieving my background's not in information retrieval and things like this it's just fun to have the backlinks and what effectively happens is that when you're
doing these Matrix multiplications
you're having matrices that are of the
size of the context length so the number
of tokens that you put into the model
and the KV cache is effectively some
form of compressed representation of all
the previous tokens in the model so when
you're doing this, we talk about autoregressive models, you predict one token at a time. you start with whatever your prompt was, you ask a question like who was the president in 1825, the model then is going to generate its first token. for each of these tokens you're doing the same attention operator, where you're multiplying these query, key, value matrices, but the math is very nice, so that when you're doing this repeatedly, this KV cache, this key-value operation, you can keep appending the new values to it. so you keep track of the previous values you're inferring over in this autoregressive chain, you keep it in memory the whole time, and this is a really crucial thing to manage when serving inference at scale. there are far
bigger experts in this, and there are so many levels of detail that you can go into. essentially, one of the key quote unquote drawbacks of the attention operator and the Transformer is that there is a form of quadratic memory cost in proportion to the context length, so as you put in longer questions, the memory used in order to make that computation is going up in the form of a quadratic.
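to make the quadratic concrete, here's a tiny back-of-envelope sketch. the numbers are purely illustrative, and real serving stacks (FlashAttention-style kernels) avoid materializing this full score matrix, but the scaling is the point:

```python
def attention_score_bytes(seq_len: int, dtype_bytes: int = 2) -> int:
    # naive attention materializes a (seq_len x seq_len) score matrix per head:
    # every token attends to every other token, so memory grows as seq_len^2
    return seq_len * seq_len * dtype_bytes

for ctx in (1_000, 4_000, 16_000):
    gib = attention_score_bytes(ctx) / 2**30
    print(f"context {ctx:>6}: ~{gib:.3f} GiB of fp16 scores per head")

# a 4x longer context costs 16x the score memory, per head, per layer
```

that 4x-context, 16x-memory relationship is the "form of a quadratic" being described.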
you'll hear about a lot of other language model architectures that are subquadratic or linear attention forms, which is like state space models. we don't need to go down all these now. and then there's innovations on attention to make this memory usage and the ability to attend over long contexts much more accurate and high performance, and those innovations are going to help you deal with, I mean, you're highly memory constrained. they help with memory constraints and performance. so if you put
in a book into, I think Gemini is the model that has the longest context length that people are using, Gemini is known for 1 million and now 2 million context length. you put a whole book into Gemini and sometimes it'll draw facts out of it. it's not perfect, they're getting better. so there's two things. one, to be able to serve this on the memory level, Google has magic with their TPU stack where they can serve really long contexts. and then there's also many decisions along the way to actually make long context performance work. this implies the data, there's subtle changes to these computations in attention, and it changes the architecture. but serving long contexts is extremely memory constrained, especially when you're
making a lot of predictions. I actually don't know why output tokens are more expensive than input tokens, but I think essentially, with output tokens you have to do more computation, because you have to sample from the model. I can
explain that. so today, if you use a model, you look at an API, OpenAI charges a certain price per million tokens, and that price for input and output tokens is different. and the reason is that when you're inputting a query into the model, let's say you have a book, that book, you must now calculate the entire KV cache for, this key-value cache. and when you do that, that is a parallel operation, all of the tokens can be processed at one time, and therefore you can dramatically reduce how much you're spending. the flop requirements for generating a token and an input token are identical, right, if I input one token or if I generate one token, it's completely identical, I have to go through the model. but the difference is that I can do that input, i.e. the prefill, i.e. the prompt, simultaneously, in a batch nature, and therefore it is all flops. I think the pricing model they mostly use is that input tokens are about 1/4 the price of the output tokens. correct. but then output tokens, the reason why it's so expensive is because I can't do it in parallel, right, it's autoregressive. every time I generate a token, I must not only read the whole entire model into memory and activate it, calculate it, to generate the next token, I also have to read the entire KV cache. and I generate a token, and I append that one token I generated and its KV cache, and then I do it again.
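that prefill-versus-decode asymmetry can be sketched in a few lines of numpy. the dimensions and random weights here are stand-ins, this shows the control flow, not a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                     # toy hidden size (illustrative)
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def prefill(prompt_embeds):
    # prefill: one parallel pass computes K and V for every prompt token at once
    return prompt_embeds @ Wk, prompt_embeds @ Wv

def decode_step(K, V, new_embed):
    # decode: one token at a time; keep the cache resident, append one new entry
    K = np.vstack([K, new_embed @ Wk])
    V = np.vstack([V, new_embed @ Wv])
    return K, V

prompt = rng.normal(size=(5, d))          # 5 prompt tokens, cached in one shot
K, V = prefill(prompt)
for _ in range(3):                        # 3 generated tokens, one pass each
    K, V = decode_step(K, V, rng.normal(size=(1, d)))
print(K.shape)                            # (8, 8): 5 prefill + 3 decoded tokens
```

the prompt's cache entries come out of one batched matmul, while every generated token takes its own sequential pass and appends one row, which is why input and output tokens end up priced so differently.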
and so therefore this is a non-parallel operation, and this is one where, in the case of prefill or prompt, you pull the whole model in and you calculate 20,000 tokens at once. these are features that APIs are shipping, like prompt caching and prefilling, because you can drive prices down and you can make APIs much faster. if you run a business and you're going to keep passing the same initial content to Claude's API, you can load that into the Anthropic API and always keep it there. but it's very different than, we're kind of leading to the reasoning models, which we talked about, we showed this example earlier and read some of this kind of mumbling stuff, and
what happens is that the output context length is so much higher. I learned a lot about this from Dylan's work, which is essentially, as the output length gets higher, you're hitting this quadratic in terms of memory used, and then on the GPUs that we have, effectively you're going to run out of memory. and they're all trying to serve multiple requests at once, so doing this batch processing, where not all of the prompts are exactly the same, is really complex handling. and then as context lengths get longer, there's this, I think you call it a critical batch size, where your ability to serve more users, so how much you can parallelize your inference, plummets because of this long context. so your memory usage is going way up with these reasoning models, and you still have a lot of users, so effectively the cost to serve multiplies by a ton.
and we're looking at a plot where the x-axis is sequence length, i.e. how many tokens are being generated or in the prompt. so if I put in a book, that's a million tokens, but if I put in "the sky is blue", then that's like six tokens or whatever. we should say that what we're calling reasoning and Chain of Thought is extending this sequence length. it's mostly output. so
before, you know, 3 months ago, whenever o1 launched, all of the use cases for long context length were like, let me put a ton of documents in and then get an answer out. and it's a single, you know, prefill, compute a lot in parallel, and then output a little bit. now with reasoning and agents, this is a very different idea. now instead, I might only have, like, hey, do this task, or I might have all these documents, but at the end of the day, the model is not just producing a little bit, it's producing tons of information, this Chain of Thought just continues to go and go and go. and so the sequence length is effectively, if it's generated 10,000 tokens, it's 10,000 sequence length, plus whatever you input in the prompt. and so what this chart is showing, and it's a logarithmic chart, is, as you grow from 1K to 4K, or 4K to 16K, the memory requirements grow so fast for your KV cache that you end up not being able to run a certain number of, your sequence length is capped, or the number of users you can serve. let's say the model, so this is showing for a 405B model at batch
size 64, Llama 3.1 405B. yeah. and batch size is crucial, essentially you want to have a higher batch size to parallelize your throughput across 64 different users at once, right.
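a rough per-user KV cache sketch shows why that batch collides with memory. the shapes here are an assumption, roughly Llama-3.1-405B-like (126 layers, 8 grouped KV heads of dim 128, fp16 cache), so treat the numbers as illustrative, not a measurement of any serving stack:

```python
def kv_cache_gib(seq_len: int, n_layers: int = 126, n_kv_heads: int = 8,
                 head_dim: int = 128, dtype_bytes: int = 2) -> float:
    # per token you store one K and one V vector per layer per KV head
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return seq_len * per_token / 2**30

batch = 64
for ctx in (1_000, 4_000, 16_000, 64_000):
    per_user = kv_cache_gib(ctx)
    print(f"{ctx:>6} tokens: {per_user:6.2f} GiB/user -> "
          f"{batch * per_user:8.1f} GiB total at batch {batch}")
```

even though the cache is linear per user, long reasoning outputs times a batch of 64 quickly swamps the HBM left over after the weights themselves, which is the "critical batch size" squeeze being described.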
yeah, and therefore your serving costs are lower, because the server costs the same. this is 8 H100s at roughly $2 an hour per GPU, that's $16 an hour. that is somewhat of a fixed cost, you can do things to make it lower, of course, but it's like $16 an hour. now, how many users can you serve, how many tokens can you generate, and then you divide the two, and that's your cost.
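that division is simple enough to sketch. the throughput number here is purely an assumption, to show the shape of the math rather than any real deployment:

```python
gpus, price_per_gpu_hour = 8, 2.00            # the 8x H100 node from above
server_per_hour = gpus * price_per_gpu_hour   # $16/hour, roughly fixed
tokens_per_second = 2_000                     # assumed aggregate output throughput
tokens_per_hour = tokens_per_second * 3_600

cost_per_million = server_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.2f} per million output tokens")

# halve the throughput (e.g. long reasoning chains hogging memory) and the
# cost per million tokens doubles, on the exact same $16/hour server
```

the server cost is fixed, so everything that cuts tokens per hour, like KV cache pressure from reasoning traces, flows straight into cost per token.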
right. and so with reasoning models, this is where a lot of the complexity comes about, and why memory is so important. because if you have limited amounts of memory, then you can't serve so many users. if you have limited amounts of memory, your serving speeds get lower, and so your costs get a lot worse. because all of a sudden, if I was used to, hey, on this $16 an hour server I'm serving Llama 405B, or I'm serving DeepSeek V3, and it's all chat-style applications, i.e. we're just chatting, the sequence lengths are a few thousand. when you use a language model, it's a few thousand context length most of the time. sometimes you're dropping a big document, but then you process it, you get your answer, you throw it away, you move on to the next thing. whereas with reasoning, I'm now generating tens of thousands of tokens in sequence, and so this memory, this KV cache, has to stay resident, you have to keep loading it, you have to keep it in memory constantly. and now this bumps out other users. if there's now a reasoning task, and the model is capable of reasoning, then all of a sudden that memory pressure means that I can't serve as many users simultaneously. let's go into
DeepSeek again. so we're in the post DeepSeek R1 time, I think, and there's two sides to this market watching how hard it is to serve. on one side we're going to talk about DeepSeek themselves. they now have a chat app that got to number one on the App Store. disclaimer, number one on the App Store is measured by velocity, so it's not necessarily saying that more people have the DeepSeek app than the ChatGPT app, but it is still remarkable. Claude has never hit number one in the App Store, even though everyone in San Francisco is like, oh my god, you gotta use Claude, don't use ChatGPT. so DeepSeek hit this. they also launched an API product recently where you can ping their API and get these super long responses for R1 out. at the same time as these are out, we'll get to what's happened to them, because the model weights for DeepSeek R1 are openly available, and the license is very friendly, the MIT license, commercially available. all of these midsize companies and big companies are trying to be first to serve R1 to their users. we are trying to evaluate R1, because we have really similar research going on, we released a model and we're trying to compare to it. and out of all the companies that are quote unquote serving R1, and they're doing it at prices that are way higher than the DeepSeek API, most of them barely work, and the throughput is really
low. to give context, right, one of the parts of everyone freaking out was that China reached capabilities. the other aspect is they did it so cheap. and the so cheap, we kind of talked about on the training side why it was so cheap, let's talk about why it's so cheap on the inference. it works well and it's cheap. why is R1 so damn cheap? so I think
there's a couple factors here. one is that they do have model architecture innovations. this MLA, this new attention that they've done, is different than the attention from Attention Is All You Need, the Transformer attention. now, others have already innovated. there's a lot of work like MQA, GQA, local-global, all these different innovations that try to bend the curve. it's still quadratic, but the constant is now smaller. related to our previous discussion, this multi-head latent attention can save about 80 to 90% in memory from the attention mechanism, which helps especially with long context. it's 80 to 90% versus the original, but then versus what people are actually doing, it's still an innovation. this 80 to 90% doesn't say that the whole model is 80 to 90% cheaper, just this one part of it. well, and not just that, other people have implemented techniques like local-global, sliding window, and GQA, MQA, but anyways, DeepSeek's attention mechanism is a true architectural innovation. they did tons of experimentation, and this dramatically reduces the memory pressure. it's still there, it's still attention, it's still quadratic, it's just dramatically reduced relative to prior forms. all right, that's
the memory pressure. I should say, in case people don't know, R1 is 27 times cheaper than o1. we think that OpenAI had a large margin built in. okay, so there's multiple factors, we should break down the factors. I think it's two bucks per million token output for R1, and $60 per million token output for o1. yeah, let's look at
this. so I think this is very important. OpenAI is, you know, that drastic gap between DeepSeek and OpenAI pricing, but DeepSeek is offering the same model, because they open-weighted it to everyone else, for a very similar, much lower price than what others are able to serve it for. so there's two factors here. their model is cheaper, it is 27 times cheaper, I don't remember the number exactly off the top of my head. so we're looking at a graphic that's showing different places serving V3, DeepSeek V3, which is similar to DeepSeek R1, and there's a vast difference in serving cost. and what explains that difference? part of it is OpenAI has a fantastic margin. when they're doing inference, their gross margins are north of 75%. so that's a four to 5x factor right there of the cost difference, OpenAI is just making crazy amounts of money because they're the only one with the capability. do they need that money? are they using it for R&D? they're losing money, obviously, as a company, because they spend so much on training. so the inference itself is a very high margin, but it doesn't recoup the cost of everything else they're doing. okay, so
yes, they need that money, because the revenue and margins pay for continuing to build the next thing, alongside raising more money. so the suggestion is that DeepSeek is, like, really bleeding out money? well, here's one thing, we'll get to this in a second, but DeepSeek doesn't have any capacity to actually serve the model. they stopped signups, the ability to use it is like non-existent now for most people, because so many people are trying to use it. they just don't have the GPUs to serve it. OpenAI has hundreds of thousands of GPUs between them and Microsoft to serve their models. DeepSeek has a factor much lower. even if you believe our research, which is 50,000 GPUs, and a portion of those are for research, a portion of those are for the hedge fund, they still have nowhere close to the GPU volumes and capacity to serve the model at scale. so it is cheaper. a part of that is OpenAI making a ton of money. is DeepSeek making money on their API? unknown. I don't actually think so. and part of
that is this chart. look at all the other providers. Together AI, Fireworks AI are very high-end companies, right, ex-Meta. Together AI has Tri Dao, the inventor of FlashAttention, which is a huge efficiency technique. they're very efficient, good companies, and I do know those companies make money, not tons of money on inference, but they make money. and so they're serving at like a 5 to 7x difference in cost. and so now when you equate, okay, OpenAI is making tons of money, that's like a 5x difference, and the companies that are trying to make money on this model, it's like a 5x difference, there is still a gap. there's still a gap, and that is just DeepSeek being really freaking good. the model architecture, MLA, the way they did all these things, there is legitimate just efficiency difference. all their low-level libraries that we talked about in training, some of them probably translate to inference, and those weren't released. so we may go a
bit into conspiracy land, but is it possible the Chinese government is subsidizing DeepSeek? I actually don't think they are. I think when you look at the Chinese labs, there's, Huawei has a lab, Moonshot AI, there's a couple other labs out there that are really close with the government, and then there's labs like Alibaba and DeepSeek which are not close with the government. and we talked about this, the CEO, this reverent figure, who's quite different, who sounds awesome, very different viewpoints, based on the Chinese interviews that are translated, than what the CCP might necessarily want. now, to be clear, does he have a loss leader, because he can fund it through his hedge fund? yeah, sure. so the hedge fund might be subsidizing it? yes, I mean, they absolutely did, because DeepSeek has not raised much money. they're now trying to raise a round in China, but they have not raised money historically, it's all just been funded by the hedge fund, and he owns like over half the company, like 50, 60% of the company is owned by him.
in some of the interviews, there's discussion on how doing this is a recruiting tool. you see this at the American companies too. it's like, having GPUs, recruiting tool. being at the cutting edge of AI, recruiting tool. open sourcing, so much talent. they were so far behind, and they got so much talent, because they just open source stuff. more conspiracy thoughts. is it possible, since they're a hedge fund, that they timed everything with this release and the pricing, and they shorted Nvidia stock and stock of USA companies, and released it with Stargate, like, just perfect timing to be able to make money? like, they released it on inauguration day. they know what is on the international calendar. but, I mean, I
don't expect them to, if you listen to their motivations for AI. like, they released V3 on December 26th. who releases the day after Christmas? no one looks, right. they had released the papers before this, the V3 paper and the R1 paper, so people had been looking at it and like, wow. and then they just released the R1 model. I think they're just shipping as fast as they can, and like, who cares about Christmas, who cares about, you know, get it out before Chinese New Year, obviously, which just happened. I don't think they actually were timing the market or trying to make the biggest splash possible, I think they're just shipping. I think that's one of their big advantages. we know that a
lot of the American companies are very invested in safety, and that is the central culture of a place like Anthropic. and I think Anthropic sounds like a wonderful place to work, but if safety is your number one goal, it takes way longer to get artifacts out. that's why Anthropic is not open sourcing things, that's their claims, but there's reviews internally, Anthropic mentions things to international governments, there's been news of how Anthropic has done pre-release testing with the UK Safety Institute. all of these things add inertia to the process of getting things out, and we're on this trend line where the progress is very high. so if you reduce the time from when your model is done training, you run evals, that's good, you want to get it out as soon as possible to maximize the perceived quality of your outputs. DeepSeek does this so well. Dario explicitly said Claude 3.5 Sonnet was trained like 9 to 10 months ago, and I think it took them another handful of months to release it. so there is a significant gap here,
and especially with reasoning models, the word on the San Francisco street is that Anthropic has a better model than o3, and they won't release it. why? because chains of thought are scary, and they are legitimately scary. if you look at R1, it flips back and forth between Chinese and English, sometimes it's gibberish, and then the right answer comes out. and for you and I, it's like, great. this is why people are infatuated, it's like, you're telling me this is a high value thing, and it works, and it's doing this? it's amazing. I mean, you talked
about that sort of Chain of Thought for that philosophical thing, which is not something they trained it to be, philosophically good, it's just sort of an artifact of the Chain of Thought training it did. but that's super important, in that, like, can I inspect your mind and what you're thinking right now? no. and so I don't know if you're lying to my face, and Chain of Thought models are that way. this is a true quote unquote risk between, you know, a chat application, where, hey, I asked the model to say bad words or whatever, or how to make anthrax, and it tells me, that's unsafe. sure, but that's something I can get out relatively easily. what if I tell the AI to do a task, and then it does the task, all of a sudden, randomly, in a way that I don't want? and now, a task versus a response is very different, so the bar for safety is much higher, at least this is Anthropic's case. for DeepSeek, they're like, ship. right. yeah, so I mean, the bar
for safety is probably lowered a bit because of DeepSeek. I mean, there's parallels here to the Space Race. the reason the Soviets probably put a man in space first is because their bar for safety was lower, and they killed that dog, and all these things, so it's like less risk averse than the US space program. and there's parallels here, but there's probably going to be downward pressure on that safety bar for the US companies. this is something that Dario talks about, that's the situation that Dario wants to avoid. Dario talks about the difference between race to the bottom and race to the top, and the race to the top is where there's a very high standard on safety, there's a very high standard on how your model performs on certain crucial evaluations, and when certain companies are really good at it, they will converge.
this is the idea. and ultimately, AI is not confined to one nationality or to one set of morals for what it should mean, and there's a lot of arguments on, should we stop open sourcing models? and if the US stops, it's pretty clear, I mean, it's way easier to see now with DeepSeek, that a different international body will be the one that builds it. we talk about the cost of training, DeepSeek has this shocking 5 million dollar number. think about how many entities in the world can afford 100 times that, to have the best open source model that people use in the world. it's a scary reality, which is that these open models are probably going to keep coming for the time being, whether or not we want to stop them. and stopping them might make it even worse and harder to prepare, but it just means that the preparation, and understanding what AI can do, is just so much more important. that's why I'm here at the end of the day. but it's like, letting that sink into people, especially not in AI, that this is coming, there are some structural things in a globally interconnected world that you have to accept. yeah, you mentioned, you sent me
something that Mark Zuckerberg mentioned on an earnings call. he said, "I think in light of some of the recent news, the new competitor DeepSeek from China, I think it's one of the things that we're talking about, is there's going to be an open source standard globally, and I think for our kind of national advantage, it's important that it's an American standard. so we take that seriously, we want to build the AI system that people around the world are using, and I think that, if anything, some of the recent news has only strengthened our conviction that this is the right thing to be focused on." so, yeah, open sourcing. yeah, Mark Zuckerberg is not new to having American values and how he presents his company's trajectory. I mean, their products have long since been banned in China, and I respect him saying it directly. and there's an interesting
aspect of, just because it's open weights or open source doesn't mean it can't be subverted. there have been many open source software bugs, for example, there was a Linux bug that was found after like 10 years which was clearly a back door, because somebody was like, why is this taking half a second, a recent one, right, why is it taking half a second to load, and it was like, oh crap, there's a back door here, that's why. this is very much possible with AI models. today, the alignment of these models is very clear. like, I'm not going to say bad words, I'm not going to teach you how to make anthrax, I'm not going to talk about Tiananmen Square, I'm going to say Taiwan is just an eastern province. like, all these things are, depending on who you are, what you align, and even xAI is aligned a certain way. they might,
it's not aligned in the, like, woke sense, it's not aligned in the other sense, but there are certain things that are imbued within the model. now, when you release this publicly, in an instruct model that's open weights, this can then proliferate. but as these systems get more and more capable, what you can embed deep down in the model is not as clear, and so that is one of the big fears. if an American model or a Chinese model is the top model, you're going to embed things that are unclear. and it could be unintentional too, like, British English is dead because American LLMs won, and the internet is American, and therefore color is spelled the way Americans spell it, a lot of these words. now, this is just a factual nature of the LLMs. like Karpathy said, English is the hottest new programming language, and that English is defined by a bunch of companies that primarily are in San Francisco. the right way to spell
optimization is with a z, just in case people are curious. I think it's an s in British English. it is. taking something silly, right, something as silly as the spelling, which Brits and Americans will laugh about, probably. I don't think we care that much, but some people will. but this can boil down into very, very important topics, like, hey, subverting people, chat bots. Character AI has shown that they can talk to kids, or adults, and people will feel a certain way, and that's unintentional alignment. but what happens when there's an intentional alignment deep down in the open source standard? it's a back door today, for Linux, that we discover, or some
encryption system. China uses different encryption than NIST defines, the US NIST, because at least they think there's back doors in it. what happens when the models are back doors, not just to computer systems, but to our minds? yeah, they're
cultural back doors. the thing that amplifies the relevance of culture with language models is that we are used to this mode of interacting with people in back and forth conversation, and we now have a very powerful computer system that slots into a social context that we're used to, which makes people very, we don't know the extent to which people can be impacted by that. so there could be,
this is an actual concern with a Chinese company that is providing open weights models, that there could be some secret Chinese government sort of requirement for these models to have a certain kind of back door, to have some kind of thing where, I don't necessarily think it'll be a back door, because once it's open weights it doesn't phone home. it's more about, if it recognizes a certain system. it could be a back door in the sense of, hey, if you're building something in software, all of a sudden it's a software agent, oh, program this back door that only we know about. or it could be, subvert the mind to think that XYZ opinion is the correct one,
and Anthropic has research on this, where they show that if you put certain phrases in at pre-training, you can then elicit different behavior when you're actually using the model, because they've poisoned the pre-training data. as of now, I don't think anybody in a production system is trying to do anything like this. I think it's mostly, Anthropic is doing very direct work, and mostly just subtle things of, we don't know how these models are going to generate tokens, what information they're going to represent, and what the complex representations they have are. well, one
of the things, we're talking about Anthropic, which is generally just permeated with good humans trying to do good in the world. we just don't know of any labs, this would be done in a military context, that are explicitly trained to, okay, how can we make the front door look like a happy LLM, but underneath it's a thing that will, over time, do the maximum amount of damage to our quote unquote enemies.
there's this very good quote from Sam Altman, who, you know, he can be a hype beast sometimes, but one of the things he said, and I think I agree, is that superhuman persuasion will happen before superhuman intelligence. and if that's the case, then before we get this AGI, ASI stuff, we can embed superhuman persuasion towards our ideal, or whatever the ideal of the model is. and again, today, I truly don't believe DeepSeek has done this, but it is a sign of what could happen. so one of the dystopian
worlds is described by Brave New World. so we could just be stuck scrolling Instagram, looking at cute puppies, or worse, and then talking to bots that are giving us a narrative, and we completely get lost in that world that's controlled by somebody else, versus thinking independently. and that's a major concern as we rely more and more on these kinds of systems. I mean, we've already seen this with recommendation systems. yeah, recommendation systems hack the dopamine-induced reward circuit, but the brain is a lot more complicated. and what other sort of circuits, quote unquote feedback loops, in your brain can you hack slash subvert? in ways, like, recommendation systems are purely just trying to increase time in ads, etc., but there's so many more goals that can be achieved through these complicated models. there's no reason, in
some number of years, that you can't train a language model to maximize time spent on a chat app. like, right now they are trained, I mean, is that not what Character AI has done? their time per session is like two hours. yeah, Character AI very likely could be optimizing this. where, the way that this data is collected is naive, where you're presented a few options and you choose them, but that's not the only way that these models are going to be trained. it's naive stuff, like, talk to an anime girl, but it can be, yeah, this is a risk. right, like, it's a bit of a
cliche thing to say but I've uh over the
past year had a few stretches of time
where I didn't use social media or the
internet at all and just read books and
was out in nature and it clearly
has an effect on the mind where like
I feel like I'm returning
of course I was uh raised before the
internet really took off but I'm
returning to some
more I know where you're going I mean
you can see it physiologically like I
take three days if I'm like backpacking
or something and you're
breaking down addiction cycles I
feel like I'm more in control of my mind
there feels like a sovereignty of
intelligence that's happening when I'm
disconnected from the internet I think
um the more I use the the internet and
social media the more other people are
controlling my mind that's definitely a
feeling and then in the future that will
be not other people but algorithms or
other people presented to me via
algorithms there I mean there are
already tons of AI bots on the internet
and right now it's not frequent
but every so often I have replied to one
and it instantly replies I'm like
crap that was a bot and that is just
going to become more common like they're
going to get good one of the hilarious
things about technology over its history
is that the uh illicit adult
entertainment industry has always adopted
technologies first right whether it was
like video streaming um to where
you know there's now the sort
of like independent adult illicit content
creators uh who have their you know
subscription pages and there they
actually heavily utilize uh you know
generative AI, it's already been like
diffusion models and all that is huge
there but now these
subscription based individual creators
do use Bots to approximate themselves
and chat with their you know whales
people pay a lot for it and people pay a
lot right it's a lot of times it's them
but a lot of there are agencies that do
this for these creators and do it like
on a like Mass scale so the largest
creators are like able to talk to
hundreds or thousands of like people at
a time because of these Bots and so it's
it's already being used there obviously
you know like video streaming and and
other technologies have gone there first
it's going to come to the rest of
society too there's a general concern
that models get censored by the
companies that deploy them so one case
when we've seen that and maybe
censorship is one word
alignment maybe via RLHF or some other
way is another word so we saw
that with black Nazi image generation
with uh Gemini
uh as you mentioned we also see that uh
with Chinese models refusing to answer
what
happened on June 4th 1989 at Tiananmen
Square so how can this be avoided and
maybe can you just in general talk about
how this happens and how can it be
avoided you give multiple examples um
there's
probably a few things to keep in mind
here one is the kind of Tiananmen Square
factual knowledge like
how does that get embedded into the
models two is the Gemini what you call
the black Nazi incident which is when
Gemini as a system had this extra thing
put into it that dramatically changed
the behavior and then three is what most
people would call general alignment RLHF
post-training um each of these have very
different scopes in how they are
applied. if you just
look at the model weights, in order to
audit specific facts is extremely hard
because you have to comb through the
pre-training data and look at all of
this and then that's terabytes of files
and look for very specific words or
hints of the words so I I guess one way
to say it is that you can insert
censorship or alignment at various
stages in the pipeline and what you
refer to now is at the very beginning of
the data selection so if you want to get
rid of facts in a model you have to do
it at every stage you have to do it at
the pre-training so most people think
that pre-training is where most of the
knowledge is put into the model and then
you can elicit and move that in
different ways whether through
post-training or whether through systems
afterwards this is where the whole like
hacking models comes from right like GPT
will not tell you how to make Anthrax
but if you try really really hard you
can eventually get it to tell you about
anthrax because they didn't filter it
from the pre-training data set right but
by the way removing facts has such an
ominous dark feel to it I almost think
it's practically impossible because you
effectively have to remove them from the
internet did
they remove the thing from
the subreddits it gets
filtered out right so you have quality
filters which are small language models
that look at a document and tell you
like how good is this text is it close
to a Wikipedia article which is a good
thing that we want language models to be
able to imitate so couldn't you do a
small language model that filters
mentions of Tiananmen Square in the data yes
but is it going to catch wordplay or
encoded language people have been memeing in
like games and other stuff on how to
say things without saying Tiananmen Square
um so there's always
like different ways to do it
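the kind of quality filter just described — a small model scoring documents during pre-training data curation — can be sketched like this (the scoring heuristic, blocklist, and threshold are hypothetical stand-ins, not any lab's actual pipeline):

```python
# Sketch of a pre-training data quality/content filter.
# The scoring function is a toy heuristic standing in for a
# small trained classifier that rates "how close to Wikipedia".

BLOCKLIST = {"forbidden topic"}   # hypothetical filtered phrases
QUALITY_THRESHOLD = 0.5           # hypothetical cutoff

def quality_score(doc: str) -> float:
    """Toy proxy for document quality: longer, punctuated text scores higher."""
    words = doc.split()
    if not words:
        return 0.0
    return min(1.0, len(words) / 100) * (0.5 + 0.5 * ("." in doc))

def keep(doc: str) -> bool:
    """Drop documents that match the blocklist or score below threshold."""
    lowered = doc.lower()
    if any(phrase in lowered for phrase in BLOCKLIST):
        return False
    return quality_score(doc) >= QUALITY_THRESHOLD

docs = ["short", "A long, well punctuated article. " * 10]
kept = [d for d in docs if keep(d)]
```

as the conversation notes, an exact-match filter like this is easy to evade with coded language, which is why removing a fact from a model this way is so hard in practice.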
the internet as a whole does tend to
just have a slight left bias right
because it's always been richer more
affluent uh younger people on the
internet relative to the rest of the
population so there is already
inherently a slight left bias right on
the internet and so how do you filter
things that are this complicated right
is it like and and some of these can be
like you know factual non-factual but
like tan square is obviously the example
of a factual but it gets a lot harder
when you're talking about aligning to an
ideal right um which yeah and so Grok
for example right Elon's tried really
hard to make the model not be super PC
and woke but the best way to do
pre-training is to throw the whole
freaking internet at it right and then
later figure out but then at the end of
the day the model at its core now still
has some of these ideals right you still
ingested Reddit r/politics which is
probably the largest political
discussion board in the world that's
freely available to scrape and guess
what that's left leaning right um and so
um you know there are some aspects like
that that you just can't censor unless
you try really really really really
really hard so the base model will
always have some TDS, Trump derangement
syndrome, because it's trained on so much
it'll have the ability to express it but
what if
there's a wide
representation in the data this is what
happens it's like a lot of what is
called post training is a series of
techniques to get the model on Rails of
a really specific behavior uh and I mean
it's it's like you can you also have the
ingested data of like Twitter or like
Reddit r/The_Donald which is like also
super pro-Trump right and then you have
like fascist subreddits or like you have
communist subreddits so you the model in
pre-training ingests everything it has
no world view now it does have like some
some skew because more of the text is
skewed a certain way uh which is general
like slight left like but also like you
know somewhat like you know intellectual
somewhat like you know it's just like
the general internet is a certain way
mhm and then and then as as as Nathan's
about to describe eloquently right like
you can you can elicit certain things
out and there's a lot of history here so
we can go through multiple examples and
what happened with Llama 2. it was a launch where
the phrase like too much RLHF or too
much safety was everywhere, that was
the whole narrative after Llama 2's chat
models released and the examples are
sorts of things like you would ask Llama 2
chat how do you kill a Python process
and it would say I can't talk about
killing because that's a bad thing and
anyone that is trying to design an AI
model will probably agree that that's
just like ah model you messed up a bit on
the training there I don't think they
meant to do this but this was in the
model weights so this is not, it
didn't necessarily need to be. there's things
called system prompts which are when
you're querying a model, a piece of
text that is shown to the model but not
to the user so a fun example is your
system prompt could be Talk Like a
Pirate so no matter what the user says
to the model it'll respond like a pirate
in practice what they are is you are a
helpful assistant you should break down
problems if you don't know about
something don't tell them, your date cutoff
is this, today's date is this, it's
a lot of really useful context for how
you can answer a question well
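a system prompt as described can be sketched as a chat-style message list (the role/content dicts below follow the common chat-API convention; no actual model call is made, and the cutoff dates are made up):

```python
# Sketch of how a system prompt is attached to a user query.
# The system message is seen by the model but not shown to the user.

def build_messages(system_prompt: str, user_query: str) -> list[dict]:
    """Assemble a chat request in the common role/content format."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ]

# the fun example: the model will answer everything in pirate voice
pirate = build_messages("Talk like a pirate.", "How do I sort a list?")

# the practical example: helpful-assistant framing plus useful context
assistant = build_messages(
    "You are a helpful assistant. Break down problems step by step. "
    "If you don't know something, say so. "
    "Knowledge cutoff: 2023-10. Today's date: 2025-01-31.",  # hypothetical dates
    "How do I sort a list?",
)
```

the same user query lands in both payloads; only the hidden system message differs, which is exactly the nudge being described here.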
anthropic publishes their system prompts
which I think is great and there's a lot
of research that goes into this and one
of your previous guests Amanda Askell is
like probably the most knowledgeable
person at least in the combination of
execution and sharing she's the person
that should talk about system prompts
and character of models yeah and then
people should read these system prompts
cuz you're you're like trying to nudge
sometimes through extreme politeness the
model to be a certain way and you could
use this for bad things I we've done
tests which is what if I tell the model
to be a dumb model like which evaluation
scores go down and it's like we'll have
this Behavior where it could sometimes
like say I'm supposed to be dumb and
sometimes it's like it doesn't affect
like math abilities as much but
something like, if you're trying it, the
quality by a human judgment
would drop through the floor let's go
back to post training specifically rhf
around llama 2 was it was too much too
much safety prioritization was baked
into the model weights this makes you
refuse things in a really annoying way
for users it's not great it caused a lot
of um like awareness to be attached to
RLHF that it makes the models dumb and it
stigmatized the word in AI
culture and as the techniques have
evolved that's no longer the case where
all of these Labs have very fine grain
control over what they get out of the
models through techniques like RLHF
although although different labs are
definitely different levels like on the
on one end of the spectrum is Google um
and then like maybe OpenAI does less and
anthropic does less um and then like on
the other end of the spectrum is like
xAI but they all have different forms of
RLHF trying to make them a certain way
and they like the important thing to say
is that no matter how you want the model
to behave these RLHF and preference
tuning techniques also improve
performance so on things like math
evals and code evals there is something
innate to these what is called
contrastive loss functions we could
start to get into RL here we don't
really need to but RLHF also boosts
performance on anything from a chat task
to a math problem to a code problem so
it is becoming a much more useful tool
to these Labs so this kind of takes us
through the Arc of we've talked about
pre-training hard to of things we've
talked about post training and how post
training if you you can mess it up it's
it's a complex multifaceted optimization
with 10 to 100 person teams converging
at one artifact it's really easy to not
do it perfectly and then there's the
third case which is what we talked about
Gemini the thing about Gemini
is this was a served product where
Gemini Google has their internal model
weights they've done all these processes
that we talked about and in the served
product what came out after this was
that they had a prompt that they were
rewriting user queries to boost
diversity or something and this just
made it the outputs were just blatantly
wrong it was a some sort of
organizational failure that had this
prompt in that position and I think
Google Executives probably have owned
this I don't pay that attention that
detail but it was just a mess up in
execution that led to this ridiculous
thing but at the system level the model
weights might have been fine so at the
very end of the pipeline there was a
rewriting to something like a system
prompt it was like the system prompt or
what is called in industry prompt rewriting, you
rewrite prompts so especially for image
models if you're using DALL-E or ChatGPT
can generate you an image you'll say
draw me a beautiful car with these
leading image models they benefit from
highly descriptive prompts so what would
happen is if you do that on chat GPT a
language model behind the scenes will
rewrite the prompt say make this more
descriptive and then that is passed to
the image model so prompt writing is
something that is used at multiple
levels of industry and it's used
effectively for image models and the
Gemini example is just a failed
execution big philosophical question
here with
RLHF to generalize where is human
input human in the loop human data most
useful at the current stage for the past
few years the highest cost human data
has been in these preferences which is
comparing, I would say highest cost and
highest total usage, so a lot of money
has gone to these pairwise comparisons where
you have two model outputs and a human
is comparing between the two of them.
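one pairwise preference record, and the objective commonly used to train a reward model on it, can be sketched as follows (a Bradley-Terry style loss, which many RLHF reward models use — an illustration, not necessarily any specific lab's recipe; the scores are made-up numbers):

```python
import math

# One pairwise preference record: for the same prompt, a human
# marked which of two model outputs they preferred.
preference_example = {
    "prompt": "Explain photosynthesis.",
    "chosen": "Photosynthesis converts light energy into chemical energy...",
    "rejected": "idk lol",
}

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: pushes a reward model
    to score the chosen response above the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# made-up scalar scores from a hypothetical reward model
loss_good = preference_loss(2.0, -1.0)  # already ranked correctly -> small loss
loss_bad = preference_loss(-1.0, 2.0)   # ranked the wrong way -> large loss
```

the reward model trained this way is then what the RL step optimizes against, which is why this comparison data has been the highest-cost human data.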
earlier years there was a lot of this
instruction tuning data so creating
highly specific examples to something
like a Reddit question to a domain that
you care about language models used to
struggle on math and code so you would
pay experts in math and code to come up
with questions and write detailed
answers that were used to train the
models now it is the case that there are
many model options that are way better
than humans at writing detailed and
eloquent answers for things like math
and code so they talked about this with
the Llama 3 release where they switched
to using Llama 3 405B to write their
answers for Math and code but they in
their paper talk about how they use
extensive human preference data which is
something that they haven't gotten AIs
to replace there are other techniques in
Industry like constitutional AI where
you use human data for preferences and
AI for preferences and I expect the AI
part to scale faster than the human part
but among the research that we have
access to is that humans are in this
kind of preference Loop so for uh as
reasoning becomes bigger and bigger and
bigger as we said where's the role of
humans in that it's even less prevalent
so it's the remarkable thing about these
reasoning results and especially the
Deep seek R1 paper is this result that
they call DeepSeek R1-Zero which is they
took one of these pre-trained models
they took deep seek V3 base and then
they do this reinforcement learning
optimization on verifiable questions or
verifiable rewards for a lot of
questions and a lot of training and
these reasoning behaviors emerge
naturally so these things like wait let
me see wait let me check this oh that
might be a mistake and they emerge from
only having questions and answers and
when you're using the model the part
that you look at is the completion so in
this case all of that just emerges from
this large scale RL
training and that model which the
weights are available has no human
preferences added into the post training
there are the Deep seek R1 full model
has some of this human preference tuning
this rhf after the reasoning stage but
the very remarkable thing is that you
can get these reasoning behaviors and
it's very unlikely that there's humans
writing out reasoning Chains It's very
unlikely that they somehow hacked
OpenAI and got access to o1's
reasoning chains it's something about
the pre-trained language models and this
RL training where you reward the model
for getting the question right and
therefore it's trying multiple solutions
and it emerges this chain of thought
this might be a good place to uh to
mention the uh the eloquent and the
insightful tweet of the Great and The
Powerful Andrej
Karpathy uh I think he had a bunch of
thoughts but one of them last thought
not sure if this is obvious you know
something profound is coming when you're
saying it's not sure if it's obvious
there are two major types of learning in
both children and in deep learning
there's one imitation learning watch and
repeat i.e. pre-training supervised
fine-tuning and two trial and error
learning reinforcement learning my
favorite simple example is Alpha go one
is learning by imitating expert players
two is reinforcement learning to win the
game almost every single shocking result
of deep learning and the source of all
magic is always two two is significantly
more powerful two is what surprises you
two is when the paddle learns to hit the
ball behind the blocks in Breakout two
is when AlphaGo beats even Lee Sedol and
two is the aha moment when DeepSeek
R1 or o1 etc discovers that it works
well to re-evaluate your assumptions
backtrack try something else Etc it's
the solving strategies you see this
model use in its Chain of Thought it's
how it goes back and forth thinking to
itself these thoughts are emergent three
exclamation points and this is actually
seriously incredible impressive and new
and is publicly available and documented
the model could never learn this with uh
imitation because the cognition of the
model and the cognition of the human
labeler is different the human would
never know to correctly annotate these
kinds of solving strategies and what
they should even look like they have to
be discovered during reinforcement
learning as empirical and statistically
useful towards the final outcome anyway
the alpha zero sort of uh metaphor
analogy here uh can you speak to that
the magic of the Chain of Thought that
he's referring to um I think it's good
to recap alphago and Alpha zero because
it plays nicely with these analogies
between imitation learning and learning
from scratch so Alpha go the beginning
of the process was learning from humans
where they started with the first
this is the first expert level Go player
or chess player in DeepMind's series of
models where they had some human data
and then why it is called AlphaZero
is that there was zero human data in the
loop and that change to AlphaZero made
a model that was dramatically more
powerful for DeepMind so this removal of
the human prior, the human inductive
bias makes the final system far more
powerful this we mentioned bitter lesson
hours ago and this is all aligned with
this and then there's been a lot of
discussion in language models this is
not new this goes back to the whole Q*
rumors which if you piece together the
pieces is probably the start of OpenAI
figuring out its o1 stuff when last
year in November the Q* rumors came out
there's a lot of intellectual drive to
know when is something like this going
to happen with language models because
we know these models are so powerful and
we know it has been so successful in the
past and it is a reasonable analogy that
this new type of reinforcement learning
training for reasoning models is when
the door opened to this we don't yet have
the equivalent of move 37 which is the
famous move where DeepMind's AI
playing Go stunned Lee Sedol completely
we don't have something that's that
level of focal point but that doesn't
mean that the approach to technology is
different and the impact of the general
training it's still incredibly new what
do you think that point would be, what would
be move 37 for chain of thought for
reasoning scientific discovery like when
you use this sort of reasoning problem
and it just something we fully don't
expect I think it's actually probably
simpler than that it's probably
something related to computer use or
robotics uh rather than science
Discovery um because the important
aspect here is uh models take so much
data to learn they're not sample
efficient right trillions they take the
entire web right over 10 trillion tokens
to train on right um this would take a
human thousands of years to read right a
human does not, and yet humans know
a lot of the
stuff models know, often better, right
humans are way way way more sample
efficient that is because of the
self-play right how does a baby learn
what its body is, it sticks its foot
in its mouth and it says oh this is my
body right it sticks its hand in its
mouth and it calibrates its touch on its
fingers with the most sensitive touch
thing on its tongue right it's how
babies learn um and and and and it's
just self-play over and over and over
and over again and now we have something
that is similar to that right with these
uh verifiable uh proofs right whether
it's a unit test in code or a
mathematically verifiable task
generate many traces of reasoning right
um and keep branching them out keep
branching them out and then check at the
end hey which one actually has the right
answer most of them are wrong great
these are the few that are right maybe
we use some sort of reward model outside
of this to select even the best one to
preference as well but now you've
started to get better and better at
these uh benchmarks and so you've seen
over the last six months a skyrocketing
in a lot of different benchmarks right
all math and code benchmarks are pretty
much solved except for Frontier math
which is designed to be almost questions
that aren't practical to most people because
they're like exam level open math
problem type things so it's like on the
math problems that are somewhat
reasonable which is like somewhat
complicated word problems or coding
problems it's just what Dylan is saying
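the loop Dylan describes — generate many reasoning traces, check each one against a verifiable answer, keep the ones that pass — can be sketched with a toy verifier (the "model" here is a stub random sampler, not a real language model, and the arithmetic task is made up):

```python
import random

def verify(answer: int) -> bool:
    """Verifiable reward: an exact answer check, like a unit test.
    Toy task: compute 17 * 24 (= 408)."""
    return answer == 17 * 24

def sample_trace(rng: random.Random) -> tuple[str, int]:
    """Stub for a model producing a reasoning trace plus a final answer.
    Most samples are wrong, a few are right -- a sparse reward signal."""
    answer = rng.choice([408, 398, 418, 400, 408, 500])
    return (f"... reasoning steps ... final answer: {answer}", answer)

rng = random.Random(42)  # fixed seed so the sketch is reproducible
traces = [sample_trace(rng) for _ in range(100)]

# branch out many attempts, then keep only the ones whose final answer
# passes the verifier; these become the training signal (optionally
# re-ranked by a reward model to pick the best of the survivors)
kept = [trace for trace, ans in traces if verify(ans)]
```

this only works because the check at the end is cheap and unambiguous, which is the point being made: the method lives and dies by whether the task is verifiable.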
so the thing here is that these are
only with verifiable tasks we earlier
showed an example of you know the
really interesting like what happens
when chain of thought is applied to a non-verifiable
thing it's just like a human you know
chatting right with the you know
thinking about what's novel for humans
right a unique thought uh but this task
and form of training only works
when it's verifiable um
from here the thought is okay we can
continue to scale this current Training
Method by increasing the number of
verifiable tasks um in math and coding
coding probably has a lot more to go
math has a lot less to go in terms of
what are verifiable things can I create
a solver that I then generate
trajectories toward, reasoning
traces, and then prune
the ones that don't work and keep the
ones that do work well those are going
to be solved pretty quickly but even if
you've solved math you have not actually
created intelligence right um and so
this is where I think the like aha
moment of computer use or robotics will
come in because now you have a Sandbox
or a playground that is infinitely
verifiable right did you you know
messing around on internet there are so
many actions that you can do that are
verifiable it'll start off with like log
into a website create an account click a
button here blah blah blah but it'll
then get to the point where it's hey go
do a task on Tasker or whatever these
other all these various task websites
hey go get hundreds of likes right um
and it's going to fail it's
going to spawn hundreds of accounts it's
going to fail on most of them but this
one got to a thousand, great, now you've reached
the verifiable thing and you just keep
iterating this Loop over and over and
that's when and same with robotics right
that's where you know where you have an
infinite playground of tasks like hey
did I put the ball in the bucket all the
way to like oh did I like build a car
right like you know there's a whole
trajectory to speedrun or you know what
models can do but at some point I truly
think that like you know we spawn models
and initially all the training will be
in sandboxes but then at some point you
know the language model pre-training is
going to be dwarfed by what is this
reinforcement learning you know you
you'll pre-train a multimodal model that
can see that can read that can write you
know blah blah blah whatever Vision
audio Etc but then you'll have it play
in a sandbox infinitely figure out
figure out math figure out code figure
out navigating the web figure out
operating a robot arm right and then
it'll learn so much and the aha moment I
think will be when this is available to
then create something that's not good
right like oh cool part of it was like
figuring out how to use the web now all
of a sudden it's figured out really well
how to just get hundreds of thousands of
followers that are real and real
engagement on Twitter because all of a
sudden this is one of the things that
are verifiable and maybe not just
engagement but make money yes like
become I mean that could be the thing
where almost fully automated it makes
you know $10 million by being an
influencer selling a product creating
the product like and and I I'm not
referring to like a hype product but an
actual product like holy this thing
created a
business it's running it it's the face
of the business that kind of thing May
or maybe a number one song like it
creates the whole infrastructure
required to create the song to be the
influencer that represents that song
that kind of thing it makes a lot of
that could be the move I mean this our
culture respects money in that kind of
way and it's and it's verifiable right
it's verifiable, the bank account can't lie,
exactly, there's surprising evidence that
once you set up the ways of collecting
the verifiable domain that this can work
there's been a lot of research before
this R1 on math problems and they
approach math with language models just
by increasing the number of samples so
you can just try again and again and
again and you look at the amount of
times that the language models get it
right and what we see is that even very
bad models get it right sometimes and
the whole idea behind reinforcement
learning is that you can learn from very
sparse rewards so the
space of language and the space of
tokens whether you're generating
language or tasks for a robot is so big
that you might say it's like I mean
the tokenizer for a language model
can be like 200,000 things so at
each step it can sample from that big of
a space so if it can generate a bit of a
signal that it can climb onto, that's
what the whole field of RL is about,
learning from sparse rewards, and the
same thing has played out in math where
it's like very weak models that
sometimes generate answers we see
research already that you can boost
their math scores you can do this sort
of RL training for math it might not be
as effective but if you take a 1 billion
parameter model so something 600 times
smaller than DeepSeek you can boost its
grade school math scores very directly
with a small amount of this training
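the "try again and again" observation has a simple form: if a weak model answers correctly with probability p per independent sample, the chance that at least one of k samples is right is 1 - (1 - p)^k, which climbs quickly even for tiny p (the numbers below are purely illustrative):

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct,
    given a per-sample success probability p."""
    return 1 - (1 - p) ** k

# a 'very bad' model that is right only 2% of the time per sample
p = 0.02
single = pass_at_k(p, 1)     # one attempt: still just 2%
hundred = pass_at_k(p, 100)  # a hundred attempts: most of the time it hits
```

even a rare success is a signal RL can climb on, which is why very weak models can still have their math scores boosted by this kind of training.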
it's not to say that this is coming soon
setting up the verification domains is
extremely hard and there's a lot of
nuance in this but there are some basic
things that we have seen before where
it's
like it's at least expectable that
there's a domain and there's a chance
that this works all right so we have fun
things happening in real time this is a
good opportunity to talk about other
reasoning models o1 and o3
just now OpenAI as perhaps expected
released o3-mini what are we expecting
from the different flavors can you just
lay out the different flavors of um the
old models and the from Gemini the
reasoning model something I would say
about these reasoning models is we
talked a lot about reasoning training on
math and code and what is done is that
you have the base model we've talked
about a lot on the internet you do this
large scale reasoning training with
reinforcement learning and then what the
DeepSeek paper detailed in this R1 paper
which for me is one of the big open
questions on how do you do this is that
they did reasoning heavy but very
standard post-training techniques after
the large scale reasoning RL so they did
the same things with a form of
instruction tuning through rejection
sampling which is essentially heavily
filtered instruction tuning with some
reward models and then they did this RLHF
but they made it math heavy so some of
this transfer we looked at this
philosophical example early on the one
of the big open questions is how much
does this transfer if we bring in
domains after the reasoning training are
all the models going to become
eloquent writers by reasoning, is this
philosophy stuff going to be opened up, we
don't know in the research of how much
this will transfer there's other things
about how we can make soft verifiers and
things like this but there is more
training after reasoning which makes it
easier to use these reasoning models and
that's what we're using right now so
we're going to talk about with
o3-mini and o1 like these have gone through
these extra techniques that are designed
for human preferences after being
trained to elicit reasoning I think I
think one of the things that you know
people are ignoring is Google's Gemini
flash thinking is both cheaper than R1
and better and they released it in
the beginning of December and nobody's
talking about it, no one cares. it has a
different flavor to it, its behavior is
less expressive than something like o1,
it has fewer of these tics. Qwen
released a model last fall,
QwQ, which was their preview reasoning
model, and DeepSeek had R1-Lite last
fall, where these models kind of felt
like they're on Rails where they really
really only can do math and code and o1
is it can answer anything it might not
be perfect for some tasks but it's
flexible it has some richness to it and
this is kind of the part of like how
cook like is a Model A little bit
undercooked it's like it's good to get a
model out the door but it's hard to
gauge and it takes a lot of taste to be
like is this a full-fledged model can I
use this for everything they're probably
more similar for Math and code my quick
read is that Gemini flash is like not
trained the same way as o1 but taking an
existing training stack adding reasoning
to it so taking a more normal training
stack and adding reasoning to it and I'm
sure they're going to have more I mean
they've done quick releases on Gemini
flash so reasoning and this is the
second version from the holidays it's
evolving fast
and it takes longer to make this
training stack where you're doing this
large scale the same question from uh
earlier uh the one about the the human
nature yeah what was the human nature
one uh the reason I can
ramble about this so much is that we've
been working on this at Ai2 before o1
was fully available to everyone and
before R1 which is essentially using
this RL training for fine tuning we use
this in our like Tulu series of models
and you can elicit the same behaviors
where you say like wait and so and so on
but it's so late in the training process
that this kind of reasoning expression
is much lighter so there's
essentially a gradation, and
just how much of this RL training you
put into it determines how the output
looks so uh we're now using Gemini 2.0
flash thinking experimental
1-21. it summarized the prompt as humans
self-domesticated
apes. okay all right so wait is this
revealing the reasoning here's why
this is novel okay uh click to
expand okay analyze the request novel is
the keyword like see how it just looks a
little different it looks like a normal
output yeah it's I mean in some sense is
better structured it makes more sense
and when it latched onto human and then
it went into organisms and oh wow apex
predator focus on
domestication apply domestication to
humans explore the idea of
self-domestication
not good not good where is this going
refine articulate the insight, gracility,
greater facial expressiveness and
communication ability yes, plasticity and
adaptability yes, dependence on social
groups yes all right and it uh
self-critique and refine further wow is
this truly novel is it well supported uh
so on and so forth and the Insight is
getting at is humans are not just social
animals but profoundly
self-domestication apes and this
self-domestication is the key to
understanding our unique cognitive and
social abilities self-d domesticated
Apes self I prefer the Deep seek
response
I mean it's novel the insight is
novel I mean that's like a good book
title self-domesticated apes like there
could be a case made for that I mean
yeah it's cool and it's revealing the
reasoning it's magical it's magical
like this is really
powerful hello everyone this is Lex with
a quick intermission recorded after the
podcast since we reviewed responses from
DeepSeek R1 and Gemini flash 2.0
thinking during this conversation I
thought at this moment it would be nice
to insert myself quickly doing the same
for OpenAI o1 pro and o3 mini with the
same prompt The Prompt being give one
truly novel insight about humans and I
thought I would in general give my vibe
check and uh Vibe based anecdotal report
on my own experience
with the new o3 mini model now that I
got a chance to spend many hours with it
in different kinds of context and
applications so I would probably
categorize this question as uh let's say
open-ended philosophical question and in
particular the emphasis on novelty I
think is a nice way to uh test one of
the capabilities of the model which is
come up with something that makes you
pause and almost surprise you with its
Brilliance so that said my General
review after running each of the models
on this question a bunch of times is
that o1 pro consistently gave brilliant
answers ones that gave me pause and made
me think both cutting in its insight and
just really nicely phrased with wit with
Clarity with Nuance over and over
consistently generating the best answers
after that is R1 which is less
consistent but again delivered brilliance
Gemini flash 2.0 thinking was third and
last was o3 mini actually it often
gave quite a generic answer at least to
my particular sensibilities that said in
a bunch of other applications that I
tested for uh brainstorming purposes it
actually worked extremely well and often
uh outperformed R1 but on this
open-ended philosophical question it did
consistently worse now another important
element for each of these models is how
the reasoning is presented deep seek R1
shows the full Chain of Thought tokens
which I personally just love for these
open-ended philosophical questions it's
really really interesting to see the
model think through it but really also
just stepping back me as a person who
appreciates intelligence and reasoning
and reflection reading these kind of
Chain of Thought raw tokens of R1
there's something genuinely beautiful
about observing the path of deliberation
in an intelligent system I think we
don't always have that explicitly laid
out for us humans so to see it in
another intelligence system the
nonlinearity of it akin to Ulysses or
Finnegans Wake by James Joyce it's just
beautiful to watch anyway as we
discussed in the episode deep seek R1
talked about humans being able to
convert selfish desires into Cooperative
systems by collectively pretending
abstract rules like money laws and
rights are real and uh these shared
hallucinations act as games where
competition is secretly redirected to
benefit the group turning conflict into
society's fuel Gemini 2.0 flash thinking
said humans are not just social animals
but self-domestication apes and this
self-domestication is the key to
understanding our unique cognitive and
social abilities now it's important to
say that the Chain of Thought there was
really interesting it was looking
through the entire evolution of life on
Earth considering apex
predators and considering how from that
we ended up to where we are I think that
domestication by choice is a really
interesting angle again it's one of
those things when somebody presents a
different angle on a seemingly obvious
thing it just makes me smile and the
same with DeepSeek R1 that these
hallucinations of money laws and rights
and US collectively pretending like it's
real and we play games with them that
look like competition when secretly
we're just cooperating with each other
and that is the fuel of progress
beautifully put now OpenAI o1 pro
consistently over-delivered bangers
I can go through many of them but the
first one was uh humans are the only
species that turns raw materials into
symbolic resources then uses those
symbols to reorganize the very materials
they came from creating a closed feedback
loop between meaning and matter here I
just ran it
again Banger after Banger I'm telling
you humans are unique among known
species in that they simultaneously
rewrite two layers of reality the
external world and their own private
mental Landscapes and then merge these
two Rewritten layers into a continuous
personal narrative that feels
objectively true feels true it's this is
poetry okay and then o3 mini high for me
was smart fast
actually and uh kind of generic never
quite got there for me so here's the
first one I got from o3 mini humans are
not fixed beings but rather ongoing
narratives Dynamic stories that we
continuously write edit and reinterpret
this narrative plasticity is more than
just memory or self-reflection it's
an intrinsic cognitive process that acts
like an internal error correction system
it allows us to adapt our identities and
values over time in response to new
experiences challenges and social
contexts now it almost sneaks up to
something approximating cutting Insight
with uh narrative plasticity in quotes
but then it goes back to the sort of the
generic I don't know all of these models
are incredible for different reasons
there's a lot of concerns as we
discussed in this episode but there's uh
a lot of reasons to be excited as well
and I've probably spoken for too long I am
severely sleep deprived borderline
Delirious so hopefully some of this made
sense and now dear friends back to the
episode I think you know to Nathan's
point when you look at
like the reasoning models to me even
when I used R1 versus o1 there was like
that sort of rough edges around the
corner feeling right um and Flash
thinking you know earlier I didn't use
this version but the one from December
and it definitely had that rough edges
around the corner feeling right where
it's just not fleshed out in as many
ways right sure they added math and
coding capabilities via these verifiers
in RL but you know it feels like
they lost something in certain areas and
o1 is worse performing than ChatGPT in many
areas as well to be clear not by a
lot not by a lot though right and it's
like some of like R1 definitely felt to
me like it was worse than V3 in certain
areas like doing this RL it expressed and
learned a lot but then it weakened in
other areas and so I think that's one of
the big differences between these models
and then what o1 offers and
then OpenAI has o1 pro and what they
did with o3 which is like also very
unique is that they stacked search on
top of chain of thought right and so
Chain of Thought is one thing where it's
able it's one chain it backtracks goes
back and forth but how they
solved the ARC-AGI challenge was not just
the chain of thought it was also
sampling many times I.E running them in
parallel and then selecting is running
in parallel actually search because I
don't know if we have the full
information on how o1 pro works so
I don't have enough information
to confidently say that it is search it
is parallel samples yeah and then it
selects something and we don't know what
the selection function is the reason why
we're debating is because since o1 was
announced there's been a lot of interest
in techniques called Monte Carlo tree
search which is where you will break
down the chain of thought into
intermediate steps we haven't defined
Chain of Thought Chain of Thought is
from a paper from years ago where you
introduce the idea to ask a language
model that at the time was much less
easy to use you would say let's verify
step by step and it would induce the
model to do this bulleted list of steps
Chain of Thought is now almost a default
in models where if you ask it a math
question you don't need to tell it to
think step by step and the idea with
Monte Carlo tree search is that you would
take an intermediate point in that chain
do some sort of expansion spend more
compute and then select the right one
that's like a very complex form of
search that has been used in things like
MuZero and AlphaZero potentially I know
MuZero does this another form of search is
just asking five different people and
then taking the majority answers right
there's a variety of like you know it
could be complicated it could be simple
we don't know what it is just that they
are not just issuing one chain
of Thought in sequence they're launching
many in parallel and in the ARC-AGI
challenge for the one that like really
shocked everyone that beat the benchmark
they would launch a thousand
in parallel and then they would get the
right answer like 80% of the time or 70%
of the time 90 maybe even whereas if
they just launched one it was like 30%
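the sampling-and-voting idea described here can be sketched in a few lines; this is a toy simulation, not OpenAI's actual method — the roughly 30% single-chain accuracy and the thousand parallel samples are the figures quoted in the conversation, and the "model" is just a biased random choice:

```python
import random
from collections import Counter

random.seed(0)

def sample_answer(p_correct=0.3):
    # one simulated chain of thought: the right answer 30% of the time,
    # otherwise one of several scattered wrong answers
    if random.random() < p_correct:
        return "correct"
    return random.choice(["wrong_a", "wrong_b", "wrong_c"])

def majority_vote(n_samples, p_correct=0.3):
    # launch n chains "in parallel" and take the most common answer
    votes = Counter(sample_answer(p_correct) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

def accuracy(n_samples, trials=200):
    # how often the majority answer is the correct one
    wins = sum(majority_vote(n_samples) == "correct" for _ in range(trials))
    return wins / trials

print(accuracy(1))     # roughly 0.3 with a single chain
print(accuracy(1000))  # close to 1.0 with a thousand chains
```

the key condition is that wrong answers scatter across many alternatives while the correct one repeats, so the majority converges on it; this is the simple "ask many and take the majority" variant, not full Monte Carlo tree search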
there are many extensions to this I
would say the simplest one is that our
language models to date have been
designed to give the right answer the
highest percentage of the time in one
response and we are now opening the door
to different ways of running inference
on our models in which we need to
re-evaluate many parts of the training
process which normally opens the door to
more progress but we don't know if open
AI changed a lot or if just sampling
more in multiple choice is what they're
doing or if it's something more complex
where they changed the training and they
know that the inference mode is going to
be different so we're talking about o1
pro $200 a month and they're losing
money
so the thing that we're referring to
this exploration of the test
time compute
space is that actually possible do we
have enough compute for that do the
financials make sense so the fantastic
thing is it's in the
thing that I pulled up earlier but
the cost for GPT-3 has plummeted if
you scroll up just a few images I
think the important thing about like hey
is cost a limiting factor here right
like my view is that we'll
have really awesome intelligence
like AGI before we have
it permeate throughout the economy
and this is sort of the reason why
right GPT-3 was trained in what 2020 2021
and the cost for running inference on
it was $60 to $70 per million tokens right
and the cost per intelligence
was ridiculous now as we scaled
forward two years we've had a 1200X
reduction in cost to achieve the same
level of intelligence as GPT-3 so here
on the x-axis is time
over just a couple of years and on the
y-axis is log
scale dollars to run inference on a
million tokens yeah million and so you
have just a linear
decline on log scale from GPT-3
through 3.5 to Llama it's like 5 cents or
something like that now right which is
versus $60 1200X that's
not the exact numbers but it's 1200X I
remember that number it is the humongous
cost per intelligence right
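the 1200x figure can be sanity-checked with quick arithmetic; the prices and the roughly three-year window are the rough numbers quoted in the conversation, not exact measurements:

```python
import math

# rough figures quoted in the conversation, not exact measurements
start_price = 60.0   # $ per million tokens for GPT-3-level inference, ~2021
end_price = 0.05     # $ per million tokens a few years later
years = 3

reduction = start_price / end_price         # total cost reduction factor
per_year = reduction ** (1 / years)         # implied annual improvement
halvings = math.log2(reduction)             # number of cost halvings
months_per_halving = years * 12 / halvings  # vs ~24 months for Moore's law

print(f"{reduction:.0f}x cheaper overall")
print(f"~{per_year:.1f}x cheaper per year")
print(f"cost halves every ~{months_per_halving:.1f} months")
```

under these assumptions the cost per token halves roughly every three and a half months, which is the "Moore's law at an insane time scale" point made later in the conversation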
now the freak out over deep seek is oh
my God they made it so cheap it's like
actually if you look at this trend line
they're not below the trend line first
of all and at least for GPT-3 right
they are the first to hit it right which
is a big deal but they're
not below the trend line as far as GPT-3
now we have GPT-4 what's going to happen
with these reasoning capabilities right
it's a mix of architectural Innovations
it's a mix of better data and it's going
to be better training techniques and all
of these different better inference
systems uh better Hardware right uh
going from you know each generation of
GPU to new generations or ASICs everything
is going to take this cost curve down
and down and down and down and then can
I just spawn a thousand
different llms to complete a task and then
pick from one of them or you know
whatever search technique I want
a tree Monte Carlo tree search maybe it
gets that complicated um maybe it
doesn't because it's too complicated to
actually scale like who knows bitter
lesson right the question is I
think when not if because the rate of
progress is so fast right Dario said 9
months ago the cost to
train and inference was this right
and now we're much better than this
right um and deep seek is much better
than this and that cost curve for
GPT-4 which was also roughly $60 per
million tokens when it launched has
already fallen to you know $2 or so
right and we're going to get it down to
cents probably for GPT-4 quality and the
same and then that's the
base for the reasoning models like o1
that we have today and o1 pro is
spawning multiple right and o3 and
you know so on and so forth these search
techniques too expensive today but they
will get cheaper and that's
what's going to unlock the intelligence
right so it gets cheaper and cheaper and
cheaper the big DeepSeek R1 release
freaked everybody out because of the
cheapness one of the manifestations of
that is Nvidia stock plummeted uh can
you explain what happened I mean and
also just explain this moment and
whether you know if Nvidia is going to
keep winning we're both Nvidia Bulls
here I would say and in some ways the
market response is reasonable most of
the market like nvidia's biggest
customers in the US are major tech
companies and they're spending a ton on
AI and if a simple interpretation of
deep seek is you can get really good
models without spending as much on AI so
in that capacity it's like oh maybe
these big tech companies won't need to
spend as much on AI and it'll go down the actual
thing that happened is much more complex
where there's social factors where
there's the rise in the App Store the
social contagion that is happening and
then I think some of it is just
like I don't trade I don't know
anything about financial markets but it
builds up over the weekend the social
pressure where it's like if it was
during the week and there were multiple
days of trading when this was really
becoming a thing but it comes on the weekend and
then everybody wants to sell and that is
a social contagion I think and
like there were a lot of false
narratives which is like hey these guys are
spending billions on models right and
they're not spending billions on models
no one spent more than a billion dollars
on a Model that's released publicly
right GPT-4 was a couple hundred million
and then you know they've reduced the
cost with 4 turbo and 4o right
but billion dollar model runs are coming
right and this includes pre-training and
post-training right and then the other
number is like hey deep seek didn't
include everything right they didn't
include you know a lot of the cost goes
to research and all this sort of stuff a
lot of the cost goes to inference a lot
of the cost goes to post training none
of these things were factored in research
salaries right like all these things are
like counted in the billions of dollars
that OpenAI is spending but they weren't
counted in the you know hey 6 million 5
million that deep seek spent right so
but so there's a bit of misunderstanding
of what these numbers are um and then
there's also an element
of Nvidia has just been a straight line
up right and there's been so many
different narratives that have been
trying to push down Nvidia I won't
say push down Nvidia's stock but everyone is
looking for a reason to sell or to be
worried right you know it
was Blackwell delays right
every two weeks there's a new report about
their gpus being delayed and then
there's the whole thing about scaling
laws ending right it's so ironic
right it lasted a month it was
just like literally hey
models aren't getting better right
they're just not getting better there's
no reason to spend more pre-training
scaling is dead and then it's like o1 o3
right R1 right and now it's like wait
models are
progressing too fast slow down the
progress stop spending on gpus right but
you know the funniest thing I think that
like comes out of this is Jevons
paradox is true right AWS pricing for
h100s has gone up over the last couple
weeks right since a
little bit after Christmas since V3 was
launched AWS h100 pricing has gone up
h200s are like almost out of stock
everywhere because you know h200 has
more memory and therefore R1 like you
know wants that chip over h100 right we
were trying to get gpus on a short
notice this week for a demo and it
wasn't that easy we were trying to get
just like 16 or 32 h100s for a demo and it
was not very easy so for people who
don't know Jevons paradox is when
you know the efficiency goes up somehow
magically counterintuitively the Total
Resource consumption goes up as well
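Jevons paradox can be illustrated with a toy constant-elasticity demand model; the elasticity of 1.5 here is an arbitrary illustrative assumption, not an empirical estimate for GPU or token demand:

```python
# constant-elasticity demand: demand = k * price^(-elasticity)
# elasticity=1.5 is an arbitrary illustrative assumption
def total_spend(price, elasticity=1.5, k=1.0):
    demand = k * price ** (-elasticity)  # units consumed at this price
    return price * demand                # total dollars spent

before = total_spend(price=1.0)      # spend at the old price
after = total_spend(price=1.0 / 10)  # efficiency makes tokens 10x cheaper

print(before, after)  # spend rises ~3.2x despite the price drop
```

with elasticity above 1 a 10x price drop raises total spend; below 1 it would fall, which is exactly the debate here about whether cheaper models shrink or grow Nvidia's market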
right and in semiconductors you know
we've had 50 years of Moore's law every two
years half the cost double the
transistors just like clockwork and it's
slowed down obviously but like the
semiconductor industry has gone up the
whole time right it's been wavy
right there's obviously ebbs and flows and I
don't expect AI to be any different
right there's going to be ebbs and flows but
this is in AI it's just playing out at
an insane time scale right it was 2x
every two years this is 1200X in like
three years right so it's like the
scale of improvement is hard
to wrap your head around yeah I was
confused because to me Nvidia stock
on that should have gone up but maybe
went down because there's kind of
suspicion of foul play on the side of
China or something like this but if you
just look purely at the actual
principles that play here like it's
obvious yeah Jevons paradox the more progress
that AI makes or the higher the
derivative of AI progress is especially
because Nvidia is in the best
place the higher the derivative is the
sooner the Market's going to be bigger
and expanding and Nvidia is the only one
that does everything reliably right now
because it's not like an Nvidia
competitor arose it's another
company that's using Nvidia who
historically has been a large Nvidia
customer yeah and has press
releases cheering about being
China's biggest Nvidia customer right
like yeah obviously they've
quieted down but like I think that's
another element of it is that they
don't want to say how many gpus they
have yeah because yes they have
H800s yes they have H20s they also have
some h100s right which were smuggled in
can you speak to that to the smuggling
what's the scale of smuggling that's
feasible for a nation state to do for
companies is it possible to think I
think there's a few angles of smuggling
here right one is ByteDance arguably is
the largest smuggler of gpus for China
right China's not supposed to have gpus
ByteDance has like over 500,000 gpus
why because they're all rented from
companies around the world they rent
from Oracle they rent from Google they
rent from all these massive companies and a bunch
of smaller cloud companies too right all
the neoclouds of the world they
rent so many gpus they also buy a
bunch right and they do this for
mostly like what meta does right serving
TikTok right serving the next best video
same as meta right to be clear that's
the use today right and it's a valid use
right hack the dopamine circuit right
now that's theoretically now very
much restricted with the AI diffusion
rules which happened in the last week at
the Biden admin and uh Trump admin looks
like they're going to keep them which
limits like allies even like Singapore
which Singapore is like 20 to 30% of
Nvidia's revenue
but Singapore's had a moratorium on
building data centers for like 15 years
because they don't have enough power so
where are they
going I mean I'm not claiming they're
all going to China right but a portion
are you know many are going to Malaysia
including Microsoft and Oracle have
big data centers in Malaysia like you
know they're going all over
southeast Asia probably India as well
right like there's stuff routing but
like the diffusion rules are very
de facto like you can only buy this many
gpus from this country and it's and you
can only rent a cluster of this large to
companies that are Chinese right like
they're very explicit on trying to stop
smuggling right and a big chunk of it
was hey you know a random
company buys 16 servers ships them
to China right um there's actually
I saw a photo from someone in the
semiconductor industry who
leads a team for networking
chips that competes with Nvidia and
he sent a photo of a guy checking into a
first class United flight from San
Francisco to Shanghai or Shenzhen with
a Supermicro box that is this big
which can only contain gpus right and he
was booking first class cuz think about
it 3 to 5K for your first class ticket
the server costs you know 240,000 to
250,000 in the US you sell it for 300,000 in China
wait you just got a free first class
ticket and a lot more money so it's like
you know and that's like small scale
smuggling most of the large scale
smuggling is like companies in Singapore
and Malaysia like routing them around or
renting gpus completely legally I want
to jump in on how much the scale is I think
there's been some numbers like some
people that have a higher level
economics understanding say that like as
you go from 1 billion of smuggling to 10
billion it's like you're hiding certain
levels of economic activity and that's
the most reasonable thing to me is that
there's going to be some level where
it's so obvious that it's easier to find
this economic activity and yeah so
my belief is that last year
roughly Nvidia made a million
h20s which are legally allowed to be
shipped to China which we talked about
is better for reasoning right inference
at least maybe not training
but reasoning inference and inference
generally then they also had you know a
couple hundred thousand we think like
200 to 300,000 gpus were routed to China
from you know Singapore Malaysia us
wherever companies spun up buy 16 gpus
64 gpus whatever it is route it and
Huawei is known for having spun up a
massive network of like companies to get
the materials they need after they were
banned in like 2018 so it's not like
otherworldly but I agree right
Nathan's point is like hey you can't
smuggle $10 billion of gpus and then
the third sort of source which is just
now banned and you know which wasn't
considered smuggling but is China
renting I believe from our
research right Oracle's biggest GPU
customer is ByteDance right and
for Google I think it's their second
biggest customer right and so
you go down the list of clouds and
especially these smaller Cloud companies
that aren't like the hyperscalers right
think beyond CoreWeave and Lambda even there's
60 different new cloud
companies serving Nvidia gpus I think
ByteDance is renting a lot of these right
all over right and so these companies
are renting gpus to Chinese companies
and that's completely that was
completely legal up until the diffusion
rules which happened just a few weeks
ago and even now you can rent GPU
clusters that are less than 2,000 gpus
or you can buy gpus and ship them
wherever you want if you're if they're
less than 1500 gpus right so it's like
there are still like some ways to
smuggle but yeah it's not you know as
the numbers grow right uh you know 100
something billion dollars of revenue for
NVIDIA last year 200 something billion
this year right and next year you
know it could nearly double
again or more than double right based on
like what we see with data center
Footprints like being built out all
across the US and the rest of the world
it's going to be really hard for China
to keep up with these rules right yes
there will always be smuggling um and
deep- seek level models of gp4 level
models uh 01 level models capable to
train on what China can get even the
next tier above that but if we speedrun
a couple more you know jumps right you
know to billion dollar models 10 billion
dollar models then it becomes you know
hey there is a compute disadvantage for
China for training models and serving
them and the serving part is really
critical right DeepSeek cannot serve
their model today right it's
completely out of inventory it's
already started falling in the App Store
in actual downloads because you download
it you try and sign up they say we're
not taking registrations because they
have no capacity right you open it up
you get like less than five tokens per
second if you even get your request
approved right because there's just no
capacity because they just don't have
enough gpus to serve the model even
though it's incredibly efficient it
would be fascinating to watch the
smuggling cuz I mean there's drug
smuggling right that's a market
there's weapons smuggling and gpus will
surpass that at some point they're the highest
value per kilogram probably by
far um I have another question for
you Dylan do you track model API access
internationally how easy is it for
Chinese companies to use hosted model
apis from the US yeah I mean that's
incredibly easy right like open AI
publicly stated deep seek uses their API
and as they say they have evidence right
and this is another element of
the training regime people at OpenAI
have claimed that it's a distilled model
i.e. you're taking OpenAI's model you're
generating a lot of output and then
you're training on the output in their
model and even if that's the case
what they did is still amazing by the
way what DeepSeek did efficiency wise
distillation is standard practice in
industry if you're at a
closed lab where you care about terms of
service and IP closely you distill from
your own models if you are a researcher
and you're not building any products you
distill from the OpenAI models this is a
good opportunity can you explain big
picture distillation as a process
what is distillation what's the process
of distillation we talk a lot about training
language models they are trained on text
and post training you're trying to train
on very high quality text that you want
the model to match the features of or if
you're using RL you're letting the model
find its own thing but for supervised fine
tuning and preference data you need to
have some completions that the model is
trying to learn to imitate and what you
do there is instead of human data or
instead of the model you're currently
training you take completions from a
different normally more powerful model I
think there's rumors that these big
models that people are waiting for these
GPT-5s of the world the Claude 3 Opuses
of the world are used internally to do
this distillation process there's also
public examples right like meta
explicitly stated not necessarily
distilling but they used 405b as a
reward model for 70b in their llama 3.2
or 3.3 this is all the same topic so is
this uh is this ethical is this legal
like why does that Financial Times
article headline say OpenAI says that
there's evidence that China's DeepSeek
used its model to train a competitor
at least in the academic side
and research side it's a long history
because you're trying to interpret open
AI's rules OpenAI's terms of service say
that you cannot build a competitor with
outputs from their models terms of
service are different than a license
which are essentially a contract between
organizations so if you have a terms of
service on open ai's account if I
violate it open AI can cancel my account
this is very different than like a
license that says how you could use a
downstream artifact so a lot of it
hinges on a word that is very unclear in
the AI space which is what is a
competitor so and then the ethical
aspect of it is like why is it unethical
for me to train on your model when you
can train on the internet's text yeah
right so there's a bit of a hypocrisy
because sort of OpenAI and potentially
most of the companies trained on the
internet's text without permission
there's also a clear loophole which is
that I generate data from OpenAI and
then I upload it somewhere and then
somebody else trains on it and the link
has been broken like they're not
under the same terms of service contract
this is why there's a lot of
to be
discovered details that don't make a lot
of sense this is why a lot of models
today even if they train on zero open AI
data you ask the model who trained you
it'll say I'm ChatGPT trained by
OpenAI because there's so much copy paste
of OpenAI outputs on the
internet that you just weren't able to
filter it out and there was
nothing in the RL or post
training or SFT whatever they
implemented that says hey
I'm actually a model by the Allen
Institute instead we have to do
this if we serve a demo we do research
and we use OpenAI APIs because it's
useful and we want to understand post
training and like our research models
they will say they're written by open AI
unless we put in the system prompt that we
talked about that like I am Tulu I am a
language model trained by the Allen
Institute for AI and if you ask more
people around industry especially with
post-training it's a very doable task to
make the model say who it is or to
suppress the open AI thing so in some
levels it might be that DeepSeek didn't
care that it was saying that it was by
open AI like if you're going to upload
model weights it doesn't really matter
because anyone that's serving it in an
application and cares a lot about
serving if
they're using it for a specific task
they're going to tailor it to that and
it doesn't matter that it's saying it's
ChatGPT oh I guess one of the ways
to do that is like a system prompt or
something like that like if you're
serving it to say that you're that's
what that's what we do like if we host
the demo you say you are Tulu three a
language model trained by the Allen
Institute for AI we also are benefited
from open AI data because it's a great
research tool I mean do you think
there's any truth and value to
OpenAI's claim that there's
evidence that China's DeepSeek used its
model to train I think everyone has
benefited regardless because the data is
on the internet and therefore it's in
your pre-training now right there are like
subreddits where people share the best
ChatGPT outputs and those are
in your pre-training I think that they're trying to
shift the narrative like they're trying
to protect themselves and we saw this
years ago when ByteDance was actually
banned from some OpenAI APIs for
training on outputs there's other AI
startups that most people if you're in
the like AI culture were like they just
told us they trained on OpenAI
outputs and they never got banned like
that's how they bootstrapped their early
models so it's much easier to get off
the ground using this than to set up
human pipelines and build a strong model
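the distillation recipe described earlier — generate completions from a stronger teacher then supervised fine-tune a student on those pairs — can be sketched as follows; the teacher and student here are toy stand-ins (a format string and a lookup table), not real model APIs:

```python
# toy stand-ins: the "teacher" is a format string, the "student" a dict
def teacher_generate(prompt: str) -> str:
    # stand-in for sampling a completion from a stronger model
    return f"high-quality answer to: {prompt}"

def build_distillation_set(prompts):
    # the training set is just (prompt, teacher completion) pairs
    return [(p, teacher_generate(p)) for p in prompts]

def fine_tune(student, dataset):
    # stand-in for supervised fine-tuning: imitate the teacher's outputs
    student.update({prompt: completion for prompt, completion in dataset})
    return student

prompts = ["explain MoE", "what is RLHF"]
student = {}  # a toy "model": a lookup table
student = fine_tune(student, build_distillation_set(prompts))
print(student["explain MoE"])  # the student now mimics the teacher
```

the point of the sketch is the data flow: the student never needs human-written completions, only prompts plus teacher outputs, which is why it is so much cheaper than setting up human pipelines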
so there's a long history here and a lot of
the communications seem like
narrative control actually like over
the last couple days we've seen a lot of
people distill DeepSeek's model into
llama models because the DeepSeek
models are kind of complicated to
run inference on because they're a mixture
of experts and they're you know 600 plus
billion parameters and all this and
people distilled them into the llama
model and then because the Llama models
are so easy to serve and everyone's
built the pipelines and tooling for
inference with the Llama models right
because it's the open standard so you
know we've seen a sort of
roundabout right like is it bad is
it illegal maybe it's illegal whatever I
don't know about that but like it could
break contracts I don't think it's
illegal like no one's
going to jail for this ever I think
like fundamentally I think it's ethical
or I hope it's ethical because like the
moment we ban that kind of thing
it's going to make everybody much worse
off and I also actually it's this is
difficult but I think you should be
allowed to train on the internet I know
a lot of authors and creators are very
sensitive about it that's that's a
difficult question but like the
moment you're not allowed to train on
the internet I agree I have a schizo
take on how you can solve this because
it already works I have a reasonable
take all right all right so you
know Japan has a law which you're
allowed to train on any training data
and copyrights don't apply if you want
to train a model A B Japan has 9 gigawatts of
curtailed nuclear power C Japan is
allowed under the AI diffusion rule to
import as many gpus as they'd like so
all we have to do we we have a market
here to make we build massive data
centers we rent them to the labs and
then we train models in a legally
permissible way and there's no ifs ands
or buts and now the models have no like
potential copyright lawsuit from New
York Times or anything like that no no
it's just like completely legal
so genius the early copyright lawsuits
have fallen in the favor of AI training
I would say that the long tail of use is
going to go in the favor of AI which is
if you scrape trillions
of tokens of data you're not looking and
saying this one New York Times article
is so important to me but if you're
doing audio generation for music or
image generation and you say make it in
the style of X that's a reasonable
case where you could figure out what is
their profit margin on inference I don't
know if it's going to be the 50/50 of
YouTube Creator program or something but
I would opt into that program as a
writer like please it's
just going to be a rough journey
but there will be some solutions like
that that makes sense but there's a long
tail where it's just on the internet I
think one of the other aspects of that
Financial Times article
implied and so that leads to a more
general question do you think
there's how difficult is is uh spying
Espionage and stealing of actual secret
code and data from inside companies how
much of that is being attempted code and
data is hard but ideas are easy Silicon
Valley operates on the on the way that
top employees get bought out by other
companies for a pay raise and a large
reason why these companies do this is to
bring ideas with them and there are
there's no I mean in California there's
rules that like certain
non-competes or whatever are illegal in
California and whether or not there's
NDAs and things that is how a lot of it
happens recently there was somebody
from Gemini who helped make this 1 million
context length and everyone is saying
the next llama I mean he went to the
meta team is going to have 1 million
context length and that's kind of how
the world works you know as far as like
industrial Espionage and things that has
been greatly successful in the past
right um you know the Americans did it
to the Brits uh the Chinese have done it
to the Americans right and you know so
on and so it's just it is a fact of life
um and so like to argue industrial
Espionage can be stopped is probably
unlikely you can make it difficult but
even then like there's all these stories
about like hey F-35 and F-22 have
already been like sort of like given to
China in terms of design plans and stuff
um stealing code and stuff between you know
I'd say companies not nation states is
probably very difficult um but ideas are
discussed a lot right whether it be a
house party in San Francisco or a
company changing employees or you know
or you know the like
mythical honey pot that always gets
talked about right like someone gets
honey potted right uh because everyone
working on AI is a single dude who's in
their 20s and 30s not everyone but like
an insane percentage um
so there's always like all these like
you know and obviously a honey
pot is like a spy a female spy
approaches you and like yeah yeah or
or male right you know it's San
Francisco right but um as a single dude
I will say in his late 20s right it's like
we are very easily corrupted right like
you know not corrupted
myself but you know we are we are
right everybody else not me I'm too
oblivious and I am not single so I'm
safe from that one espionage
vector yeah you have to make sure to
close all security
vulnerabilities so you uh Dylan collect
a lot of information about each of
the mega clusters for each of the major
AI companies can you uh talk about
the buildouts
that stand out yeah so I
think the thing that's like really
important about these Mega cluster build
outs is they're completely unprecedented
in scale right um US you know sort of
like data center power consumption has
been slowly on the rise and it's gone up
to 2 to 3% even through the cloud computing
revolution right data center consumption
as a percentage of total US power and
that's been over decades right of data
centers etc it's been climbing
slowly and now it's 2 to 3% but by the end of
this decade even when I say like 10%
by like 2028 2030 a lot of people that are
traditional data center people are like
that's nuts but then like people who are
in AI who have really looked
at this at like the Anthropics and open
AIs they're like that's not enough and
I'm like okay but like you know this is
this is both through uh globally
distributed uh and or distributed
throughout the us as well as like
centralized clusters right the the
distributed throughout the US is is
exciting and it's the bulk of it right
like hey you know uh OpenAI or you know
say meta is adding a gigawatt right um but
most of it is distributed through the US
for inference and all these other things
right so maybe we should lay out what a
what a cluster is so uh you know does
this include AWS maybe it's it's good to
talk about the different kinds of
clusters and what you mean by Mega
clusters and what's a GPU and what's a
computer maybe not that far back
but yeah so like what do we mean by the
Clusters I thought I was about to do the
Apple ad right what's a
computer so so traditionally data
centers and data center tasks have been
a distributed systems problem that is uh
capable of being spread very far and
widely right i.e. I send a request to
Google it gets routed to a data center
somewhat close to me um it does whatever
search ranking recommendation sends a
result back right um the nature of the
task is changing rapidly in that the
task there's two tasks that people are
really focused on now right it's not
database access it's not serve me the
right page serve me the right ad it's
now inference and inference is
dramatically different from traditional
distributed systems but it looks a lot
more similar and then
there's training right the
inference side is still like hey I'm
going to put you know thousands of gpus
in in you know blocks all around these
data centers I'm going to run models on
them you know user submits a request
gets kicked off or hey my service you
know they submit a request to my service
right they're on Word and they're like
oh yeah help me Copilot and it
kicks it off I'm on my Windows
Copilot whatever Apple Intelligence
whatever it is it gets kicked off to a
data center right and that data center
does some work and sends it back that's
inference that is going to be the bulk
of compute but then you know
there's thousands
of data centers that we're tracking with
like satellites and all these other
things and those are the bulk of
what's being built
and so that's like what's really
reshaping and that's what's getting
millions of gpus but the scale of the
largest cluster is also really important
right um when we look back at history
right like you know or through through
the age of AI right like it was a really
big deal when they did AlexNet on I
think two gpus or four gpus I don't
remember it's a really big
deal because they used gpus um and they used
multiple right but then over time the
scale has just been compounding right
and so when you skip forward to GPT-3
then GPT-4 GPT-4 was 20,000 A100 gpus an
unprecedented run right in terms of the
size and the cost right a couple hundred
million on a YOLO right a YOLO run for
GPT-4 and it yielded you know
this magical improvement that was like
perfectly in line with what was
experimented and just like a log scale
right oh yeah they have that plot from
the technical report the scaling
laws were perfect right but that's not a
crazy number right 20,000 A100s uh
roughly each GPU is consuming 400 watts
uh and then when you add in the whole
server right everything um it's like 15
to 20 megawatts of power right uh you
know maybe you could look up
what the power consumption of a
person is because the numbers are
going to get silly but like that 15 to
20 megawatts was standard data center
size it was just unprecedented that was
all gpus running one task and a
toaster is like a similar power
consumption to an A100 right H100 comes
around they increase the power from like
400 to 700 watts and that's just per GPU
and then there's all the associated
stuff around it so once you count all
that it's roughly like 1,200 to 1400
Watts for everything networking CPUs
memory blah blah blah so we should also
say so what's
required you said power so a lot of
power is required a lot of heat is
generated so cooling is required and uh
because there's a lot of gpus that have
to be or CPUs or whatever they have to
be connected so there's a lot of
networking yeah right yeah so I think
yeah sorry for uh skipping past that and
then the data center itself is like
complicated right but these are still
standardized data centers for GPT-4 scale
right now we step forward to sort of
what is the scale of clusters that
people have built last year right and it
ranges widely right it ranges from like
hey these are standard data centers and
we're just using multiple of them and
connecting them together really with a
ton of fiber between them a lot of
networking Etc that's what open Ai and
Microsoft did in Arizona right and so
they have a you know 100,000 gpus right
meta similar thing they took their
standard existing data center design um
and it looks like an h and they
connected multiple of them together um
and you know they got to they first did
16,000 gpus uh 24,000 gpus total only
16,000 of them were running on
the training run because gpus are very
unreliable so they need to have spares
to like swap in and out all the way to
like now 100,000 gpus that they're
training on llama 4 on currently right
like 128,000 or so right this is you
know think about 100,000 gpus um with
roughly 1,400 watts apiece that's
140 megawatts 150
megawatts right for the 128,000 right so you're
talking about you've jumped from 15 to 20
megawatts to almost 10x
that number 9x that number to 150
megawatts in two years right from
2022 to 2024 right and some people like
Elon he admittedly right and
he says it himself got into the game a
little bit late for pre-training large
language models right xai was started
later right but then he bent heaven
and hell to get his data center up and
get the largest cluster in the world
right which is 200,000 gpus um and
he did that he bought a factory in
Memphis uh he's upgrading the
substation at the same time he's got a
bunch of mobile power generation a bunch
of single cycle turbines he tapped the
natural gas line that's right next to
the factory and he's just pulling a ton
of gas burning gas he's generating all
this power he's in a factory in an old
Appliance Factory that's shut down and
moved to China long ago right like you
know and and and he's got 200,000 gpus
in it and now what's the next scale
right like all the hyperscalers
have done this now the next scale is is
is something that's even bigger right
and so you know Elon just to stick on
the topic he's he's building his own
natural gas plant like a proper one
right next door he's he's deploying tons
of Tesla mega pack batteries to make the
power more smooth and all sorts of other
things he's got like industrial chillers
to cool the water down because he's
water cooling the chips um so all these
crazy things to uh get the Clusters
bigger and bigger um but when you look
at like say what open AI did with
Stargate that's in um in
Abilene Texas right uh what they've
announced at least right it's not built
right Elon says they don't have the
money you know there's some debates
about this um but at full scale at least
the first section is like definitely
money's accounted for but there's
multiple sections but full scale that
data center is going to be 2.2 gigawatts
right 2,200 megawatts of power in and
roughly like 1.8 gigawatts or
1,800 megawatts of power
delivered to chips right now this is an
absurd scale 2.2 gigawatts is like more than
most cities right you know to be clear
um and delivered to a single cluster
that's connected to do training right um
to train these models to do both the
pre-training the post-training all of
this stuff right this is insane
what's the output of a nuclear power plant again and
everyone is doing this right everyone is
doing this right Meta in Louisiana
right they're building two natural gas
plants massive ones uh and they're and
then they're building this massive data
center um Amazon has like plans for this
scale uh Google has plans for this scale
um xai has plans for this scale right
like all of these the guys that are
racing the companies that are racing are
racing hard and they're doing multi-
gigawatt data centers right um to to
build this out because they they think
that yeah if I if I now have you know
obviously pre-training scaling is going
to continue but to some extent but then
also all this post-training stuff where
you have RL sandboxes for computer use or
whatever right like you know this is
where they're going and all these
verifiable domains where they just
keep learning and learning and learning
self-play whatever it is makes
the AI so much more capable because the
line does go up right uh as you throw
more compute you get more performance
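the more-compute-more-performance line can be made concrete with a toy power-law scaling curve; the constants below are illustrative assumptions, not fitted to any real lab's models, and the point is only the shape:

```python
# toy compute scaling law: loss falls as a power law in training compute,
# L(C) = a * C ** -alpha, so 10x compute buys far less than 10x better loss
a, alpha = 10.0, 0.05   # illustrative constants, not from any real fit

def loss(compute_flops: float) -> float:
    return a * compute_flops ** -alpha

ratio = loss(1e24) / loss(1e25)   # improvement from a 10x compute jump
print(round(ratio, 3))  # 1.122, about 12% lower loss, not 10x better
```

that ratio is exactly 10 ** alpha regardless of where you start, which is the diminishing-returns shape the conversation turns to next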
this shirt is about scaling laws um you
know to some extent it is diminishing
returns right you 10x the compute you
don't get a 10x better model right you get
diminishing returns but also you get
efficiency improvements so you bend the
curve right um and these scales of data
centers are doing you know wreaking
you know a lot of like havoc on the
network right you know Nathan was
mentioning Amazon has tried to
buy this nuclear power plant Talen um
and if you look at Talen stock it's just
like skyrocketing and um you know like
they're building a massive multi-gigawatt
data center there and you know you just
go down the list there's so many
ramifications an interesting thing is in
certain regions of the US transmitting
power costs more than actually generating
it right because the grid is so slow to
build and the demand for power and the
ability to build power and like ramping
on a natural gas plant or even a coal
plant is like easy enough to do but like
transmitting the power is really hard so
in some parts of the US like in Virginia
it costs more to transmit power than it
costs to generate it which is like you
know there's there's all sorts of like
second order effects that are insane
here can the power grid support this
kind of growth you know Trump's
executive orders there was a there was a
Biden executive order before the end of
the year but then Trump had some more
executive orders which uh hopefully
reduced the regulations to where yes
things can be built um but yeah this is
a big big challenge right is building
enough power fast enough are are you
going to basically have a nuclear power
plant next to a data center for each one
of these so the fun thing here is
it is too slow to build a power plant or to
reconfigure an existing power plant it's too
slow and so therefore you must use natural gas
data center power consumption is flat
right you know I mean which is
why nuclear is also good for it like
long term nuclear is a very natural fit
but in the short term you can't do solar or
anything like that
because data center power is flat like this
right like you're telling me you know
I'm going to buy tens of billions of
dollars of gpus and idle them because
the power is not being generated like
power is cheap right like if you look at
the cost of a cluster less than 20% of
it is power right uh most of it is the
capital cost and depreciation of the
gpus right and so it's like well screw
it I'll just like you know I'll just
build a natural gas plant this is what
meta is doing in Louisiana this is what
open AI is doing in in Texas and like
all these different places they may not
be doing it directly uh but they are
partnered with someone and so there is a
couple hopes right like one is you know
and Elon what he's doing in Memphis is
like you know to the extreme they're not
just using dual combined cycle gas which
is like super efficient he's also just
using single cycle and like mobile
generators and stuff which is less
efficient um but he's you know there's
also like the flip side which is like
solar power generation is like this and
wind is like this they're
differently correlated so if you
stack both of those plus you get a big
chunk of batteries um plus you have a
little bit of gas it is possible to run
it more green it's just the time scales
for that is slow right so people are
trying but you know meta basically said
whatever I don't care about my
sustainability pledge or they'll buy
power it's called a PPA power
purchasing agreement where there'll be a
massive Wind Farm or solar farm like
wherever and then they'll just pretend
like those electrons are being consumed
by the data center but in reality
they're paying for the power here and
selling it to the grid and they're
buying power here um and then another
thing is like Microsoft quit on some of
their sustainability pledges right Elon
uh he what he did with Memphis is
objectively somewhat dirty but he's also
doing it in an area where there's like a
bigger natural gas plant right next door
and like a sewer or not a sewer but
like a wastewater treatment plant and a
garbage dump nearby right and
he's obviously made the world a lot
more clean than that one data center is
going to make it dirty right so I think like it's
fine uh to some extent and maybe AGI
solves you know global warming and stuff
right whatever it is um you know this is
this is sort of the attitude that people
at the labs have right which is like
yeah we'll just use gas right because
the race is that important and if we
lose you know that's way worse right
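the economics behind the earlier claim that power is less than 20% of cluster cost can be sketched with a back-of-envelope; every number below is an illustrative assumption, not a vendor or utility figure:

```python
# back-of-envelope for the "power is less than 20% of cluster cost" point;
# all inputs are assumed round numbers for illustration only
gpus = 100_000
watts_all_in = 1_400                     # per gpu incl. networking, cpu, cooling
site_mw = gpus * watts_all_in / 1e6      # site draw in megawatts

price_per_mwh = 60.0                     # assumed wholesale-ish power price, $
annual_power = site_mw * 8_760 * price_per_mwh   # dollars per year

capex_per_gpu = 40_000.0                 # assumed all-in server cost per gpu, $
annual_capex = gpus * capex_per_gpu / 4  # straight-line over ~4 years

share = annual_power / (annual_power + annual_capex)
print(site_mw, round(share, 3))  # 140.0 0.069, well under 20%
```

under these assumptions power is well under a tenth of annual cost, which is why idling tens of billions of dollars of gpus waiting for clean power makes no sense to the labs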
I should say that uh I got a chance to
visit um the Memphis data center oh wow
and it's uh kind of incredible I mean I
visited with
Elon and the team and the rate
of innovation there is insane cuz my
sense is that you know nobody's ever
done anything of this scale and nobody
has certainly ever done anything of this
scale at the rate that xAI is doing so
they're like figuring it out I mean
sitting in on all these meetings with
their brainstorming it's like it's
insane it's exciting because they're
like they're trying to figure out what
the bottlenecks are how to remove the
bottlenecks how to make sure that you
know there's just so many really cool
things about putting together a data
center cuz you know everything has to
work it's uh the people that do like
the sysadmin you know the machine
learning all that is the exciting thing
and so on but really the people that run
everything are the folks that know
like the
low-level uh software and hardware that
runs everything the networking all of
that and so you have to like make sure
you have procedures that test everything
I think they're using ethernet I don't
know how they're getting that working but
they're using Nvidia Spectrum-X ethernet
um there's actually like I think yeah
the unsung heroes are the cooling and
electrical systems which are just
glossed over um but I think like like
one story that maybe is like exemplifies
how insane this stuff is is uh when
you're training right um you're
running
through the model a bunch right in the
most simplistic terms running through
the model a bunch and then
you're going to exchange everything and
synchronize the weights right so you do
you'll do a step this is like a step in
model training right and every step your
loss goes down hopefully and it doesn't
always but um you in the simplest terms
you'll be Computing a lot and then
you'll exchange right the interesting
thing is GPU power is most of it
networking power is some but it's a lot
less so while you're computing your
power for your gpus is here but then
when you're exchanging weights uh if
you're not able to overlap
Communications and compute perfectly
there may be a time period where your
gpus are just idle and you're exchanging
weights and you're like hey the model's
updating so you're exchanging the
gradients you do the model update and
then you start training again so the
power goes down and back up right and it's super
spiky and so funnily enough right like
this when you talk about the scale of
data center power right you can blow
stuff up so easily um and so Meta
actually has accidentally
upstreamed something to pytorch
where they added an operator and I kid
you not whoever made this like I want to
hug the guy because it's like pytorch
no powerplant blowup equals zero or equals one and
what it does is amazing
right you know when
you're exchanging the weights the GPU
will just compute fake numbers so the
power doesn't spike too much and so then
the power plants don't blow up because
the transient spikes screw stuff up
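the trick just described can be simulated in a few lines: when the gpus would otherwise sit idle during the gradient exchange, you have them crunch throwaway numbers so the cluster's draw stays flat. this is a toy model with assumed wattages, not the real pytorch operator:

```python
# toy simulation of power smoothing during training: without the trick the
# cluster's draw craters during the gradient all-reduce then snaps back,
# which is the transient spike that upsets the power plant
COMPUTE_W, IDLE_W = 700, 100   # assumed per-gpu draw while busy vs waiting

def step_power_trace(no_blowup: bool) -> list:
    trace = []
    for phase in ["forward_backward", "all_reduce", "optimizer_step"]:
        if phase == "all_reduce" and not no_blowup:
            trace.append(IDLE_W)    # gpus wait on the network, draw collapses
        else:
            trace.append(COMPUTE_W) # real work, or fake math to keep draw flat
    return trace

print(step_power_trace(no_blowup=False))  # [700, 100, 700] spiky
print(step_power_trace(no_blowup=True))   # [700, 700, 700] flat
```

the alternative, as mentioned, is smoothing the spikes on the electrical side with batteries instead of burning flops on fake numbers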
well that makes sense I mean you have to
do that kind of thing you have to make
sure they're not idle yeah and Elon's
solution was like let me throw a bunch
of Tesla Mega packs and a few other
things right like there everyone has
different solutions but like metas at
least was publicly and openly known
which is just like set this operator and
what this operator does is it just makes
the gpus compute nothing so that the
power doesn't Spike but that just tells
you how much power you're working with I
mean it's insane it's insane people
should just go Google like scale like
what does X watts do and go through all
the scales from one watt to a kilowatt
to a megawatt and you stare at
that and you see how high in the list a
gigawatt is and it's
mind-blowing can you say something about
the cooling so I I know elon's using
liquid cooling I believe in all cases
uh that's a new thing right most of them
don't use liquid cooling is there something
interesting to say about the cooling
yeah yeah so air cooling has been the
de facto standard uh throw a bunch of metal
heat sinks heat pipes etc and fans right
and like that's been
enough to cool it um people have been
dabbling in water cooling Google's tpus
are water cooled right um so they've
been doing that for a few years uh but
uh with gpus
no one's ever done the scale of water
cooling that Elon just did right uh um
now next generation Nvidia for
the like highest end GPU it is
mandated water cooling you have to water
cool it but Elon did it on this current
generation uh and that required a lot of
stuff right if you look at like some of
the satellite photos and stuff of of uh
the Memphis facility there's all these
external water chillers that are sitting
there basically it looks like
a semi-truck pod thing what's it called a
shipping container uh but really those are water
chillers and he has like 90 of those
water chillers just sitting outside 90
different containers right with water
you know that chill the water bring it
back to the data center and then you
distribute it to all the chips pull all
the heat out and then send it back right
and this is both a uh way to cool the
chips but also an efficiency thing all
right and going back to that like sort
of three Vector thing right there is um
there is you know memory bandwidth flops
and interconnect the closer the chips
are together the easier it is to do
high-speed interconnects right uh and so
this is also like a reason why
you'd want to go water cooling because
you can just put the chips right next to
each other and therefore get higher
speed
connectivity I got to ask you so in uh
one of your uh recent posts there's a
section called cluster measuring contest
so uh there's another word there but I
won't say it you
know uh who's got the
biggest now and who's going to have the
biggest the individual largest today is Elon
right um elon's cluster
in Memphis 200,000 gpus okay right um
meta has like 128,000 OpenAI has 100,000
now to be clear other companies have
more gpus than Elon they just don't have
them in one place right and for training
you want them tightly connected there's
some techniques that people are
researching and working on that let you
train across multiple regions but for
the most part you want them all in like
one area right so you can connect them
highly with highp speed networking um
and so you know Elon today has 200,000
gpus 100,000 H100s and 100,000
H200s right um meta open AI uh you know
and Amazon all have on the scale
of 100,000 a little bit less um but next
this year right this year people are
building much more right anthropic and
Amazon are building a cluster of 400,000
Trainium 2 which is Amazon's specific chip
uh trying to get away from
Nvidia right um you know uh meta and
open AI have plans for hundreds of
thousands but by next year you'll have
like 500,000 to 700,000 GPU clusters and
note those gpus are much higher power
consumption than existing ones right
Hopper 700 watts Blackwell goes to
1,200 watts right so the power per chip
is growing and the number of chips is
growing right yeah do you think
Elon said he'll get to a
million you think that's actually
feasible um I mean I don't doubt
Elon right uh the filings that he has
for like you know the power plant and the
Tesla battery packs it's clear he has
some crazy plans for Memphis um like
permits and stuff are on the open record
right um but it's not quite clear
what the time
scales are um I'd just never bet against him
right you know he's going to
surprise us so what's the idea with
these clusters if you have a million
gpus what
percentage in uh let's say two three
years is used for uh training and what
percent for pre-training and what percent is
used for the actual inference these mega
clusters make no sense for inference
right uh you could route inference there
and just not train um but most of the
inference capacity is being you know hey
I've got a 30 megawatt data center here
I've got 50 megawatts here I've got 100
here whatever I'll just throw inference
in all of those because the mega
clusters right multi- gigawatt data
centers I want to train there because
that's where all of my gpus are
collocated where I can put them at a
super high networking speed connected
together right because that's what you
need for training now with pre-training
this is the old scale right
you would increase parameters you'd
increase data the model gets better uh that
doesn't apply anymore
because there's not much more data in
the pre-training side right uh yes
there's video and audio and image that
has not been fully taken advantage of so
there's a lot more scaling there but a lot of
people have taken
transcripts of YouTube videos and
that gets you a lot of the data it doesn't
get you all the learning value out of
the video and image data but you know
there's still scaling to
be done on pre-training uh but this
post-training world is where all the
flops are going to be spent right the
model's going to play with itself it's
going to self-play it's going to do
verifiable tasks it's going to do
computer use in sandboxes it might even
do like simulated robotics things right
like all of these things are going to be
environments where compute is spent in
quote unquote post-training but I think
we're
going to drop the post
from post-training it's going to be
pre-training and it's going to be
training I
think at some point um because
for the like bulk of the
last few years um pre-training has
dwarfed post-training but with these
verifiable methods especially ones that
scale really you know potentially
infinitely like computer use and
Robotics not just math and coding right
where you can verify what's happening
those infinitely verifiable tasks it
seems you can spend as much compute as
you want on them especially as the
context length increases cuz the end of
pre-training is when you increase the
context length for these models and
we've talked earlier in the conversation
about how the context length when you
have a long input is much easier to
manage than output and a lot of these
post-training and reasoning techniques
rely on a ton of sampling and it's
becoming increasingly long context so
it's just like effectively your
compute efficiency goes down I mean
flops is the standard for how you
measure it but with RL you have to
do all these things where you move your
weights around in a different way than
in pre-training and just generation it's
going to become less efficient and flops
is going to be less of a useful term and
then as the infrastructure gets better
it's probably going to go back to flops
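the flops yardstick being discussed can be made concrete with the common rule of thumb of about 6 flops per parameter per token for dense training (2 forward, 4 backward), so total compute is roughly 6 * N * D; the model size, token count, per-gpu throughput, and utilization below are illustrative assumptions, not any lab's real figures:

```python
# rough training-flops accounting with the 6*N*D rule of thumb; all inputs
# are illustrative assumptions for the sake of the arithmetic
params = 405e9                 # a llama-405B-sized model, for illustration
tokens = 15e12                 # ~15T training tokens, for illustration
train_flops = 6 * params * tokens
print(f"{train_flops:.3e}")    # 3.645e+25 total training flops

# assumed cluster: 100k gpus at 1e15 flops/s each, 40% utilization
effective = 100_000 * 1e15 * 0.40
print(round(train_flops / effective / 86_400, 1))  # ~10.5 days of training
```

this is exactly the kind of bookkeeping that breaks down for RL-style post-training with heavy sampling and weight movement, which is the point about flops becoming a less useful term until the infrastructure catches up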
so all of the things we've been talking
about is most likely going to be Nvidia
right are there any competitors Google
Google I kind of ignored them what's the
story with TPU
the TPU is awesome right
it's great uh Google they're a bit
more tepid on building data centers for
some reason they're building big
data centers don't get me wrong and they
actually have the biggest
cluster to be clear I was talking about
Nvidia clusters they actually have the
biggest cluster period um but the way
they do it is like very interesting
right um they have two sort of like data
center super regions right in that the
data center isn't physically like all of
the gpus aren't physically on one site
but they're like 30 miles from each
other and not gpus TPUs right they have
like in Iowa and Nebraska they have
four data centers that are just like
right next to each other why doesn't
Google flex its cluster size go to the
multi data center training piece there's good images
in there so I'll show you what I mean
it's just uh the SemiAnalysis multi-data
center article um so this is like you know so
this is an image of like what a standard
Google data center looks like by the way
their data centers look very different
than anyone else's data centers what are
we looking at here um so these are yeah
so if you if you see this image right in
the center there are these big
rectangular boxes right those are where
the actual chips are kept um and then if
you scroll down a little bit further um
you you can see there's like these water
pipes there's these Chiller cooling
towers in the top and a bunch of like
diesel generators the diesel generators
are backup power the data center itself
looks physically smaller than the
water chillers right so the chips are
actually easier to like keep together
but then like cooling all the water for
the water cooling is very difficult
right so Google has like a very Advanced
infrastructure that no one else has for
the TPU um and what they do is they've
stamped a bunch of these data centers
out in a few regions right so if you go
a little bit further um down uh this is
this is Microsoft's this is in Arizona
this is where GPT-5 quote unquote will
be trained um you know uh if it doesn't
exist already yeah if it doesn't exist
already um but each of these data
centers right I've shown a couple images
of them they're like really closely
collocated in the same region right
Nebraska Iowa and then they also have a
similar complex in uh Ohio right um
and so these data centers are really
close to each other um and what they've
done is they've connected them super
high bandwidth with fiber um and so
these are just a bunch of data centers
and and and the point here is that
Google has a very Advanced
infrastructure um very tightly connected
in a small region so Elon always wants
to have the biggest cluster fully
connected right because it's all in one
building yeah right and he's completely
right on that right Google has the
biggest cluster and by a significant
margin but you have to spread it
over three sites you have to go across
multiple sites why doesn't Google
compete with Nvidia why don't they sell
tpus I think I think there's a couple
problems with it it's like one TPU has
been a form of allowing search to be
really freaking cheap and build models
for that right um and so a big
chunk of the TPU
purchases a big chunk of Google's
purchases and usage all of it is for
internal workloads right whether it be
search uh now Gemini right uh YouTube um
all these different applications that
they have uh you know ads um these are
where all their tpus are being spent and
that's what they're hyperfocused on
right um and so there's certain like
aspects of the architecture that are
optimized for their use case that are
not optimized elsewhere right one simple
one is like they've open sourced the
Gemma model and and they called it Gemma
7B right uh but then it's actually 8
billion parameters because the
vocabulary is so large because and the
reason they made the vocabulary so large
is because the TPU's matrix multiply
unit is massive because that's what
they've like sort of optimized for and
so they decided oh I'll just make the
vocabulary large too even though it
makes no sense to do so on such a small
model because that fits on their
hardware so Gemma doesn't run as efficiently
on a GPU as a llama does right but vice
versa llama doesn't run as efficiently
on a TPU as a Gemma does right and it's
so like there's like certain like
aspects of like Hardware software code
design so all their search models are
their ranking and recommendation models
all these different models that are AI
but not like gen AI right have been
hyper optimized with TPUs forever the
software stack is super optimized but
all of this software stack has not been
released publicly at all right um very
small portions of it JAX and XLA have
been but like the experience when you're
inside of Google and you're training on
tpus as a researcher you don't need to
know anything about the hardware in many
cases right like it's like pretty
beautiful but as soon as you step
outside a lot of them
go back they leave Google and then they
go back yeah yeah they
leave and they start a company because
they have all these amazing research
ideas and they're like wait
infrastructure is hard software is hard
and this is on gpus or if they try to
use tpus same thing because they don't
have access to all this code and so it's
like how do you convince a company whose
Golden Goose is search where they're
making hundreds of billions of dollars
from to start selling GPUs or TPUs uh
which they used to only buy a couple
billion of you know I think in 2023
they bought like a couple
billion and now they're buying like $10
billion to $15 billion worth but how do
you convince them that they could they
should just buy like twice as many and
figure out how to sell them and make $30
billion like who cares about making $30
billion won't that 30 billion exceed
actually the search profit eventually oh
I mean like you're always going to make
more money on Services than than than I
mean like yeah like you like to be clear
like today people are spending a lot
more on Hardware than they are the
services right because you the hardware
front runs the service spend but like
you're investing if if if there's no
revenue for AI stuff or not enough
Revenue then obviously like it's going
to blow up right you know uh people
won't continue to spend on gpus forever
um and Nvidia is trying to move up the
stack with like software that they're
trying to sell and license and stuff
right but Google has never had that like
DNA of like this is a product we should
sell right they don't Google Cloud
does which is a separate
organization from the TPU team which is
a separate organization from the Deep
Mind team which is a separate
organization from the search team right
there's a lot of bureaucracy wait Google
cloud is a separate team than the TPU
team technically TPU sits under
infrastructure which sits under Google
Cloud but Google cloud is for
renting stuff and TPU architecture
are very different goals right in
Hardware um and software like all of
this right like the Jax xla teams do not
serve Google's customers externally
whereas nvidia's various Cuda teams for
things like NCCL serve external
customers right um the internal teams
like JAX and XLA and stuff they more so
serve Deep Mind in search right and so
their customer is different they're not
building a product for them do do you
understand why AWS keeps winning uh
versus Azure for cloud versus Google
Cloud Google cloud is tiny isn't it
relatively Google cloud is third yeah
yeah um Microsoft is the second biggest
but Amazon is the biggest right um and
and Microsoft uh deceptively sort of
includes like Microsoft Office 365 and
things like that like some of these
enterprise-wide licenses so in reality
the gulf is even larger Microsoft is
still second though right um Amazon is
way bigger why because using AWS is
better and easier and in many cases it's
cheaper and it's first it was first yeah
but there's a lot of things that are
first that well it's easier and it's harder
to switch away than it is to stay on AWS there's big
fees for switching too AWS generates
over 80% of Amazon's profit I think over
90% that's insane the distribution
centers are just like one day we'll
decide to make money from this but they
haven't yet right like they make tiny
little profit from yeah one day Amazon
Prime will triple in price you would
think they would improve AWS uh
interface because it's like horrible
it's like clunky but everybody is I I
yeah you one would think I I think
actually Google's interface is sometimes
nice but it's also like they don't care
about anyone besides their top customers
and like their customer service sucks
and like they have a lot less like I
mean all these companies they
optimize for the big customers yeah
it's supposed to be for business and
Amazon has always optimized for the
small customer too though right like
obviously they optimize a lot for the
big customer but like when they started
they just would go to like random Bay
Area things and give out credits right
and then they like or just put in your
credit card and use us right like back
in the early days so the
businesses have grown with them right
so why
is snowflake all over Amazon because
snowflake in the beginning when Amazon
didn't care about them was still using
Amazon right and then of course one day
Snowflake and Amazon has a super huge
partnership but like this is the case
like Amazon's user experience and
quality is better also a lot of the
Silicon they've engineered makes them
have a lower cost structure in
traditional cloud storage CPU networking
that kind of stuff and in
databases right like you know I think
like four of Amazon's top five Revenue
products uh margin products sorry like
gross profit products are all database
related products like red shift and like
all these things right like um so so
Amazon has a very good
silicon-to-user-experience pipeline
with AWS I think Google their
silicon teams yeah they have awesome
silicon internally TPU the YouTube chip
um you know some of these other chips
that they've made and the problem is
they're not serving external customers
they're serving internal customers right
it's I mean nvidia's entire culture is
designed from the bottom up to do this
there's this recent book The Nvidia Way
by Tae Kim that details this and
how they look for future opportunities
and ready their Cuda software libraries
to make it so that new applications of high
performance computing can very rapidly
be evolved on Cuda and Nvidia chips and
that is entirely different than Google
as a Services business yeah I mean
Nvidia it should be said is a truly
special company like I mean they the
whole the culture of everything they're
really optimized for that kind of thing
speaking of which is there somebody that
can even challenge Nvidia hardware-wise
Intel AMD I really don't think so we
went through a like a very long process
of uh working with AMD on training on
their gpus inference and stuff and
they're decent their hardware is better
in many ways than nvidia's uh the
problem is their software is really bad
and I think they're they're getting
better right they're getting better
faster but they're just the gulf is so
large um and like they don't spend
enough resources on it or haven't
historically right maybe they're
changing their tune now but you know for
for for multiple months we were
submitting the most bugs right like us
semi analysis right like what the
like why are we submitting the most bugs
right cuz they only and they they only
cared about their like biggest customers
and so they'd ship them a private image
blah blah blah and it's like okay but
like I am just using PyTorch and I
want to use the publicly available
libraries and like you don't care about
that right so they're they're getting
better um but like I don't think AMD is
possible Intel is obviously in dire
straits right now um and needs to be
saved somehow uh very important for
national security for America can
you explain the obviously so why are
they in dire straits going back to earlier
only three companies
can do leading-edge R&D right TSMC in Hsinchu Samsung in
Pyeongtaek and then Intel in Hillsboro
Samsung's doing horribly Intel's doing
horribly we could be in a world where
there's only one company that can do R&D
and that one company already
manufactures most of the chips they've been
gaining market share anyways but like
that's a critical thing right so
whatever happens to Taiwan matters the rest of
the world's semiconductor industry and
therefore tech relies on Taiwan right um
and that's obviously precarious um as
far as like Intel they've been slowly
steadily declining they they were on top
of servers and PCs but now Apple's done
the M1 and nvidia's releasing a PC chip
and qualcomm's releasing a PC chip and
in servers hyperscalers are all making
their own Arm based uh server chips and
Intel has no AI silicon like wins
right they have very small wins um and
and they never got into Mobile because
they said no to the iPhone and like all
these things have compounded and they've
lost their process technology leadership
right they were ahead for 20 years and
now they're behind by at least a couple
years right and they're trying to catch
back up and we'll see if their 18a 14a
strategy works out where they try and
Leap Frog tsmc um but like and Intel is
just like losing tons of money anyways
right and they just fired their CEO even
though the CEO was the only person who
understood the company well right we'll
see he was not the best but he was
pretty good relatively technical guy
where does Intel make most of its money
the CPUs still PCS and data center CPUs
yeah but data center CPUs are all going
cloud and Amazon Microsoft Google are
making Arm-based CPUs uh and then on the
PC side amd's gained market share
nvidia's launching a chip is that not
going to be a success right mediatech and
Qualcomm have launched chips Apple's
doing well right like there they could
get squeezed a little bit in PC although
PC generally I imagine will just stick
with Intel mostly on the Windows side let's talk
about the broad AI race who do you think
wins we talked about Google
the default leader has been Google
because of their infrastructure
Advantage well like in the news OpenAI
is the leader they're leading and
they have the best
model that people can use and
they have the most AI revenue yeah
OpenAI is winning so who's making money
on AI right now is anyone making money
so accounting profit wise Microsoft is
making money but they're spending a lot
of capex right you know and that
gets depreciated over years uh meta is
making tons of money but with
recommendation systems which is AI but
not with llama right llama's losing
money for sure right um I think
anthropic and OpenAI are obviously not
making money cuz otherwise they wouldn't
be raising money right they have to
raise money to build more right um
although theoretically they are making
money right like you know you spent a few
hundred million dollars on GPT-4 and it's doing
billions in Revenue so like obviously
it's like making money although they had
to continue to research to get the
compute efficiency wins right and and
move down the curve to like you know
get that 1200X that has been
achieved for GPT-3 you know maybe we're
only at like a couple
hundred X now but you know with GPT-4
Turbo and 4o and there will be another
one probably cheaper than GPT-4o even
that comes out at some point and that
research cost a lot of money yep
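the compounding behind those efficiency numbers can be sketched in a few lines the 10x-per-year rate here is an illustrative assumption not a figure from the conversation

```python
# If cost per unit of capability falls by `factor_per_year` each year,
# the total reduction after `years` compounds multiplicatively.
def cost_reduction(factor_per_year: float, years: int) -> float:
    return factor_per_year ** years

# Assumed ~10x/year for illustration: three years gives 1000x,
# the same ballpark as the ~1200x cited for GPT-3-class capability.
print(cost_reduction(10, 3))  # 1000
```

the point of the sketch is just that a steady annual efficiency gain multiplies rather than adds which is why the cost curves collapse so quickly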
exactly that's the thing that I guess is
not talked about with the cost the that
uh when you're referring to the cost of
the model it's not just the training or
the test runs it's the actual research
the manpower that yeah to do things
like reasoning right now that that
exists they're going to scale it they're
going to do a lot of research still I
think I think the you know people focus
on the payback question but it's really
easy to like just be like well like you
know GDP is humans and Industrial
Capital right and if you can make
intelligence cheap then you can grow a
lot right that's the sort of dumb dumb
way to explain it but that's sort of
what basically the investment thesis is
um I think only Nvidia is actually
making tons of money and other Hardware
vendors um the hyperscalers are all on
paper making money uh but in reality
they're like spending a lot more on
purchasing the gpus which you don't know
if they're still going to make this much
money on each GPU in two years right um
You don't know if um you know all of a
sudden OpenAI goes kaput and now
Microsoft has like hundreds of thousands
of gpus they were renting to OpenAI
that they paid for themselves with
their you know investment in them um you
know that that no longer have a customer
right like this is always a possibility
I don't believe that right um I think
you know OpenAI will keep raising money
I think others will keep raising money
um because the Investments the the
returns from it are going to be
eventually huge once we have AGI so do
you think multiple companies will get
there I don't think it's winner take all
okay so it's not uh let's not call it
AGI whatever it's not like a single day
it's a gradual thing super powerful AI
but it's a gradually increasing set
of features that are useful a
rapidly
increasing set of features uh so you're
saying a lot of companies will be it
just seems
absurd that all of these companies are
building gigantic data centers there are
companies that will benefit from AI but
not because they train the best model
like meta has so many Avenues to benefit
from Ai and all of their services people
are there people spend time on meta
platforms and it's a way to make more
money per user per hour yeah it seems
like
Google
xAI Tesla important to say and then meta
will benefit not directly from the AI
like the llms but from the
intelligence like the additional boost
of intelligence to the products they
already sell so whether that's the
recommendation system or for Elon who's
been talking about Optimus the robot
potentially the intelligence of the
robot and then you have personalized
robots in the home that kind of thing he
thinks it's a 10 plus trillion dollar
business
which at some point maybe not
soon but who knows with robotics let's
do a TAM analysis right 8 billion
humans and let's get 8 billion robots
right and let's pay them the
average salary and yeah there we go 10
trillion more than 10 trillion yeah I
mean you know if there's robots
everywhere why does it have to be just
8 billion robots yeah
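that back-of-envelope TAM is just one multiplication the $10,000-per-robot-per-year labor value below is an assumed placeholder for the "average salary" not a number from the conversation

```python
# Back-of-envelope TAM: number of robots times an assumed annual
# labor value per robot.
def robot_tam(robots: int, annual_value_usd: int) -> int:
    return robots * annual_value_usd

# 8 billion robots at an assumed $10,000/year of labor value each:
total = robot_tam(8_000_000_000, 10_000)
print(f"${total / 1e12:.0f} trillion per year")  # $80 trillion per year
```

even with a deliberately modest per-robot value the product clears 10 trillion easily which is the spirit of the joke above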
yeah of course of course I'm
gonna have like one robot you're
gonna have like 20 yeah I mean I see a
use case for that so yeah so I guess
the benefit would be in the products
sell which is why OpenAI is in a
trickier position because all of
the value of OpenAI right now as a
brand is in ChatGPT and
for most users there's
not that much of a reason that they need
OpenAI to be spending billions and
billions of dollars on the next best
model when they could just license llama
5 and it would be way cheaper so
ChatGPT is an extremely
valuable entity to
them but like they could make more money
off other things the chat application
clearly does not have tons of
room to continue right like the standard
chat right where you're just using it
for random questions and stuff right the
cost continues to collapse V3 is the
latest biggest uh but it's going to get
supported by ads right like as you know
llama meta already serves 405b and
probably loses money but at some
point you know they're going to get uh
the models are going to get so cheap
that they can just serve them for free
with ads supported right and that's what
Google's going to be able to do and
that's obviously they've got a bigger
reach right so chat is not going to be
the only use case it's like these
reasoning code agents computer use all
this stuff is where OpenAI has to
actually go to make money in the future
otherwise they're kaput but X Google
and meta have these other products so
isn't it likely that OpenAI and
anthropic disappear eventually unless
they're so good at models they are but
it's such a cutting I mean it depends on
where you think AI capabilities are
going you have to keep winning yes you
have to keep winning as you climb
even if the capabilities are going
super rapidly awesomely in the direction
of AGI like there's still a boost for X
in terms of data Google in terms of data
meta in terms of data in terms of other
products and the money and like there's
just a huge amount of money the whole idea is
human data is kind of tapped out we
don't care we all care about self-play
and verifiable
tasks AWS does not make a lot of
money on each individual machine and the
same can be said for the most powerful
AI platform which is even though the
calls to the API are so cheap there's
still a lot of money to be made by
owning that platform
you have to believe that
there's a lot of discussions
that tokens and tokenomics and llm apis
are the next compute
layer or the
next Paradigm for the economy kind of
like energy and oil was but there's also
like you have to sort of believe that
apis and chat are not where AI is stuck
right it is actually just tasks and
agents and Robotics and computer use and
those are the areas where all the value
will be delivered not API not chat
application is it possible you have I
mean it all just becomes a commodity and
you
have uh the very thin
wrapper like
perplexity just joking uh there are a
lot of wrappers making a lot of money
yeah so but but do you think it's
possible that people would just even
forget what OpenAI and anthropic are
and just because the there'll be
wrappers around the API and it just
dynamically switches if model progress is not
rapid yeah it's becoming a
commodity right DeepSeek V3 shows this but
also the GPT-3 chart earlier
showed this right llama 3B is 1200X
cheaper than GPT-3 anyone
whose business model was GPT-3 level
capabilities is dead anyone whose
business model is GPT-4 level capabilities
is dead it is a common saying that the
best businesses being made now are ones
that are predicated on models getting
better right which would be like wrappers
that are riding the wave of the
models the short term the company that
could make the most money is the one
that figures out what advertising
targeting method works for language
model Generations we have the meta ads
which are hyper targeted in feed not
within specific pieces of content and we
have search ads that are used by Google
and Amazon has been rising a lot on
search but within a piece with within a
return from ChatGPT it is not clear how
you get a high quality placed ad within
the output and if you can do that with
model cost coming down you could get
super high Revenue per like that revenue
is totally untapped and it's not clear
technically how it is done yeah that is
I mean the sort of the AdSense
Innovation that Google did one day
you'll have an ad in the GPT output and
that's going to make like billions and
it could be very subtle it could be in
conversation like we have voice mode now
it could be some way of making it so the
voice introduces certain things it's
much harder to measure and it takes
imagination but yeah and it couldn't be
shady if it comes off shady
you would receive public blowback that
kind of thing so you have to do it
loud enough to where it's clear it's an
ad and balance all that so that's the
open question they're trying to solve
anthropic and OpenAI they need to they
might not say they care about that at
all they don't care about it right now I
think it's places like perplexity that are experimenting
on that more oh interesting yeah for
sure like perplexity Google Meta Care
about this um I think OpenAI and
anthropic are purely laser focused on
AGI yeah agents and AGI and if I build
AGI I can make tons of money right or I
can spend pay for everything right and
this is this is It's just predicated
like back on the like export control
thing right if you think AGI is 5 10
years away or less right these Labs
think it's two three years away
obviously your actions
are you know if you assume they're
rational actors which they are mostly
what you do in a two-year AGI timeline
versus five-year versus 10 years is very
very different right do you think
agents are promising we have to talk
about this this is like the
excitement of the year agents this is the
generic hype term that a lot of business
folks are using AI agents are going to
revolutionize everything okay so mostly
the the term agent is obviously
overblown we've talked a lot about
reinforcement learning as a way to train
for verifiable outcomes agents should
mean something that is open-ended and is
solving a task independently on its own
and able to adapt to uncertainty the
term agent gets applied to a lot of things
like apple intelligence which we still
don't have after the last WWDC which is
orchestrating between apps and that type
of tool use thing is something that
language models can do really well Apple
intelligence I suspect well will come
eventually it's a closed domain it's
your messages app integrating with your
photos with AI in the background that
will work that has been described as an
agent by a lot of software companies to
get into the narrative the question is
what ways can we get language models to
generalize to new domains and solve
their own problems in real time maybe
some tiny amount of training when they
are doing this with fine-tuning
themselves or in context learning which
is the idea of storing information in a
prompt and you can use learning
algorithms to update that and whether or
not you believe that that is going to
actually generalize to things
like me saying book my trip to go to
Austin in two days I have XYZ
constraints and actually trusting it
I think there's an HCI problem coming
back for
information well what's your what's
what's your prediction there because my
gut says we're very far away from that I
think OpenAI's statement I don't
know if you've seen the five levels
right where it's chat is level one
reasoning is level two and then agents
is level three and I think there's a
couple more levels but it's important to
note right we were in chat for a couple
years right we just theoretically got to
reasoning it will be here for a year or two
right and then agents but at the same
time like people can people can try and
like approximate capabilities of the
next level but the agents are doing
things autonomously doing things for
minutes at a time hours at a time Etc
right uh reasoning is doing things for
tens of seconds at a time right and then
coming back with an output that I still
need to verify and use and check out
right um so so and the biggest problem
is of course like um it's the same thing
with manufacturing right like there's
the whole Six Sigma thing right like you
know how many nines do you get and then
you compound the nines onto each other
and it's like if you multiply the
per-step reliability by the number of steps
you get to a
yield or something right so like in
semiconductor manufacturing with tens of
thousands of steps
99.99999% is not enough right because you
multiply by that that many times and you
actually end up with like 60% yield
right really low yield yeah or zero um
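the yield math being described is just per-step reliability raised to the number of chained steps the specific rates and step counts below are illustrative assumptions not figures from the conversation

```python
# End-to-end yield of a chain of steps where each step independently
# succeeds with probability p: yield = p ** steps.
def chained_yield(p: float, steps: int) -> float:
    return p ** steps

# Assumed numbers for illustration: "four nines" per step still
# collapses over 10,000 chained steps.
print(chained_yield(0.9999, 10_000))  # ~0.37
# A 20-step agent workflow at 99% success per step:
print(chained_yield(0.99, 20))  # ~0.82
```

the same formula covers semiconductor steps chained agent actions and self-driving interventions which is why "how many nines" is the question in all three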
and this is the same thing with agents
right like chaining tasks together each
time llms even the best llms on
particularly good benchmarks
don't get 100% right they get a little
bit below that because there's a lot of
noise um and so how do you get to enough
nines right this is the same thing with
self-driving we can't have
self-driving without it being
like super geofenced like Google's
Waymo right and even then they have a
bunch of teleoperators to make sure it
doesn't get stuck right but you can't do
that because it doesn't have enough
nines and self-driving has quite a lot
of structure because roads have rules
it's well defined there's regulation
when you're talking about computer use
for the open web for example or the open
operating system like there's no it's a
mess so like the possibility I'm I'm
always skeptical of any system that is
tasked with
interacting with the human world with
the open Messy thing if we can't get
intelligence that's enough to solve the
human world on its own we can create
infrastructure
like the human operators for Waymo over
many years that enable certain workflows
there is a company I don't
remember its name but that's
literally their pitch yeah we're just
going to be the human operator when
agents fail and you just call us and we
fix it yeah an API call and it's
hilarious there's going to be
teleoperation markets when we get humanoid
robots there's going to be
somebody around the world that's happy
to fix the fact that it can't finish
loading my dishwasher when I'm unhappy
with it but that's just going to be part
of the Tesla Service package I'm I'm
just imagining like AI agent talking to
another AI agent one company has an AI
agent that specializes in helping other
AI agents but if you can make things
that are good at one step you can just
you can stack them together so that's
why I'm like if it takes a long time
we're going to build infrastructure that
enables it you saw the Operator launch
they have Partnerships with certain
websites with DoorDash with OpenTable
with things like this those Partnerships
are going to let them climb really fast
their model's going to get really good
at those things it's going to be proof of
concept that might be a network effect
where more companies want to make it
easier for AI some companies will be
like no let's put blockers in place yeah
and this is a story of the internet
we've seen we see it now with training
data for language models where companies
are like no you have to pay and it's
businesses working it out that said I
think like airlines and
hotels have a very high incentive to make their
site work really well and they usually
don't like if you look at how many
clicks it takes to order an airplane ticket
it's insane you actually can't
call an American Airlines agent anymore
they they don't have a phone number it's
I mean it's it's it's horrible on many
on the interface front and so to
imagine that agents will be able to deal
with that website when I as a human
struggle like I have an existential
crisis every time I try to book an
airplane ticket that
I think it's going to be
extremely difficult to build an AI
agent that's robust but think about
it like United has accepted the
Starlink terms which is they have to
provide Starlink for free and the users
are going to love it what if one Airline
is like we're going to take a year and
we're going to make our website have
white text that works perfectly for the
AIs every time anyone asks an AI about a
flight they buy whatever airline it is
or they just say here's an API
it's only exposed to AI agents and if
anyone queries it the price is 10% higher
and and for any flight but we'll let you
see any of our flights and you can just
book any of them here you go agent and
then it's like and I made 10% higher
price awesome and like am I willing to
pay that for like hey book me a flight
to CX right and it's like yeah whatever
I think I think you know computers and
real world and the open world are really
really messy um but if you start
defining the problem in narrow
regions people are going to be able to
create very very productive things um
and and Ratchet down cost massively
right like now crazy things like you
know robotics in the home you know those
are going to be a lot harder to do just
like self-driving right because there's
just a billion different failure modes
right but but like agents that can like
navigate a certain set of websites and
do certain sets of tasks or like
take a photo of your fridge
or upload your recipes and then
it figures out what to order from
Amazon slash Whole Foods food
delivery that's going
to be like pretty quick and easy to do I
think so it's going to be a whole
range of business outcomes and there's
going to be tons of sort of
optimism around it people can just figure
out ways to make money to be clear these
sandboxes already exist in research
there are people who have built clones
of all the most popular websites of
Google Amazon blah blah blah to make it
so that there's and I mean open AI
probably has them internally to train
these things it's the same as deep Minds
robotics team for years has had clusters
for robotics where you like you interact
with robots fully remotely they just
have a lab in London and you send tasks
to it it arranges the blocks and you do
this research obviously there are techs
there that fix stuff but we've turned
these cranks of automation before you go
from sandbox to progress and then you
add one more domain at a time and
generalize. I think in the history of NLP and language processing, instruction tuning used to be one language model doing one task, and then in the instruction-tuning literature there's this point where you start adding more and more tasks together and it just starts generalizing to every task. and we don't know where on this curve we are. I think for reasoning, with this RL in verifiable domains, we're very early, but we don't know where the point is where you just start training on enough domains and, poof, more domains just start working and you've crossed the generalization barrier. well, what do you
think about the programming
context so software
engineering? that's where I personally, and I know a lot of people, interact with AI the most. there's a lot of fear and angst too from current CS students, but that's also the area where probably the most AI revenue and productivity gains have come, whether it be Copilot or Cursor or what have you, or just standard ChatGPT. I know very few programmers who don't have ChatGPT, and actually many of them have the $200 tier, because it's so good. I think in that world we already see it. SWE-bench, if you've looked at the benchmark, made by some Stanford students, I wouldn't say it's really hard, but I wouldn't say it's easy either. I think it takes someone who's been through at least a few years of CS, or a couple years of programming, to do SWE-bench well,
and the models went from 4% to 60% in like a year. and where are they going to go next year? it's going to be higher. it probably won't be 100%, because, again, getting those last nines is really hard, but we're going to get to some point where that saturates, and then we're going to need harder software engineering benchmarks, and so on and so forth. the way people think of it now is: it can do code completion, easy; it can do some function generation and I have to review it, great. but really, software engineering agents I think can be done faster, sooner, than any other agent, because it is a verifiable domain. you can always unit test or compile, and there are many other advantages, like it can inspect the whole code base at once, which no engineer really can. only the architects can really think about this stuff, the really senior guys, and they can define stuff and then the agent can execute on it. so I think software engineering costs are going to plummet like crazy.
and one interesting aspect of that is, when software engineering costs are really low, you get very different markets. in the US you have all these platform SaaS companies, Salesforce and so on and so forth. in China, no one uses platform SaaS; everyone just builds their own stack, because software engineering is much cheaper in China, partially because of the number of STEM graduates, etc., so it's generally just cheaper to do. and at the same time, Copilot-like code LLMs have been adopted much less in China, because the cost of an engineer there is much lower. but what happens when every company can just invent their own business logic really cheaply and quickly? you stop using platform SaaS, you start building custom tailored solutions, you change them really quickly. now, all of a sudden, your business is a little bit more efficient too, potentially, because you're not dealing with the hell that is some random platform SaaS company's stuff not working perfectly and having to adjust workflows, or random business automation cases that don't necessarily require AI, it's just logic that needs to be built that no one has built. all of these things can happen fast, I think, with software. and then the
other domain is: industrial, chemical, and mechanical engineers suck at coding, generally, and their tools are old. semiconductor engineers' tools are 20 years old; all the tools run on XP, including ASML lithography tools, which run on Windows XP. and a lot of the analysis happens in Excel. it's like, guys, you can move 20 years forward with all the data you have gathered and do a lot better. you just need the engineering skills for software engineering to be delivered to the actual domain experts. so I think that's the area where I'm super duper bullish on AI generally creating value. the big
picture is that I don't think it's going to be a cliff. I think a really good example of how growth changes is when Meta added stories: Snapchat was on an exponential, Meta added stories, and Snapchat flatlined. software engineering has been up and to the right; AI is going to come in and it's probably just going to go flat. it's not like everyone's going to lose their job. it's hard, because the supply corrects more slowly: the number of students is still growing, and that'll correct on a multi-year delay, but the number of jobs will just turn. and then maybe in 20 to 40 years it'll be well down, but in the next few years there's never going to be the snap moment where it's like, software engineers aren't useful. I think also
the nature of what it means to be a programmer, and what kind of jobs programmers do, changes, because I think there needs to be a human in the loop of everything you've talked about. there's a really important human in that picture: correcting the code, fixing things larger than the context length, and debugging, like debugging by reading the code, understanding it, steering the system, no no no, you missed the point, adding more to the prompt. yes. and the human designing the perfect Google button. Google's famous for having people design buttons that are so perfect, and it's like, how is AI going to do that? it could give you all the ideas,
perfect, fine. I mean, that's the thing, you can call it taste. one thing humans can do is figure out what other humans enjoy, better than AI systems can. that's where the preference comes from, loading that in; ultimately humans are the greatest preference generator, that's where the preference comes from. and humans are actually very good at reading, or judging, between two things. this goes back to the core of what RLHF and preference tuning is: it's hard to generate a good answer for a lot of problems, but it's easy to see which one is better, and that's how we're using humans for AI now, judging which one is better. and that's what software engineering could look like: the PR review, here are a few options, here are some potential pros and cons, and they're going to be judged.
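the "hard to generate, easy to judge" idea being described is the Bradley-Terry model that sits underneath most preference tuning: pairwise human judgments train a reward model. a minimal sketch, with invented reward scores (not any lab's actual training code):

```python
import math

# Sketch of the pairwise-preference core of RLHF: a human picks the
# better of two outputs, and a Bradley-Terry model turns reward-model
# scores into the probability that the chosen one beats the rejected one.

def preference_prob(reward_chosen, reward_rejected):
    """P(chosen beats rejected) under the Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def pairwise_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood the reward model is trained to minimize."""
    return -math.log(preference_prob(reward_chosen, reward_rejected))

# Invented scores where the reward model agrees with the human judgment:
print(round(preference_prob(2.0, 0.0), 3))
print(round(pairwise_loss(2.0, 0.0), 3))
```

the design point is exactly what's said above: the human never has to produce a good answer, only rank two of them, and the ranking is what trains the model.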
thing I would very much recommend is
people start uh programmers start using
Ai and embracing that role of the
supervisor of the AI system and like
partner of the AI system verus is
writing from scratch or not learning
coding at all and just generating stuff
CU I think there actually has to be a
pretty high level of expertise as a
programmer to be able to manage
increasingly intelligent systems I think
it's I think it's that and then becoming
a domain expert in something sure right
because, seriously, if you go look at aerospace, or semiconductors, or chemical engineering, everyone is using really crappy platforms, really old software. the job of a data scientist is like a joke in many cases, and in other cases it's very real. but bring what the forefront of human capability is to your domain, and even if the forefront comes from the AI, in your domain you're at the forefront. you have to be at the forefront of something, and then leverage the rising tide that is AI for everything else. oh yeah, there's so much low-hanging
fruit everywhere, in terms of where software can help automate a thing or digitize a thing, in the legal system... I mean, that's why DOGE is exciting.
I've gotten to hang out with a bunch of the DOGE folks, and, I mean, government is so old school, it's begging for the modernization of software, of organizing the data, all this kind of stuff. in that case it's by design, because bureaucracy protects centers of power and so on, but software breaks down those barriers. so it hurts those that are holding on to power, but ultimately benefits humanity. so there's a bunch of domains of that
kind. one thing we didn't fully finish talking about is open source. so, first of all, congrats, you released a new model. yeah, this is Tulu. I'll explain what a Tulu is: a Tulu is a hybrid camel, when you breed a dromedary with a Bactrian camel. back in the early days after ChatGPT, there was a big wave of models coming out, like Alpaca, Vicuna, etc., that were all named after various mammal species. so Tulu is the
brand, which is multiple years old, which comes from that. and we've been playing at the frontiers of post-training with open-source code. the first part of this release was in the fall, where we built on Llama's open models, open-weight models, and then we add in our fully open code, our fully open data. there's a popular benchmark that is Chatbot Arena, and that's generally the metric by which these chat models are evaluated: humans compare random models from different organizations. and if you looked at the leaderboard in November or December, among the top 60 models, from tens to 20s of organizations, none of them had open code or data for just post-training. among those, even fewer, or none, have pre-training data and code available. but post-training is much more accessible at this time, it's still pretty cheap and you can do it, and the thing is, how high can we push this number where people have access to all the code and data? so that's kind of the motivation of the project. we draw in lessons from Llama; Nvidia had a Nemotron model where the recipe for their post-training was fairly open, with some data and a paper. and it's putting all these together to try to create a recipe that people can use to fine-tune models like GPT-4 to their domain. so, to be clear, in the case
of Tulu, maybe you can talk about OLMo too, but in the case of Tulu you're taking Llama 3 405B. Tulu has been a series of recipes for post-training, so we've done multiple models over the years. yeah, and so you're open-sourcing everything? yeah. if you start with an open-weight base model, the whole model technically isn't open source, because you don't know what Llama put into it, which is why we have the separate thing that we'll get to. but it's just getting parts of the pipeline where people can zoom in and customize. I hear from startups and businesses that are like, okay, I can take this post-training and try to apply it to my domain. we talk about verifiers a lot; we use this idea, which is reinforcement learning with verifiable rewards, RLVR, kind of similar to RLHF, and we applied it to math.
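RLVR, as described here, swaps RLHF's learned reward model for a programmatic check. for math, that can be as simple as extracting a final answer and comparing it to ground truth. a hypothetical sketch: the \boxed{} answer format is an assumption for illustration, and real answer matching is more involved:

```python
import re

# Hypothetical RLVR-style reward for math problems: binary, from a
# verifier, not from a learned reward model. Assumes the model ends
# its chain of thought with a \boxed{...} final answer.

def verifiable_reward(model_output, ground_truth):
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0  # no parsable final answer -> no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(verifiable_reward(r"... so the answer is \boxed{42}", "42"))
print(verifiable_reward("the answer is forty-two", "42"))
```

because the reward is a deterministic check rather than a learned judge, it can't be gamed the way a reward model can, which is what makes math and code such good early domains for this.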
the model today, which is, we applied it to the Llama 405B base model from last year, and we have our other stuff, our instruction tuning and preference tuning. but the math thing is interesting, which is, it's easier to improve this math benchmark. there's a benchmark, MATH, all capitals, tough name, when the benchmark's name is the area that you're evaluating. we're researchers, we're not brand strategists. and this is something that
the DeepSeek paper talked about as well: at this bigger model size, it's easier to elicit powerful capabilities with this RL training, and then they distill it down from that big model to the small model. and with this model we released today we saw the same thing. we're at AI2, we don't have a ton of compute, we can't train 405B models all the time, so we just did a few runs, and they tend to work. it just shows that there's a lot of room for people to play in these things. and they crushed
Llama's actual release, right? they're way better than it. yeah, so our eval numbers, I mean, we have extra months on this, but our eval numbers are much better than the Llama instruct models that they released. and you also said better than DeepSeek V3? yeah, on our eval benchmark. mostly, DeepSeek V3 is really similar. we have a safety benchmark, to understand if the model will say harmful things and things like that, and that's what draws down most of the gap.
is it an amalgamation of multiple benchmarks, or what do you mean?
yeah, so we have about ten. this is standard practice in post-training: you choose the evaluations you care about. in academics and smaller labs you'll have fewer evaluations; in companies you'll have really one domain that you care about; in frontier labs you'll have tens to 20s, maybe even 100 evaluations of specific things. so we choose a representative suite of things that look like chat; precise instruction following, which is like, respond only in emojis, does the model follow weird things like that; math; code; and you create a suite like this. so safety would be one of ten in that type of suite, where you have what the broader
community of AI cares about. and for example, in comparison to DeepSeek, it would be something like: our model's eval average would be 80, including safety, and similar without, and DeepSeek would be like a 79% average score without safety, and their safety score would bring it down. so you beat them even ignoring safety? yeah.
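the averaging being described, a representative suite where safety is one of roughly ten evals, reported with and without it, is simple arithmetic. the scores below are invented for illustration, not the actual Tulu or DeepSeek numbers:

```python
# Invented scores for a hypothetical eval suite; the point is only
# how including or excluding the safety eval shifts the average.
evals = {
    "chat": 85.0,
    "precise_instruction_following": 82.0,
    "math": 78.0,
    "code": 75.0,
    "safety": 90.0,
}

def average(scores, include_safety=True):
    kept = {k: v for k, v in scores.items()
            if include_safety or k != "safety"}
    return sum(kept.values()) / len(kept)

print(average(evals))                        # suite average with safety
print(average(evals, include_safety=False))  # suite average without it
```

this is why labs report both numbers: a strong or weak safety eval can move the headline average even when the capability evals are unchanged.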
this is something where, internally, I don't want to win only by how you shape the eval benchmark. people may or may not care about safety in their model; safety can come downstream, safety can be added when you host the model for an API; safety is addressed in a spectrum of locations in an application. so if you want to say that you have the best recipe, you can't just gate it on these things that some people might not want. and this is just the
march of progress: we benefit if we can release a model later, because we have more time to learn new techniques, like this RL technique. we had started this in the fall; it's now really popular, reasoning models. the next thing to do for open-source post-training is to scale up verifiers, to scale up data, to replicate some of DeepSeek's results. and it's awesome that we have a paper to draw on; it makes it a lot easier. and that's the type of thing that is going on among academic and closed frontier research in AI. since you're pushing open
source, what do you think is the future of it? do you think DeepSeek actually changes things, since it's open source, or open weight, or is pushing the open-source movement in the open direction? this goes back to the license discussions. DeepSeek R1 with a friendly license is a major reset. it's the first time that we've had a really clear frontier model that is open weights, with a commercially friendly license, with no restrictions on downstream use cases, synthetic data, distillation, whatever. this has never been the case at all in the history of AI in the last few years since ChatGPT. there have been models that are off the frontier, or models with weird licenses such that you can't really use them. isn't Meta's license pretty much permissive, except for five companies?
well, this goes to what open-source AI is: there are also use-case restrictions in the Llama license which say you can't use it for specific things, so if you come from an open-source software background, you would say that that is not an open-source license. what kind of things are those, though? at this point I can't pull them all off the top of my head. it used to be that military use was one, and they removed that for Scale. and it'll be things like CSAM, child abuse material; that's the type of thing that is forbidden there. but that's enough, from an open-source background, to say it's not an open-source license. and
also, the Llama license has this horrible thing where you have to name your model llama-something if you touch the Llama model, so it's the branding thing. if a company uses Llama, technically the license says that they should say "built with Llama" at the bottom of their application, and from a marketing perspective, that just hurts. I could suck it up as a researcher, I'm like, oh, it's fine, it says llama-dash on all of our materials for this release. but this is why we need truly open models; also, we don't know DeepSeek R1's data. wait, so you're saying I can't make a cheap copy of Llama and pretend it's mine, but I can do this with the Chinese model? yeah, hell
yeah, that's what I'm saying. and that's why we want this whole open-language-models thing, OLMo, to try to keep the model where everything is open, with the data, as close to the frontier as possible. we're compute-constrained, we're personnel-constrained, we rely on getting insights from people like John Schulman telling us to do RL on outputs. we can make these big jumps, but it just takes a long time to push the frontier of open source. and
fundamentally, I would say that that's because open-source AI does not have the same feedback loops as open-source software. we talked about open-source software for security; it's also just that you build something once and you can reuse it. if you go into a new company, there are so many benefits. but if you open-source a language model, you have this data sitting around, you have this training code, and it's not that easy for someone to come and build on it and improve it, because you need to spend a lot on compute, you need to have expertise. so until there are feedback loops in open-source AI, it seems like mostly an ideological mission. people like Mark Zuckerberg, which is like, America needs this, and I agree with him. but in the time when the ideological motivation is high, we need to capitalize and build this ecosystem around: what benefits do you get from
seeing the language model data? and there's not a lot out there. we're going to try to launch a demo soon where you can look at an OLMo model and a query and see what pre-training data is similar to it, which was legally risky and complicated. but it's like, what does it mean to see the data that the AI was trained on? it's hard to parse, it's terabytes of files. I don't know what I'm going to find in there, but that's what we need to do as an ecosystem, if people want open-source AI to be financially useful. we didn't
really talk about Stargate. I would love to get your opinion on the new administration, the Trump administration, everything that's being done from the American side in supporting AI infrastructure and the efforts of the different AI companies. what do you think about Stargate? what are we supposed to think about Stargate, and does Sam have the money? yeah, so I think Stargate is an opaque thing. it definitely doesn't have $500 billion; it doesn't even have $100 billion. so what they announced
is this $500 billion number. Larry Ellison, Sam Altman, and Trump said it, and they thanked Trump. and Trump did do some executive actions that significantly improve the ability for this to be built faster. one of the executive actions he did is that on federal land you can just basically build data centers and power, pretty much like that, and the permitting process is basically gone, or you file after the fact. so, again, like I had a schizo take earlier, another schizo take: if you've ever been to the Presidio in San Francisco, beautiful area, you could build a power plant and a data center there if you wanted to, because it is federal land; it used to be a military base. but, you know, obviously this would piss people off. it's a good bit.
anyways, Trump has made it much easier to do this. generally, Texas has the only unregulated grid in the nation as well, let's go Texas, and so ERCOT enables people to build faster as well. in addition, the federal regulations are coming down. and so Stargate is predicated on this, and this is why that whole show happened. now, how they came up with a $500 billion number is beyond me; how they came up with a $100 billion number makes sense to some extent. and there's actually a good table in that Stargate piece that I had that I would like to show,
it's the most recent one. yeah, so, anyways, Stargate. there's a table in there about cost. there, you passed it already, it's that one. so this table is kind of explaining what happens. Stargate is in Abilene, Texas, the first $100 billion of it. that site is 2.2 gigawatts of power in, and about 1.8 gigawatts of power
consumed, per GPU they have like roughly... Oracle was already building the first part of this before Stargate came about, to be clear; they've been building it for a year. they tried to rent it to Elon, in fact, but Elon was like, it's too slow, I need it faster, so then he went and did his Memphis thing. and so OpenAI was able to get it, with this weird joint venture called Stargate. they initially signed a deal with just Oracle for the first section of this cluster. this first section of the cluster is roughly $5 to $6
billion of server spend, and then there's another billion or so of data center spend. and likewise, if you fill out that entire 1.8 gigawatts with the next two generations of Nvidia chips, GB200, GB300, VR200, and you fill it out completely, that ends up being roughly $50 billion of server cost, plus there's data center cost, plus maintenance cost, plus operation cost, plus all these things. and that's where OpenAI gets to the $100 billion announcement that they had: they talked about $100 billion as phase one, and that's this Abilene, Texas data center. it's $100 billion of total cost of ownership, quote unquote. so it's not capex, it's not investment, it's $100 billion of total cost of ownership.
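the TCO framing being laid out is roughly: ~$50 billion of servers for a fully built-out 1.8 GW site, plus a comparable amount of data center construction, power, maintenance, and operations over the site's life. a back-of-envelope sketch using the rough figures from the conversation, not exact ones:

```python
# Back-of-envelope: why "$100 billion" is total cost of ownership,
# not capex. Figures are the rough ones quoted in the conversation.
costs_billion = {
    "servers (GB200/GB300/VR200, full 1.8 GW build-out)": 50,
    "data center, power, maintenance, operations": 50,
}

total_tco = sum(costs_billion.values())
print(f"phase-one TCO: ~${total_tco}B (vs ~$50B of server capex)")
```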
and then there will be future phases. they're looking at other sites that are even bigger than this 2.2 gigawatts, by the way, in Texas and elsewhere, so they're not completely ignoring that. but the $100 billion number that they say is for phase one, which I do think will happen, they don't even have the money for that. furthermore, it's not $100 billion of capex; it's $50 billion of spend, and then $50 billion of operational cost, power, etc., rental pricing, etc., because OpenAI is renting the GPUs from the Stargate joint venture. what money do they actually
have? SoftBank is going to invest, Oracle is going to invest, OpenAI is going to invest. OpenAI is on the line for $19 billion; everyone knows they've only got $6 billion from their last round and $4 billion in debt. but there's news of SoftBank maybe investing $25 billion into OpenAI, so that's part of it; the $19 billion can come from there. so OpenAI does not have the money at all right now, to be clear; ink is not dried on anything. OpenAI has zero dollars for this $50 billion, of which they're legally obligated to put $19 billion of capex into the joint venture, and then the rest they're going to pay via renting the GPUs from the joint venture. and then
there's Oracle. Oracle has a lot of money. they're building the first section completely; they were spending for it themselves, this $6 billion of capex, $10 billion of TCO, and they were going to do that first section, they're paying for that. as far as the rest of the sections, I don't know how much Larry wants to spend. at any point he can pull out; this is completely voluntary, there's nothing signed on this. but he potentially could contribute tens of billions of dollars, to be clear. he's got the money, Oracle's got the money. and then there's
MGX, which is the UAE fund, which technically has $1.5 trillion for investing in AI, but, again, I don't know how real that money is, and there's no ink signed for this. SoftBank does not have $25 billion of cash; they'd have to sell down their stake in Arm, which is, you know, the leader in CPUs, and they IPO'd it. this is obviously what they've always wanted to do; they just didn't know where to redeploy the capital. selling down the stake in Arm makes a ton of sense, so they can sell that down and invest in this if they want to, and invest in OpenAI if they want to. as far as money secured, the first 100,000 GB200 cluster can be funded. everything else after that is up in the air. the money's coming, I
believe the money will come. I personally do; it's just a belief. okay. it's a belief that they are going to release better models and be able to raise more money. but the actual reality is that Elon's right: the money does not exist. what does the US government have to do with anything? what does Trump have to do with everything? is he just a hype man? Trump is reducing the regulation so they can build it faster, and he's allowing them to do it, because any investment of this size is going to involve antitrust stuff, so obviously he's going to allow them to do it; he's going to enable the regulations to actually allow it to be built. I don't believe there are any US government dollars being spent on this,
though. yeah, so I think he's also just creating a general vibe that regulation will go down, and this is the era of building. so if you're a builder, you want to create stuff, you want to launch stuff, this is the time to do it. and we've had this 1.8 gigawatt data center in our data for over a year now, and we've been sending it to all of our clients, including many of these companies that are building the multi-gigawatt ones. but that's at a level that maybe executives don't quite see; they like seeing $500 billion, $100 billion, and then everyone's asking them about it. so it could spur an even faster arms race, because there's already an arms race, but this $100 billion, $500 billion number, Trump talking about it on TV, could spur the arms race to be even faster, and more investors to flood in, etc., etc. so I think you're right, in the sense that OpenAI, or sort of Trump, is championing it: people are going to build more, and his actions are going to let people build more.
what are you excited about in these upcoming years, in terms of cluster build-outs, in terms of breakthroughs in AI? like, the best possible future you can imagine in the next couple of years, 2, 3, 4 years, what does that look like? it could be very specific technical things, like breakthroughs on post-training, or it could be just size, big,
yeah, I mean, it's impressive clusters. I really enjoy tracking supply chain, who's involved in what, I really do. it's really fun to see the numbers, the cost, who's building what capacity, helping them figure out how much capacity they should build, winning deals, strategic stuff. that's really cool. I think technologically there's a lot around the networking side that really excites me, with optics and electronics kind of getting closer and closer, whether it be co-packaged optics or some new forms of switching. this is internal to a cluster? yeah.
also multi-data-center training. people are putting so much fiber between these data centers, and lighting it up with so much bandwidth, that there's a lot of interesting stuff happening on that end. telecom has been really boring since 5G, and now it's really exciting again. on that side, can you
educate me a little bit about the speed of things? the speed of memory versus the speed of interconnect versus the speed of fiber between data centers: are these orders of magnitude different? can we at some point converge towards a place where it all just feels like one computer? no, I don't think that's possible. it's only going to get harder to program, not easier; it's only going to get more difficult and complicated, with more layers. the general
image that people like to have is this hierarchy of memory. on-chip is really close, localized within the chip: there you have registers, and those are shared between some compute elements. then you'll have caches, which are shared between more compute elements. then you have memory, like HBM or DRAM, DDR memory, whatever it is, and that's shared between the whole chip. and then you can have pools of memory that are shared between many chips, and then storage, and you keep zoning out. the access latency within a chip, within a data center, and across data centers is different.
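those latency differences span many orders of magnitude, which is the whole reason the hierarchy can't be papered over. the numbers below are rough, illustrative orders of magnitude, not measurements of any specific hardware:

```python
# Rough access-latency ladder (illustrative orders of magnitude only;
# real values vary by hardware generation and network topology).
latency_ns = {
    "register (on-chip)": 0.3,
    "cache (on-chip)": 1,
    "HBM / DRAM": 100,
    "interconnect to a neighbor chip": 1_000,
    "network within a data center": 10_000,
    "fiber between data centers": 1_000_000,  # ~1 ms and up
}

for level, ns in latency_ns.items():
    print(f"{level:>32}: {ns:>12,.1f} ns")
```

a millionfold spread from registers to cross-data-center fiber is why each level gets its own programming paradigm rather than one flat "single computer" model.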
so you're obviously always going to have different programming paradigms for this; it's not going to be easy. programming this stuff is going to be hard. maybe AI can help with programming it. but the way to think about it is that the more elements you add to a task, you don't get strong scaling: if I double the number of chips, I don't get 2x the performance. this is just a reality of computing, because there are inefficiencies.
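the "doubling chips doesn't give 2x" point is Amdahl's law. with an assumed 5% of the workload that doesn't parallelize (communication, synchronization), the scaling looks like this; the 5% is an invented number for illustration:

```python
# Amdahl's law: speedup(N) = 1 / (serial + (1 - serial) / N).
# serial_fraction = 0.05 is an assumed, illustrative value.

def speedup(n_chips, serial_fraction=0.05):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_chips)

for n in (1, 2, 4, 8):
    print(f"{n} chips -> {speedup(n):.2f}x")
```

doubling from 1 to 2 chips gives about 1.9x here, and the efficiency keeps decaying as chip count grows, which is exactly the gap the interconnect and algorithm work below is trying to close.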
there's a lot of interesting work being
done to make it not you know uh to make
it more linear whether it's making the
chips more networked together more
tightly or uh you know cool programming
models or cool algorithmic things that
you can do on the model side right deep
seek did some of these really cool
Innovations because they were limited on
interconnect but they still needed a
parallel eyes right like all sorts you
know all everyone's always doing stuff
Google's got a bunch of work and
everyone's got a bunch of work about
this
that stuff is super exciting on the model and workload innovation side. on hardware, solid-state transformers are interesting for the power side; there's all sorts of stuff on batteries, all sorts of stuff everywhere. I think when you look at every layer of the compute stack, whether it goes from lithography and etch all the way to fabrication, to optics, to networking, to power, to transformers, to cooling, and you just go up and up and up the stack, even air conditioners for data centers are innovating. copper cables are innovating; you wouldn't think it, but there are some innovations happening there with the density of how you can pack them. it's all of these layers of the stack, all the way up to the models. human progress is at a
pace that's never been seen before I'm
just imagining you sitting back in a
lair somewhere with screens everywhere
just monitoring the supply chain where
all these clusters are all the
information you're gathering I mean you
do have a big team there's a big
team I mean you do quite
incredible work with SemiAnalysis I
mean
just keeping your finger on the pulse
of human civilization in the digital
world it's pretty cool just to
watch to feel that yeah thank you I guess
we all feel it we're doing epic
feel the AGI I mean from
meme to reality um what Nathan is
there like breakthroughs that you're
looking forward to potentially I
had a while to think about this while
listening to Dylan's beautiful answer
he didn't listen to me no I knew this
was coming and it's like realistically
training models is very fun because
there's so much low-hanging fruit and
the thing that makes my job entertaining
is I train models I write analysis about
what's happening with models and it's
fun because there is obviously so much
more progress to be had and the real
motivation why I do this somewhere where
I can share things is that
I don't trust people that are like trust
me bro we're going to make AI good
we're the ones doing it and you can
trust us and we're just going to have
all the AI I would
like a future where more people have a
say in what AI is and can understand it
and it's a little bit less
fun that it's not just a positive thing
like this is all really fun
training models is fun and bringing people
in is fun but really if AI
is going to be the most powerful
technology of my lifetime we
need to have a lot of people involved in
making it and making it open
helps with that as accessible as
possible as open as possible yeah
my read of the last few years is that
more openness would help the AI
ecosystem in terms of having more people
understand what's going on whether
that's researchers from non-AI fields
to governments to everything it doesn't
mean that openness will always be the
answer I think then I will reassess
what is the biggest problem facing
AI and take a different angle on the
wild ride that we're on and uh for me
just from even the user experience
anytime you have like Karpathy said the
aha moments the magic like
seeing the reasoning the chain of
thought there's something
really just fundamentally beautiful
about that it's putting a mirror to
ourselves and seeing oh it is
solving intelligence as the cliche
goal of these companies is and you get
to understand why we humans are
special the intelligence within us is
special and for now also why we're
special in terms of we seem to be
conscious and the AI systems for now
aren't and we get to
explore that mystery so it's just
really cool to get to explore these
questions that I don't
think I would have ever imagined
would be even possible back when I was
just watching with excitement Deep Blue
beat Kasparov I wouldn't have ever
thought this kind of AI would be
possible in my lifetime this
really feels like AI it's incredible
I started with AI learning to fly a
quadrotor it's like learn to fly
and it just learned to fly up it
would hit the ceiling and stop and we'd
catch it it's like okay that is really
stupid compared to what's going on now
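the quadrotor flying straight up into the ceiling is a classic reward misspecification failure; a minimal hypothetical sketch (the numbers and setup are invented for illustration, not from the original experiment) of how a naive altitude reward produces exactly that behavior:

```python
# a toy 1-D "quadrotor": reward is raw altitude with no penalty near the
# ceiling, so a greedy learner climbs until it is pinned against it
CEILING = 3.0  # meters (arbitrary)

def reward(z):
    # naive reward: higher altitude is always better
    return z

def step(z, v, dt=0.1):
    # simple kinematics: v is vertical velocity, position clipped to [0, ceiling]
    return min(CEILING, max(0.0, z + v * dt))

def greedy_rollout(steps=100):
    # a policy greedy with respect to this reward always chooses "up"
    z = 0.0
    for _ in range(steps):
        up, down = step(z, +1.0), step(z, -1.0)
        z = up if reward(up) >= reward(down) else down
    return z

print(greedy_rollout())  # ends pinned at the ceiling height
```

fixing this kind of behavior means shaping the reward, for example penalizing proximity to obstacles, rather than just maximizing altitude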
and now you could probably with natural
language tell it to learn to fly and
it's going to generate the control
algorithm required to do that probably
there's low-level blockers like we had to
do some weird stuff for that but yeah
that's definitely our robotics
conversation yeah when you have to
interact in an actual physical world
it's hard what gives you hope about the
future of human
civilization looking into the next 10
years 100 years a thousand years how long
do you think we make it do you think we've
got a thousand years humans will definitely
be around in a thousand years I think
there are ways that very bad
things could happen and there'll be way
fewer humans but humans are very good at
surviving there's been a lot of things
yeah that is true I don't think we're
necessarily good at long-term
credit assignment of risk but when the
risk becomes immediate we tend to figure
things out and yeah for that reason
there's physical constraints to
things like AGI hyper recursive
improvement kill-us-all type stuff
for the physical reasons and for how
humans have figured things out before
I'm not too worried about an AI takeover
there are other international things
that are worrying but there's just
fundamental human goodness and trying to
amplify that we're at a tenuous
time and I mean if you look at humanity
as a whole there's been times where
things go backwards there's times when
things don't happen at all and we're on
what should be a very positive
trajectory right now yeah there seems to
be progress but just like with
power there's spikes of human
suffering and we want to try to minimize
the amount of spikes generally humanity is
going to suffer a lot less right I'm
very optimistic about that um I do worry
about techno-fascism type stuff arising
as AI becomes more and more prevalent
and powerful and those who control it
can do more and more maybe it doesn't
kill us all but at some point every
very powerful human is going to want a
brain-computer interface so that they
can interact with the AGI and all of
its advantages in many more ways and
merge their mind with it sort of
and its capabilities or that
person's capabilities can leverage
those much better than anyone else and
therefore it won't be one
person to rule them all but the thing
I worry about is it'll be a few people
hundreds thousands tens of
thousands maybe millions of people ruling
whoever's left and the economy
around it and I think that's
the thing that's probably more
worrisome human-machine
amalgamations this enables an individual
human to have more impact on the world
and that impact can be both positive and
negative right
uh generally humans have positive
impacts on the world at least on society
but it's possible for individual humans
to have such negative impacts and AGI at
least as I think the labs define it
which is not a runaway sentient thing
but rather just something that can do a
lot of tasks really efficiently
amplifies the capabilities of someone
causing extreme damage but for
the most part I think it'll be used for
profit-seeking motives which
will increase the
abundance and supply of things and
therefore reduce suffering
right yeah that's the goal if scrolling on
a timeline is just stasis it holds
the status quo of the
world that is a positive outcome right
it's like if I have food tubes and
I'm scrolling and I'm happy that's a
positive
outcome while expanding out into the
cosmos uh well this is a fun time to be
alive and thank you for pushing the
forefront of what is possible for humans
and thank you for talking today this was
fun thanks for having us thanks for
having us thanks for listening to this
conversation with Dylan Patel and Nathan
Lambert to support this podcast please
check out our sponsors in the
description and now let me leave you
with some words from Richard Feynman for a
successful technology reality must take
precedence over public relations for
nature cannot be
fooled thank you for listening and hope
to see you next time