DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters | Lex Fridman Podcast #459
The following is a conversation with Dylan Patel and Nathan Lambert. Dylan runs SemiAnalysis, a well-respected research and analysis company that specializes in semiconductors, GPUs, CPUs, and AI hardware in general. Nathan is a research scientist at the Allen Institute for AI and is the author of the amazing blog on AI called Interconnects. They are both highly respected and listened to by the experts, researchers, and engineers in the field of AI, and personally, I'm just a fan of the two of them. So I used the DeepSeek moment that shook the AI world a bit as an opportunity to sit down with them and lay it all out: from DeepSeek, OpenAI, Google, xAI, Meta, Anthropic, to NVIDIA and TSMC, to US-China-Taiwan relations, and everything else that is happening at the cutting edge of AI.

This conversation is a deep dive into many critical aspects of the AI industry. While it does get super technical, we try to make sure that it's still accessible to folks outside of the AI field by defining terms, stating important concepts explicitly, spelling out acronyms, and in general always moving across the several layers of abstraction and levels of detail. There is a lot of hype in the media about what AI is and isn't. The purpose of this podcast, in part, is to cut through the hype and the low-resolution analysis, and to discuss in detail how stuff works and what the implications are.

Let me also, if I may, comment on the new OpenAI o3-mini reasoning model, the release of which we were anticipating during the conversation, and it did indeed come out right after. Its capabilities and costs are on par with our expectations, as we stated. OpenAI o3-mini is indeed a great model, but it should be stated that DeepSeek R1 has similar performance on benchmarks, is still cheaper, and it reveals its chain-of-thought reasoning, which o3-mini does not; it only shows a summary of the reasoning. Plus, R1 is open weights, and o3-mini is not.

By the way, I got a chance to play with o3-mini, and anecdotal-vibe-check-wise, I felt that o3-mini, specifically o3-mini-high, is better than R1. Still, for me personally, I find that Claude 3.5 Sonnet is the best model for programming, except for tricky cases, where I will use o1-pro to brainstorm. Either way, many more and better AI models will come, including reasoning models, both from American and Chinese companies. They will continue to shift the cost curve, but the quote "DeepSeek moment" is indeed real. I think it will still be remembered five years from now as a pivotal event in tech history, due in part to the geopolitical implications, but for other reasons too, as we discuss in detail from many perspectives in this conversation.

This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Dylan Patel and Nathan
Lambert.

A lot of people are curious to understand China's DeepSeek models, so let's lay it out. Nathan, can you describe what DeepSeek V3 and DeepSeek R1 are, how they work, and how they're trained? Let's look at the big picture, and then we'll zoom in on the details.

Yeah, so DeepSeek V3 is a new mixture-of-experts transformer language model from DeepSeek, which is based in China. They have some new specifics in the model that we'll get into. Largely, this is an open-weight model, and it's an instruction model, like what you would use in ChatGPT. They also released what is called the base model, which is before these techniques of post-training. Most people use instruction models today, and those are what are served in all sorts of applications. This was released on, I believe, December 26th, or that week, and then weeks later, on January 20th, DeepSeek released DeepSeek R1, which is a reasoning model, and that really accelerated a lot of this discussion.

This reasoning model has a lot of overlapping training steps with DeepSeek V3, and it's confusing that you have a base model called V3 that you do some things to to get a chat model, and then you do some different things to get a reasoning model. I think a lot of the AI industry is going through this challenge of communications right now, where OpenAI makes fun of their own naming scheme: they have GPT-4, they have OpenAI o1, and there are a lot of types of models. So we're going to break down what each of them is. There are a lot of technical specifics on training, and we'll go from high level to specific and kind of go through each of them.

There are so many places we can go here, but maybe let's go to open weights first. What does it mean for a model to be open weights, and what are the different flavors of open source in general?

Yeah, so this discussion has been going on for a long time in AI. It became more important, or more focal, since ChatGPT at the end of 2022. Open weights is the accepted term for when the model weights of a language model are available on the internet for people to download. Those weights can have different licenses, which are effectively the terms by which you can use the model. There are licenses that come from the history of open source software; there are licenses that are designed by companies specifically. All of Llama, DeepSeek, Qwen, Mistral — these popular names in open-weight models — have some of their own licenses. It's complicated, because not all of the same models have the same terms.

The big debate is on what makes a model open weight. It's like, why are we saying this term? It's kind of a mouthful. It sounds close to open source, but it's not the same. There's still a lot of debate on the definition and soul of open-source AI. Open source software has a rich history on freedom to modify, freedom to take on your own, freedom from many restrictions on how you would use the software, and what that means for AI is still being defined.
So, for what I do: I work at the Allen Institute for AI. We're a nonprofit. We want to make AI open for everybody, and we try to lead on what we think is truly open source. There's not full agreement in the community, but for us, that means releasing the training data, releasing the training code, and then also having open weights like this. And we'll get into the details of the models, and again and again, as we try to get deep into how the models were trained, we will say things like: the data processing, data filtering, data quality is the number one determinant of the model quality, and then a lot of the training code is the determinant of how long it takes to train and how fast your experimentation is. So without fully open-source models, where you have access to this data, it is hard to know, or it's harder to replicate. We'll get into cost numbers for DeepSeek V3, on mostly GPU hours and how much you could pay to rent those yourself, but without the data, the replication cost is going to be far, far higher, and the same goes for the code.

We should also say that this is probably one of the more open models out of the frontier models. So on this full spectrum, where the fullest open source is, like you said, open code, open data, open weights — this is not open code, this is probably not open data, and this is open weights. And the licensing is MIT license, or — I mean, there's some nuance in the different models — but it's towards the free end of the open source movement. These are kind of the good guys.

Yeah, DeepSeek is doing fantastic work for disseminating understanding of AI. Their papers are extremely detailed in what they do, and for other teams around the world, they're very actionable in terms of improving your own training techniques.
And we'll talk about licenses more. The DeepSeek R1 model has a very permissive license: it's called the MIT license. That effectively means there are no downstream restrictions on commercial use, there are no use case restrictions, and you can use the outputs from the models to create synthetic data. And this is all fantastic. I think the closest peer is something like Llama, where you have the weights and you have a technical report, and the technical report is very good for Llama. One of the most read PDFs of the year last year was the Llama 3 paper, but in some ways it's slightly less actionable: it has fewer details on the training specifics, fewer plots, and so on. And the Llama 3 license is more restrictive than MIT, and then between the DeepSeek custom license and the Llama license, we could get into this whole rabbit hole. We'll make sure we go down the license rabbit hole before we do specifics.

Yeah, and I mean, it should be stated that one of the implications of DeepSeek is it puts pressure on Llama and everybody else, on OpenAI, to push towards open source. And that's the other side of open source that you mentioned: how much is published in detail about it. How open are you with the insights behind the code? Like, how good are the technical reports? Are they hand-wavy, or is there actual detail in there? And that's one of the things that DeepSeek did well: they publish a lot of the details.

Yeah, especially in DeepSeek V3, which is their pre-training paper. They were very clear that they are doing interventions on the technical stack that go at many different levels. For example, to get highly efficient training, they're making modifications at or below the CUDA layer for NVIDIA chips. I have never worked there myself, and there are only a few people in the world that do that very well, and some of them are at DeepSeek. These types of people are at DeepSeek and at leading American frontier labs, but there are not many places.
To help people understand the other implication of open weights — just, you know, there's a topic we'll return to often here — there's a fear that China, the nation, might have an interest in stealing American data, violating the privacy of American citizens. What can we say about open weights to help us understand what the weights are able to do, in terms of stealing people's data?

Yeah, so these weights that you can download from Hugging Face or other platforms are very big matrices of numbers. You can download them to a computer in your own house that has no internet, and you can run this model, and you're in total control of your data. That is something that is different from how a lot of language model usage is actually done today, which is mostly through APIs, where you send your prompt to GPUs run by certain companies. These companies will have different distributions and policies on how your data is stored, whether it is used to train future models, where it is stored, whether it is encrypted, and so on. With the open weights, you have the fate of your data in your own hands, and that is something that is deeply connected to the soul of open source.

So it's not the model that steals your data; it's whoever is hosting the model. Which could be China, if you're using the DeepSeek app, or it could be Perplexity — you're trusting them with your data — or OpenAI — you're trusting them with your data. Some of these are American companies, some of these are Chinese companies, but the model itself is not doing the stealing; it's the host.
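To make that point concrete, here is a minimal sketch of what downloading and running open weights locally looks like, using the Hugging Face transformers library. The repo id below is one of DeepSeek's smaller open-weight releases; treat the surrounding details as illustrative rather than a recommended setup.

```python
# Minimal sketch: once the weights are on disk, inference needs no network access.
# Assumes the `transformers` and `torch` packages are installed; any open-weight
# model on the Hugging Face Hub works the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # a small open-weight model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # the "big matrices of numbers"

# From here on, everything runs on your own machine: your prompt never
# leaves the computer, which is the privacy point being made above.
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```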
All right, so back to the basics: what's the difference between DeepSeek V3 and DeepSeek R1? Can we try to lay out the potential confusion?

Yes. So, for one, I have a lot of understanding for the many people confused by these two model names. I would say the best way to think about this is that, when training a language model, you have what is called pre-training, which is when you're predicting on large amounts of mostly internet text: you're trying to predict the next token. And what to know about these new DeepSeek models is that they do this internet-scale pre-training once, to get what is called DeepSeek V3 Base. This is a base model: it's just going to finish your sentences for you. It's going to be harder to work with than ChatGPT. And then what DeepSeek did is two different post-training regimes, to make the models have specific desirable behaviors.

So what is the more normal model, in terms of the last few years of AI — an instruct model, a chat model, a quote-unquote "aligned" model, a helpful model; there are many ways to describe this — is the more standard post-training. This is things like instruction tuning and reinforcement learning from human feedback; we'll get into some of these words. And this is what they did to create the DeepSeek V3 model. This was the first model to be released, and it is very high-performant: it's competitive with GPT-4, Llama 405B, and so on.

And then, when this release was happening — we don't know their exact timeline — or soon after, they were finishing the training of a different training process, from the same next-token-prediction base model that I talked about, which is where this new reasoning training that people have heard about comes in, in order to create the model that is called DeepSeek R1. The R, through this conversation, is good grounding for "reasoning," and the name is also similar to OpenAI o1, which is the other reasoning model that people have heard about. We'll have to break down the training for R1 in more detail, because, for one, we have a paper detailing it, but also it is a far newer set of techniques for the AI community, so it is a much more rapidly evolving area of research.

Maybe we should also say the big two categories of training: pre-training and post-training, these umbrella terms that people use. So what is pre-training, what is post-training, and what are the different flavors of things underneath the post-training umbrella?
Yeah, so pre-training — I'm using some of the same words to really get the message across — is where you're doing what is called autoregressive prediction, to predict the next token in a series of documents. This is done over, standard practice is, trillions of tokens, so this is a ton of data that is mostly scraped from the web. In some of DeepSeek's earlier papers, they talk about their training data being — "distilled" for math, and I shouldn't use this word yet — but taken from Common Crawl. That's a public-access dataset: anyone listening to this could go download data from the Common Crawl website. This is a crawler that is maintained publicly. Other tech companies eventually shift to their own crawler, and DeepSeek likely has done this as well, as most frontier labs do. But this sort of data is something that people can get started with, and you're just predicting text in a series of documents.

This can be scaled to be very efficient, and there are a lot of numbers that are thrown around in AI training, like how many floating point operations, or FLOPS, are used, and then you can also look at how many hours of these GPUs are used. But it's largely one loss function, taken to a very large amount of compute usage: you just set up really efficient systems, and then at the end of that, you have this base model.

And post-training is where there is a lot more complexity, in terms of how the process is emerging or evolving and the different types of training losses we'll use.
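Before getting into post-training, here is a toy sketch of that one pre-training loss function: autoregressive next-token prediction with cross-entropy. All shapes and names are illustrative stand-ins, not anyone's actual training code.

```python
# Toy sketch of the pre-training objective: each position is trained to
# predict the *next* token in the document, via cross-entropy loss.
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 1000, 16, 4
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for web text

# Pretend `logits` came from a transformer reading tokens[:, :-1].
logits = torch.randn(batch, seq_len - 1, vocab_size, requires_grad=True)

loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),  # predictions for positions 0..n-2
    tokens[:, 1:].reshape(-1),       # targets are the same tokens shifted by one
)
loss.backward()  # one loss function, scaled up to trillions of tokens
```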
I think this is a lot of techniques grounded in the natural language processing literature. The oldest technique, which is still used today, is something called instruction tuning, also known as supervised fine-tuning. The acronyms will be IFT or SFT; people really go back and forth between them, and I will probably do the same. This is where you add formatting to the model, where it knows to take a question that is like, "Explain the history of the Roman Empire to me," or some sort of question you'll see on Reddit or Stack Overflow, and then the model will respond in an information-dense but presentable manner. The core of that formatting comes from this instruction tuning phase.
functions that are being used today one
I will classify as preference fine
tuning preference fine tuning is a
generalized term for what came out of
reinforcement learning from Human
feedback which is rhf this reinforce
learning from Human feedback is credited
as the technique that helped uh chat GPT
break through it is a technique to make
the responses that are nicely formatted
like these Reddit answers more in tune
with what a human would like to read
this is done by collecting parse
preferences from actual humans out in
the world to start and now AIS are also
labeling this data and we'll get into
those trade-offs and you have this kind
of contrastive loss function between a
good answer and a bad answer and the
model learns to pick up these Trends
there's different implementation ways
you have things called reward models you
could have direct alignment algorithms
there's a lot of really specific things
you can do but all of this is about
fine-tuning to human
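Here is a sketch of that pairwise, "contrastive" loss in the Bradley-Terry style used for reward models. The tensors are random stand-ins for the scalar scores a reward model would assign, so this shows only the shape of the idea.

```python
# Sketch of the pairwise loss at the heart of preference fine-tuning.
import torch
import torch.nn.functional as F

# Scalar scores for the human-preferred ("chosen") and dispreferred
# ("rejected") answer to the same prompt, one pair per batch element.
chosen_scores = torch.randn(8, requires_grad=True)
rejected_scores = torch.randn(8, requires_grad=True)

# loss = -log(sigmoid(chosen - rejected)): minimizing it pushes the model
# to score the human-preferred answer higher than the rejected one.
loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
loss.backward()
# Direct alignment algorithms like DPO apply the same contrastive idea to
# the policy's log-probabilities instead of a separate reward model's scores.
```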
And the final stage is much newer, and it will link to what is done in R1 and these reasoning models. This is, I think, OpenAI's name for it: they had this new API in the fall, which they called the reinforcement fine-tuning API. This is the idea that you use the techniques of reinforcement learning, which is a whole framework of AI — there's a deep literature here. To summarize: it's often known as trial-and-error learning, or the subfield of AI where you're trying to make sequential decisions in a certain, potentially noisy, environment. There are a lot of ways we could go down that, but: you're fine-tuning language models where they can generate an answer, and then you check to see if the answer matches the true solution. For math or code, you have an exactly correct answer for math, and you can have unit tests for code. What we are doing is checking the language model's work, and we're giving it multiple opportunities on the same questions to see if it is right. And if you keep doing this, the models can learn to improve in verifiable domains to a great extent. It works really well. It's a newer technique in the academic literature; it's been used at frontier labs in the US that don't share every detail for multiple years. So this is the idea of using reinforcement learning with language models, and it has been taking off, especially in this DeepSeek moment.

And we should say that there's a lot of exciting stuff going on, again, across the stack, but in post-training, probably this year, there are going to be a lot of interesting developments. We'll talk about it.
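A toy sketch of that verifiable-reward loop: sample several answers to the same question, grade each against a ground-truth checker, and reinforce whatever led to correct answers. The answer-extraction regex and strings are made up for illustration, and the actual policy-gradient update is only described in the comments.

```python
# Toy sketch of RL with verifiable rewards: grade sampled answers
# against a checkable ground truth.
import re

def reward(model_output: str, true_answer: str) -> float:
    # For math you can check an exact final answer; for code you'd run unit tests.
    match = re.search(r"answer is (-?\d+)", model_output)
    return 1.0 if match and match.group(1) == true_answer else 0.0

question = "What is 13 * 7? Show your work."
samples = [
    "13 * 7 = 91. The answer is 91",  # correct -> reinforced
    "Let me guess. The answer is 84", # wrong   -> discouraged
]
rewards = [reward(s, "91") for s in samples]
print(rewards)  # [1.0, 0.0] -- multiple tries at the same question, graded

# A policy-gradient step would then increase the probability of the tokens
# in the high-reward samples; repeated over many questions, the model
# learns to reason its way to checkable answers.
```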
I almost forgot to talk about the difference between DeepSeek V3 and R1 on the user experience side. So forget the technical stuff, forget all that. People that don't know anything about AI show up: what's the actual experience, what's the use case for each one when they actually type and talk to it? What is each good at, and that kind of thing?

So let's start with DeepSeek V3. Again, it's what more people would have tried something like. You ask it a question; it'll start generating tokens very fast, and those tokens will look like a very human-legible answer. It'll be some sort of markdown list; it might have formatting to help you draw your eye to the core details in the answer. And it'll generate tens to hundreds of tokens — a token is normally a word, for common words, or a sub-word part of a longer word — and it'll look like a very high-quality Reddit or Stack Overflow answer. These models are really getting good at doing this across a wide variety of domains. Even things that, if you're an expert, are close to the fringe of knowledge, they will still be fairly good at. Cutting-edge AI topics that I do research on — these models are capable as a study aid, and they're regularly updated.

This changes with DeepSeek R1, what is called these reasoning models. When you see tokens coming from these models, to start, it will be a large chain-of-thought process — we'll get back to chain of thought in a second — which looks like a lot of tokens where the model is explaining the problem. The model will often break down the problem and be like: "Okay, they asked me for this. Let's break down the problem. I'm going to need to do this." And you'll see all of this generating from the model. It'll come very fast in most user experiences — these APIs are very fast — so you'll see a lot of tokens, a lot of words, show up really fast. It'll keep flowing on the screen, and this is all the reasoning process. And then eventually the model will change its tone in R1, and it'll write the answer, where it summarizes its reasoning process and writes a similar answer to the first type of model.

But in DeepSeek's case — which is part of why this was so popular, even outside the AI community — you can see how the language model is breaking down problems, and then you get this answer. On the technical side, they train the model to do this specifically: it has a section which is reasoning, and then it generates a special token, which is probably hidden from the user most of the time, which says, "Okay, I'm starting the answer." So the model is trained to do this two-stage process on its own. If you use a similar model in, say, OpenAI, OpenAI's user interface is trying to summarize this process for you nicely, by kind of showing the sections that the model is doing, and it'll kind of click through: it'll say "Breaking down the problem," "Making a calculation," "Cleaning the result," and then the answer will come, for something like OpenAI.
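A sketch of that two-stage output format: R1-style models wrap their reasoning in special "think" tags before the final answer. The exact token strings vary by model, so treat these as illustrative.

```python
# Sketch of the two-stage reasoning/answer format described above.
raw_output = (
    "<think>They asked for the capital. Let me break down the problem... "
    "France's capital has been Paris since...</think>"
    "The capital of France is Paris."
)

reasoning, _, answer = raw_output.partition("</think>")
reasoning = reasoning.removeprefix("<think>")

print("Chain of thought:", reasoning[:50], "...")
print("Final answer:   ", answer)
# A chat app can show `reasoning` behind a drop-down (as the DeepSeek app
# does) or replace it with a summary (as OpenAI's interface does).
```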
Maybe it's useful here to go through an example of DeepSeek R1 reasoning.

Yeah. So if you're looking at the screen here, what you'll see is a screenshot of the DeepSeek chat app, and at the top is "Thought for 157 seconds," with a drop-down arrow. Underneath that, if we were in an app that we were running, the drop-down arrow would have the reasoning. So in this case, the specific question — which, you know, I'm philosophically-slash-pothead-inclined — is asking DeepSeek R1 for one truly novel insight about humans. And it reveals the reasoning, and basically the truly novel aspect was pushing the reasoning: the model constantly asking itself, is this truly novel? So it's actually challenging itself to be more novel, more counterintuitive, less cringe, I suppose.

Some of the reasoning — these are just snapshots — says: "Alternatively, humans have a unique meta-emotion, where they feel emotions about their own emotions, e.g., feeling guilty about being angry. This recursive emotional layering creates complex motivational drives that don't exist in other animals. The insight is that human emotions are nested." So it's reasoning through how humans feel emotions; it's reasoning about meta-emotions. It's going to have pages and pages of this. It's almost too much to actually read, but it's nice to skim as it's coming. It's a stream of — it's a James Joyce stream of consciousness. And then it goes: "Wait, the user wants something that's not seen anywhere else. Let me dig deeper and consider the human ability to hold contradictory beliefs simultaneously. Cognitive dissonance is known, but perhaps the function is to allow flexible adaptation," and so on and so forth.

I mean, that really captures the public imagination — that this is, I mean, intelligent, slash, almost like an inkling of sentience, because you're thinking through, you're self-reflecting, you're deliberating. And the final result of that, after 157 seconds, is: "Humans instinctively convert selfish desires into cooperative systems by collectively pretending abstract rules — money, laws, rights — are real. These shared hallucinations act as quote 'games' where competition is secretly redirected to benefit the group, turning conflict into society's fuel." Pretty profound.

I mean, you know, this is a bit of a digression, but a lot of people have found that these reasoning models can sometimes produce much more eloquent text. That is at least an interesting example. I think, depending on how open-minded you are, you find language models interesting or not, and there's a spectrum there.

Well, I mean, we'll talk about different benchmarks and so on, but some of it is just a vibe. Like, that in itself is a, let's say, quote, "fire tweet," yeah? If I'm trying to produce something where people are like, "Ooh, okay." So that's chain of thought. We'll probably return to it more.
How are they able to achieve such low cost on the training and the inference? Maybe you could talk about the training first.

Yeah, so there are two main techniques that they implemented that are probably the majority of their efficiency, and then there are a lot of implementation details that maybe we'll gloss over or get into later that sort of contribute to it. But those two main things are: one, they went to a mixture-of-experts model, which we'll define in a second; and then the other thing is that they invented this new technique called MLA, multi-head latent attention. Both of these are big deals.

Mixture of experts is something that's been in the literature for a handful of years, and OpenAI, with GPT-4, was the first one to productize a mixture-of-experts model. What this means is: when you look at the common models around that most people have been able to interact with that are open weights — think Llama — Llama is a dense model, i.e., every single parameter, or neuron, is activated as you're going through the model, for every single token you generate. Now, with a mixture-of-experts model, you don't do that. How does a human actually work? It's like, oh, well, my visual cortex is active when I'm thinking about vision tasks; my amygdala is when I'm scared. These different aspects of your brain are focused on different things. A mixture-of-experts model attempts to approximate this, to some extent. It's nowhere close to what a brain architecture actually is, but different portions of the model activate. You'll have a set number of experts in the model, and a set number that are activated each time, and this dramatically reduces both your training and inference cost.

Because now, if you think about the parameter count as the total embedding space for all of this knowledge that you're compressing down during training, then when you're embedding this data in, instead of having to activate every single parameter every single time you're training or running inference, now you can just activate a subset, and the model will learn which expert to route to for different tasks. And so this is a humongous innovation, in terms of: hey, I can continue to grow the total embedding space of parameters. And so DeepSeek's model is 600-something billion parameters. Relative to Llama 405B, which is 405 billion parameters, and relative to Llama 70B, which is 70 billion parameters, this model technically has more embedding space for information, to compress all of the world's knowledge that's on the internet down. But at the same time, it is only activating around 37 billion of the parameters, so only 37 billion of these parameters actually need to be computed every single time you're training on data or inferencing data out of it. Versus, again, for the Llama models, 70 billion parameters must be activated, or 405 billion parameters must be activated. So you've dramatically reduced your compute cost when you're doing training and inference with this mixture-of-experts architecture.

Should we break down where it actually applies and go into the transformer? Is that useful?
Let's go into the transformer. The transformer is a thing that is talked about a lot, and we will not cover every detail. Essentially, the transformer is built on repeated blocks of this attention mechanism, and then a traditional dense, fully connected multi-layer perceptron — whatever word you want to use for your normal neural network — and you alternate these blocks. There are other details, but where mixture of experts is applied is at this dense part: the dense model holds most of the weights, if you count them, in a transformer model. So you can get really big gains from mixture of experts on parameter efficiency, at training and inference, because you get this efficiency by not activating all of these parameters.
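Here is a minimal mixture-of-experts layer to make the routing idea concrete. Only the top-k experts run per token, so compute scales with k rather than with the total expert count. The sizes are toy numbers (the transcript notes DeepSeek routes each token to 8 of 256 experts, activating ~37B of 600-something billion parameters); the class and its loops are illustrative, not an efficient implementation.

```python
# A toy mixture-of-experts layer: a learned router picks k experts per token.
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, dim=64, n_experts=16, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)  # learns which expert fits which token
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.k = k

    def forward(self, x):  # x: (tokens, dim)
        scores = self.router(x).softmax(dim=-1)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for token in range(x.shape[0]):
            for score, idx in zip(topk_scores[token], topk_idx[token]):
                # Only k of n_experts weight matrices are touched for this token.
                out[token] += score * self.experts[int(idx)](x[token])
        return out

moe = ToyMoE()
print(moe(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```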
We should also say that a transformer is a giant neural network. And then, for 15 years now, there's been what's called the deep learning revolution: networks have gotten larger and larger. At a certain point, the scaling laws appeared — this is a scaling laws shirt, by the way, representing scaling laws — where it became more and more formalized that bigger is better, across multiple dimensions of what "bigger" means. But these are all neural networks we're talking about, and we're talking about different architectures for how to construct these neural networks such that the training and the inference on them is super efficient.
Yeah, every different type of model has a different scaling law for it, which is effectively: for how much compute you put in, the architecture will get to different levels of performance at test tasks. And mixture of experts is one of the ones where, at training time — even if you don't consider the inference benefits, which are also big — your efficiency with your GPUs is dramatically improved by using this architecture, if it is well implemented. So you can get effectively the same performance model, in evaluation scores, with numbers like 30% less compute. I think there's going to be wide variation depending on your implementation details and such, but it is just important to realize that this type of technical innovation is something that gives huge gains, and I expect most companies that are serving their models to move to this mixture-of-experts implementation. Historically, the reason why not everyone might do it is that it's an implementation complexity, especially when doing these big models. So this is one of the things DeepSeek gets credit for: they do this extremely well. They do mixture of experts extremely well.
This architecture, for what is called DeepSeekMoE — MoE is the shortened version of mixture of experts — is multiple papers old. This part of their training infrastructure is not new to these models alone. And the same goes for what Dylan mentioned with multi-head latent attention: this is all about reducing memory usage during inference, and the same during training, by using some fancy low-rank approximation math. If you get into the details with this latent attention, it's one of those things where I look at it and think, okay, they're doing really complex implementations. Because there are other parts of language models, such as embeddings, that are used to extend the context length. The common one that DeepSeek used is rotary positional embeddings, which is called RoPE. And if you want to use RoPE with a normal MoE, it's kind of a sequential thing: you take two of the attention matrices and you rotate them by a complex-value rotation, which is a matrix multiplication. With DeepSeek's MLA, with this new attention architecture, they need to do some clever things, because they're not set up the same, and it just makes the implementation complexity much higher.
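For reference, here is a compact sketch of that RoPE rotation: pairs of query/key channels are treated as complex numbers and rotated by a position-dependent angle. Dimensions are toy-sized, and real implementations fuse and cache this carefully.

```python
# Sketch of rotary positional embeddings (RoPE): rotate channel pairs
# by an angle proportional to the token's position.
import torch

def rope(x, base=10000.0):
    # x: (seq_len, dim) with dim even; each channel pair becomes a complex number.
    seq_len, dim = x.shape
    freqs = torch.pow(base, -torch.arange(0, dim, 2) / dim)   # per-pair frequency
    angles = torch.outer(torch.arange(seq_len), freqs)        # position * frequency
    rotation = torch.polar(torch.ones_like(angles), angles)   # e^{i * angle}
    x_complex = torch.view_as_complex(x.reshape(seq_len, dim // 2, 2).contiguous())
    return torch.view_as_real(x_complex * rotation).reshape(seq_len, dim)

q = torch.randn(8, 16)
print(rope(q).shape)  # torch.Size([8, 16]) -- queries rotated by their position

# MLA compresses queries/keys into a latent space, so this rotation can't be
# applied in the standard spot -- part of the extra implementation complexity
# described above.
```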
So they're managing all of these things, and these are probably the sorts of things that OpenAI and these closed labs are doing — we don't know if they're doing the exact same techniques — but DeepSeek actually shared them with the world, which is really nice. This is the cutting edge of efficient language model training, and some of it requires low-level engineering that is just a giant mess and trickery.
So, as I understand it, they went below CUDA — they went to super-low-level programming of GPUs.

Effectively, NVIDIA builds this library called NCCL, in which, when you're training a model, you have all these communications between every single layer of the model, and you may have over 100 layers.

What does NCCL stand for?

It's NCCL: NVIDIA Collective Communications Library.

Nice.

When you're training a model, you're going to have all these all-reduces and all-gathers. Between each layer — between the multi-layer perceptron, or feed-forward network, and the attention mechanism — you'll have the model synchronized: you'll have an all-reduce and an all-gather. And this is a communication between all the GPUs in the network, whether it's in training or inference. So NVIDIA has a standard library — this is one of the reasons why it's really difficult to use anyone else's hardware for training: no one's really built a standard communications library — and NVIDIA's done this at a sort of higher level.

DeepSeek, because they have certain limitations around the GPUs that they have access to — the interconnects are limited to some extent by the restrictions of the GPUs that were shipped into China legally, not the ones that are smuggled, but the legally shipped ones that they used to train this model — had to figure out how to get efficiencies. And one of those things is that instead of just calling the NVIDIA library, NCCL, they instead scheduled their own communications, which some of the labs do. Meta talked about, in Llama 3, how they made their own custom version of NCCL — they didn't talk about the implementation details; this is some of what they did — probably not as well as DeepSeek, because necessity is the mother of innovation, and DeepSeek had to do this, whereas, you know, OpenAI has people that do this sort of stuff, Anthropic, etc. But DeepSeek certainly did it publicly, and they may have done it even better, because they were gimped on a certain aspect of the chips that they have access to.

And so they scheduled communications by scheduling specific SMs. SMs you could think of as, like, the cores on a GPU: there are hundreds of cores, or a bit over a hundred of these SMs, on a GPU. And they were specifically scheduling: hey, which ones are running the model, which ones are doing all-reduce, which ones are doing all-gather, and they would flip back and forth between them. This requires extremely low-level programming.

This is what NCCL does automatically, or other NVIDIA libraries handle automatically, usually?

Yeah, exactly.
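A sketch of the collective operations NCCL provides. This runs a one-process "gloo" group purely to show the API shape; real training runs the NCCL backend across many GPUs, synchronizing gradients between every layer, and the port number below is arbitrary.

```python
# Sketch of the all-reduce / all-gather collectives discussed above,
# using torch.distributed with a trivial single-process group.
import torch
import torch.distributed as dist

dist.init_process_group(
    backend="gloo", init_method="tcp://127.0.0.1:29500",
    rank=0, world_size=1,
)

grad = torch.ones(4)                          # stand-in for one layer's gradients
dist.all_reduce(grad, op=dist.ReduceOp.SUM)   # every rank ends with the sum
print(grad)

gathered = [torch.zeros(4)]                   # one slot per rank in the group
dist.all_gather(gathered, grad)               # every rank ends with all tensors
print(gathered)

dist.destroy_process_group()
# DeepSeek's trick was scheduling these by hand -- pinning specific GPU SMs
# to communication vs. compute -- instead of relying on the library defaults.
```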
And so, technically, they're using PTX, which is — you could think of it as sort of an assembly-type language; it's not exactly that — or an instruction set, like coding directly to assembly or the instruction set; it's not exactly that either, and it's still technically part of CUDA. But it's like: do I want to write in Python, the PyTorch equivalent, and call NVIDIA libraries? Do I want to go down to the CUDA level and code even lower-level? Or do I want to go all the way down to the assembly, or ISA, level? And there are cases where you go all the way down there at the very big labs, but most companies just do not do that, because it's a waste of time, and the efficiency gains you get are not worth it.

But DeepSeek's implementation is so complex, especially with their mixture of experts. People have done mixtures of experts, but they're generally 8 or 16 experts, and they activate two. So one of the words we like to use is "sparsity factor," or usage. So you might have one-fourth of your model activate, and that's what Mistral's Mixtral model did — their model that really catapulted them to, like, oh my god, they're really, really good. OpenAI has also had models that are MoE, and so have all the other labs that are major and closed. But what DeepSeek did, that maybe only the leading labs have only just recently started doing, is have such a high sparsity factor. It's not one-fourth of the model — two out of eight experts — activating every time you go through the model; it's eight out of 256.
And there are different implementations of mixture of experts where you can have some of these experts always activated, which just looks like a small neural network that all the tokens go through, and then they also go through some experts that are selected by this routing mechanism. One of the innovations in DeepSeek's architecture is that they changed the routing mechanism in mixture-of-experts models. There's something called an auxiliary loss, which effectively means that, during training, you want to make sure that all of these experts are used across the tasks that the model sees. Where there can be failures in mixture of experts is that, when you're doing this training, the one objective is token prediction accuracy, and if you just let training go with a mixture-of-experts model on its own, it can be that the model learns to only use a subset of the experts. In the MoE literature, there's something called the auxiliary loss, which helps balance them. But if you think about the loss functions of deep learning, this even connects to the bitter lesson: you want to have the minimum inductive bias in your model, to let the model learn maximally, and this auxiliary loss, this balancing across experts, could be seen as in tension with the prediction accuracy of the tokens.

So — we don't know the exact extent of the DeepSeek change — but instead of doing an auxiliary loss, they have an extra parameter in their routing, and after the batches, they update this parameter to make sure that the next batches all have a similar use of experts. This type of change can be big, it can be small, but they add up over time, and this is the sort of thing that just points to them innovating. I'm sure all the labs that are training big MoEs are looking at this sort of thing — which is getting away from the auxiliary loss; some of them might already use it — but you just keep accumulating gains.
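Here is a sketch in the spirit of that auxiliary-loss-free balancing, as described above: a per-expert bias is added to the routing scores only when picking the top-k, and is nudged between batches so overloaded experts get picked less. The exact update rule and constants here are simplified assumptions, not DeepSeek's actual implementation.

```python
# Sketch of bias-based expert load balancing, replacing an auxiliary loss.
import torch

n_experts, k, gamma = 8, 2, 0.01
bias = torch.zeros(n_experts)  # the extra routing parameter, not trained by the loss

def route(scores):
    # The bias influences *which* experts win, not the mixing weights themselves.
    return (scores + bias).topk(k, dim=-1).indices

for step in range(100):
    scores = torch.rand(32, n_experts)  # stand-in router outputs for 32 tokens
    chosen = route(scores)
    load = torch.bincount(chosen.flatten(), minlength=n_experts).float()
    # After the batch: lower the bias of overused experts, raise underused ones,
    # so the next batch spreads tokens more evenly -- with no auxiliary loss
    # term fighting the token-prediction objective.
    bias -= gamma * torch.sign(load - load.mean())

print(torch.bincount(route(torch.rand(1000, n_experts)).flatten(), minlength=n_experts))
```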
We'll talk about the philosophy of training and how you organize these organizations, but a lot of it is just compounding small improvements over time: in your data, in your architecture, in your post-training, and in how they integrate with each other. DeepSeek does the same thing, and some of them are shared — or a lot. We have to take them at face value that they share their most important details. I mean, the architecture and the weights are out there, so we're seeing what they're doing, and it adds up.

Going back to the efficiency and complexity point: it's 32 versus 4, compared to Mixtral and other MoE models that have been publicly released. So this ratio is extremely high, and what Nathan was getting at there is: when you have such a different level of sparsity, you can't just have every GPU hold the entire model. The model's too big; there's too much complexity there. So you have to split up the model with different types of parallelism. You might have different experts on different GPU nodes, but now what happens when this set of data that you get — hey, all of it looks like this one way, and all of it should route to one part of my model? When all of it routes to one part of the model, then you can have this overloading of a certain set of the GPU resources, or a certain set of the GPUs, and then the rest of the training network sits idle, because all of the tokens are just routing to that.

So this is the biggest complexity — one of the big complexities — with running a very sparse mixture-of-experts model, i.e., this 32 ratio versus this 4 ratio: you can end up with so many of the experts just sitting there idle. So how do I load-balance between them? How do I schedule the communications between them? This is a lot of the extremely low-level, detailed work that they figured out in the public first, and potentially, like, second or third in the world, and maybe even first in some cases.
What lesson, in the direction of the bitter lesson, do you take from all of this? Is a lot of the gain going to be in this kind of low-level optimization, or is this a short-term thing, where the biggest gains will be more on the algorithmic, high-level side, like post-training? Is this a short-term leap, because they figured out a hack — because constraints, necessity being the mother of invention — or is there still a lot of gains there?

I think we should summarize what the bitter lesson actually is about. The bitter lesson, essentially, if you paraphrase it, is that the types of training that will win out in deep learning, as we go, are those methods which are scalable in learning and search, is what it calls out. The "scale" word gets a lot of attention in this. The interpretation that I use is, effectively: avoid adding the human priors to your learning process. And if you read the original essay, this is what it talks about: how researchers will try to come up with clever solutions to their specific problem that might get them small gains in the short term, while simply enabling these deep learning systems to work efficiently, and for these bigger problems, in the long term, might be more likely to scale and continue to drive success.

And so — we were talking about relatively small implementation changes to the mixture-of-experts model — it's like, okay, we will need a few more years to know if one of these was actually really crucial to the bitter lesson. But the bitter lesson is really this long-term arc of how simplicity can often win. There are a lot of sayings in the industry, like, "The models just want to learn." You have to give them the simple loss landscape where you put compute through the model, and they will learn — get barriers out of the way. That's where the power of something like NCCL comes in: standardized code that can be used by a lot of people to create sort of simple innovations that can scale.
Which is why the hacks — I imagine the code base for DeepSeek is probably a giant mess.

I'm sure DeepSeek definitely has code bases that are extremely messy, where they're testing these new ideas. Multi-head latent attention probably could start in something like a Jupyter notebook, or somebody tries something on a few GPUs, and that is really messy. But the stuff that trains DeepSeek V3 and DeepSeek R1 — those libraries, if you were to present them to us, I would guess, are extremely high-quality code. High-quality, readable code.

I think there is one aspect to note, though: there's the generalizability, the ability for that to transfer across different types of runs. You may make really, really high-quality code for one specific model architecture at one size, and then that is not transferable to, hey, when I make this architecture tweak, everything's broken again. That's something that could be the case with their specific low-level coding of, like, scheduling SMs: it's specific to this model architecture and size. Whereas NVIDIA's collectives library is more like, hey, it'll work for anything. You want to do an all-reduce? Great, I don't care what your model architecture is, it'll work. You're giving up a lot of performance when you do that, in many cases, but it's worth it for them to do the specific optimization for the specific run, given the constraints that they have regarding compute.
they have regarding compute I wonder how
stressful it is to like you know these
Frontier models like initiate training
like to have the
code to push the button that like you're
now spending a large amount of money and
time to train this like there must I
mean there must be a lot of innovation
on the debugging stage of like making
sure there's no know issues that you're
monitoring and visualizing every aspect
of the training all that kind of stuff
when when people are training they have
all these various dashboards but like
the most simple one is your loss right
uh and it continues to go down but in
reality especially with more complicated
stuff likee the biggest problem with it
or FPA training which is another
Innovation you know going to a lower
Precision number format I.E less
accurate is that you end up with lost
bikes right and and no one knows why the
Lost bike happen and for long some of
them you do some of them you do some of
them are data I give a ai's example of
what blew up our earlier models is a
subreddit called microwave gang we love
to shout this out it's a real thing you
can pull up microwave gang essentially
it's a subreddit where everybody makes
posts that are just the letter M so it's
like so there's extremely long sequences
of the letter M and then the comments
are like beep beep because that's when
the microwave ends but if you pass this
into a model that's trained to be a
normal producing text it's extremely
high loss because normally you see an M
you don't predict M's for a long time so
like this is something that causes a l
spikes for us but when you have much
like this is this is old this is not
recent and when you have more mature
Data Systems that's not the thing that
causes the LW Spike and what Dylan is
saying is true but it's like it's it's
levels to this sort of idea with regards
to the stress right these people are
like you know you'll go out to dinner
with like a friend that works at one of
these labs and they'll just be they'll
just be like looking at their phone
every like 10 minutes and they're not
like you know it's one thing if they're
texting but they're just like like is
the Lost is the L
tokens tokens per second lost not blown
up they're just walking watching this
and the heart rate goes up if there's a
And some level of spikes is normal: it'll recover and be back. Sometimes, a lot of the old strategy was: you just stop the run, restart from the old version, change the data mix, and then it keeps going.

There are even different types of spikes. So Dirk Groeneveld at Ai2 has a theory that there are fast spikes and slow spikes. There are sometimes where you're looking at the loss and other parameters, and you can see it start to creep up and then blow up, and that's really hard to recover from, so you have to go back much further. So you have the stressful period where it's, like, flat, or might start going up, and you're like, what do I do? Whereas there are also loss spikes where it looks good, and then there's one spiky data point. And what you can do is just skip those: you see that there's a spike, you're like, okay, I can ignore this data — don't update the model — and do the next one, and it'll recover quickly. But these are trickier implementations. So as you get more complex in your architecture and you scale up to more GPUs, you have more potential for your loss blowing up.
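A sketch of that "skip the spiky batch" recovery trick: if the loss on a batch blows past a running threshold, drop the update instead of letting one bad batch wreck the model. The threshold logic and numbers are toy assumptions, not anyone's production heuristic.

```python
# Toy sketch: detect a single-batch loss spike and skip the optimizer step.
running_loss, beta, spike_factor = None, 0.99, 3.0

def should_skip(loss: float) -> bool:
    global running_loss
    if running_loss is None:
        running_loss = loss
        return False
    spike = loss > spike_factor * running_loss
    if not spike:  # only fold well-behaved batches into the running average
        running_loss = beta * running_loss + (1 - beta) * loss
    return spike

for batch_loss in [2.1, 2.0, 1.9, 9.5, 1.8]:  # 9.5 is the single spiky batch
    if should_skip(batch_loss):
        print(f"loss={batch_loss}: spike, skipping optimizer step")
    else:
        print(f"loss={batch_loss}: ok, update model")
```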
There's a distribution there. And the whole idea of grokking also comes in: just because the loss slowed down from improving doesn't mean the model's not learning, because all of a sudden it could just spike down in loss again, because it truly learned something, and it took some time for it to learn that. It's not a gradual process, and that's what humans are like; that's what models are like. So it's really a stressful task, as you mentioned, and the whole time, the dollar count is going up.

Every company has failed runs. You need failed runs to push the envelope on your infrastructure. So a lot of news cycles are made of "X company had Y failed run." Every company that's trying to push the frontier of AI has these. So, yes, it's noteworthy, because it's a lot of money, and it can be a week-to-month setback, but it is part of the process.
But how do you get — if you're DeepSeek — how do you get to a place where there's a successful combination of hyperparameters?

A lot of small failed runs.

So, rapid iteration through failed runs, until —

And successful ones. You build up intuitions: this mixture of experts works, and then this implementation of MLA works, with key hyperparameters like learning rate and regularization and things like this, and you find the regime that works for your code base. Talking to people at frontier labs, there's a story you can tell where training language models is kind of a path that you need to follow: you need to unlock the ability to train a certain type of model, or at a certain scale, and then your code base and your internal know-how of what type of parameters work for it is kind of known. And you look at the DeepSeek papers and models: they've scaled up, they've added complexity, and it's just continuing to build the capabilities that they have.
There's the concept of a YOLO run. YOLO: you only live once. And what it is, is there's all this experimentation you do at the small scale — research ablations. You have your Jupyter notebook, where you're experimenting with MLA on, like, three GPUs or whatever, and you're doing all these different things, like: hey, do I do four active experts, 128 experts? Do I arrange the experts this way? All these different model architecture things that you're testing at a very small scale — a couple of researchers, a few GPUs, tens of GPUs, hundreds of GPUs, whatever it is.

And then, all of a sudden, you're like, okay guys, no more screwing around. Everyone take all the resources we have; let's pick what we think will work and just go for it: YOLO. And this is where that sort of stress comes in: well, I know it works here, but some things that work here don't work there, and some things that work there don't work down here, in terms of scale. So it's really, truly a YOLO run. And there is this discussion of how certain researchers just have this methodical nature: they can find the whole search space and figure out all the ablations of different research and really see what is best. And there are certain researchers who just kind of...