DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters | Lex Fridman Podcast #459
_1f-o0nqpEI • 2025-02-03
The following is a conversation with Dylan Patel and Nathan Lambert. Dylan runs SemiAnalysis, a well-respected research and analysis company that specializes in semiconductors, GPUs, CPUs, and AI hardware in general. Nathan is a research scientist at the Allen Institute for AI and is the author of the amazing blog on AI called Interconnects. They are both highly respected and listened to by the experts, researchers, and engineers in the field of AI, and personally, I'm just a fan of the two of them. So I used the DeepSeek moment that shook the AI world a bit as an opportunity to sit down with them and lay it all out, from DeepSeek, OpenAI, Google, xAI, Meta, Anthropic, to NVIDIA and TSMC, and to US, China, Taiwan relations, and everything else that is happening at the cutting edge of AI. This conversation is a deep dive into many critical aspects of the AI industry. While it does get super technical, we try to make sure that it's still accessible to folks outside of the AI field by defining terms, stating important concepts explicitly, spelling out acronyms, and, in general, always moving across the several layers of abstraction and levels of detail. There is a lot of hype in the media about what AI is and isn't. The purpose of this podcast, in part, is to cut through the hype and the low-resolution analysis, and to discuss in detail how stuff works and what the implications are. Let me also, if I may, comment on the new OpenAI o3-mini reasoning model, the release of which we were anticipating during the conversation, and it did indeed come out right after. Its capabilities and costs are on par with our expectations, as we stated. OpenAI o3-mini is indeed a great model, but it should be stated that DeepSeek R1 has similar performance on benchmarks, is still cheaper, and it reveals its chain-of-thought reasoning, which o3-mini does not; it only shows a summary of the reasoning. Plus, R1 is open weight and o3-mini is not. By the way, I got a chance to play with o3-mini, and anecdotal-vibe-check-wise, I felt that o3-mini, specifically o3-mini-high, is better than R1. Still, for me personally, I find that Claude Sonnet 3.5 is the best model for programming, except for tricky cases where I will use o1 Pro to brainstorm. Either way, many more better AI models will come, including reasoning models, both from American and Chinese companies. They will continue to shift the cost curve, but the quote "DeepSeek moment" is indeed real. I think it will still be remembered five years from now as a pivotal event in tech history, due in part to the geopolitical implications, but for other reasons too, as we discuss in detail from many perspectives in this conversation. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Dylan Patel and Nathan Lambert. A lot of people are curious to understand China's DeepSeek models, so let's lay it out. Nathan, can you describe what DeepSeek V3 and DeepSeek R1 are, how they work, how they're trained? Let's look at the big picture, and then we'll zoom in on the details. Yeah, so DeepSeek V3 is a new mixture-of-experts Transformer language model from DeepSeek, who is based in China. They have some new specifics in the model that we'll get into. Largely, this is an open-weight model, and it's an instruction model, like what you would use in ChatGPT. They also release what is called the base model, which is before these techniques of post-training. Most people use instruction models today, and those are what's served in all sorts of applications.
This was released on, I believe, December 26th, or that week, and then weeks later, on January 20th, DeepSeek released DeepSeek R1, which is a reasoning model, which really accelerated a lot of this discussion. This reasoning model has a lot of overlapping training steps to DeepSeek V3, and it's confusing that you have a base model called V3 that you do some things to to get a chat model, and then you do some different things to get a reasoning model. I think a lot of the AI industry is going through this challenge of communications right now, where OpenAI makes fun of their own naming scheme: they have GPT-4, they have OpenAI o1, and there are a lot of types of models, so we're going to break down what each of them are. There are a lot of technical specifics on training, and we'll go from high level to specific and kind of go through each of them. There are so many places we can go here, but maybe let's go to open weights first. What does it mean for a model to be open weights, and what are the different flavors of open source in general? Yeah, so this discussion has been going on for a long time in AI. It became more important since ChatGPT, or more focal since ChatGPT, at the end of 2022. Open weights is the accepted term for when the model weights of a language model are available on the internet for people to download. Those weights can have different licenses, which is effectively the terms by which you can use the model. There are licenses that come from the history of open-source software; there are licenses that are designed by companies specifically. All of Llama, DeepSeek, Qwen, Mistral, these popular names in open-weight models, have some of their own licenses. It's complicated because not all the same models have the same terms. The big debate is on what makes a model open weight. It's like, why are we saying this term? It's kind of a mouthful. It sounds close to open source, but it's not the same. There's still a lot of debate on the definition and soul of open-source AI. Open-source software has a rich history on freedom to modify, freedom to take it and use it on your own, freedom from many restrictions on how you would use the software, and what that means for AI is still being defined. So, for what I do, I work at the Allen Institute for AI. We're a nonprofit, we want to make AI open for everybody, and we try to lead on what we think is truly open source. There's not full agreement in the community, but for us, that means releasing the training data, releasing the training code, and then also having open weights like this. And we'll get into the details of the models, and again and again, as we try to get deep into how the models were trained, we will say things like: the data processing, data filtering, data quality is the number one determinant of the model quality, and then a lot of the training code is the determinant of how long it takes to train and how fast experimentation is. So without fully open-source models, where you have access to this data, it is hard to know, or it's harder to replicate. We'll get into cost numbers for DeepSeek V3, mostly on GPU hours and how much you could pay to rent those yourselves, but without the data, the replication cost is going to be far, far higher, and the same goes for the code. We should also say that this is probably one of the more open models out of the frontier models. So, in this full spectrum, where the fullest open source is, like you said, open code, open data, open weights: this is not open code, this is probably not open data, and this is open weights. And the licensing is the MIT license, or,
I mean, there's some nuance in the different models, but it's towards the free end in terms of the open-source movement. These are kind of the good guys? Yeah, DeepSeek is doing fantastic work for disseminating understanding of AI. Their papers are extremely detailed in what they do, and for other teams around the world, they're very actionable in terms of improving your own training techniques. And we'll talk about licenses more: the DeepSeek R1 model has a very permissive license. It's the MIT license. That effectively means there are no downstream restrictions on commercial use, there are no use-case restrictions, you can use the outputs from the models to create synthetic data, and this is all fantastic. I think the closest peer is something like Llama, where you have the weights and you have a technical report, and the technical report is very good for Llama. One of the most-read PDFs of the year last year is the Llama 3 paper, but in some ways it's slightly less actionable. It has fewer details on the training specifics, like fewer plots, and so on. And the Llama 3 license is more restrictive than MIT, and then between the DeepSeek custom license and the Llama license, we could get into this whole rabbit hole. I think we'll make sure we go down the license rabbit hole before we do specifics. Yeah. And it should be stated that one of the implications of DeepSeek is that it puts pressure on Llama and everybody else, on OpenAI, to push towards open source. And that's the other side of open source that you mentioned, which is how much is published in detail about it. So how open are you with the insights behind the code? Like, how good are the technical reports? Are they hand-wavy, or is there actual detail in there? And that's one of the things that DeepSeek did well: they publish a lot of the details. Yeah, especially in the DeepSeek V3 paper, which is their pre-training paper. They were very clear that they are doing interventions on the technical stack that go at many different levels. For example, to get highly efficient training, they're making modifications at or below the CUDA layer for NVIDIA chips. I have never worked at that level myself, and there are only a few people in the world that do that very well, and some of them are at DeepSeek. These types of people are at DeepSeek and the leading American frontier labs, but there are not many places. To help people understand the other implication of open weights, there's a topic we'll return to often here: there's a fear that China, the nation, might have an interest in stealing American data, violating the privacy of American citizens. What can we say about open weights to help us understand what the weights are able to do, in terms of stealing people's data? Yeah, so these weights that you can download from Hugging Face or other platforms are very big matrices of numbers. You can download them to a computer in your own house that has no internet, and you can run this model, and you're totally in control of your data. That is something that is different from how a lot of language model usage is actually done today, which is mostly through APIs, where you send your prompt to GPUs run by certain companies, and these companies will have different policies on how your data is stored, if it is used to train future models, where it is stored, if it is encrypted, and so on. So with open weights, you have the fate of your data in your own hands, and that is something that is deeply connected to the soul of open source.
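For a concrete picture of what running open weights on your own machine looks like, here is a minimal sketch using the Hugging Face transformers library. The model id, prompt, and generation settings are illustrative choices, not anything specified in the conversation, and the full DeepSeek V3 checkpoint is far too large for a home machine, so a small open-weight model stands in for it.

```python
# Minimal sketch: running an open-weight model entirely on your own machine.
# Once the weights are downloaded, no prompt or output ever leaves this computer.
# The model id below is illustrative; any open-weight checkpoint on Hugging Face works,
# and the full DeepSeek V3 weights are hundreds of gigabytes, so a small model stands in here.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # small open-weight model that fits on a laptop
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Explain the history of the Roman Empire to me."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```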
So it's not the model that steals your data; it's whoever is hosting the model, which could be a Chinese company if you're using the DeepSeek app, or it could be Perplexity, you're trusting them with your data, or OpenAI, you're trusting them with your data. Some of these are American companies, some of these are Chinese companies, but the model itself is not doing the stealing; it's the host. All right, so back to the basics. What's the difference between DeepSeek V3 and DeepSeek R1? Can we try to lay out the potential confusion? Yes. So, for one, I am very understanding of many people being confused by these two model names. I would say the best way to think about this is that, when training a language model, you have what is called pre-training, which is when, over large amounts of mostly internet text, you're trying to predict the next token. And what to know about these new DeepSeek models is that they do this internet-scale pre-training once to get what is called DeepSeek V3 Base. This is a base model; it's just going to finish your sentences for you. It's going to be harder to work with than ChatGPT. And then what DeepSeek did is they've done two different post-training regimes to make the models have specific desirable behaviors. So the more normal model, in terms of the last few years of AI, an instruct model, a chat model, a quote-unquote aligned model, a helpful model, there are many ways to describe this, is more standard post-training. So this is things like instruction tuning, reinforcement learning from human feedback; we'll get into some of these words. And this is what they did to create the DeepSeek V3 model. This was the first model to be released, and it is very high-performing; it's competitive with GPT-4, Llama 405B, and so on. And then, when this release was happening, we don't know their exact timeline, or soon after, they were finishing the training of a different training process from the same next-token-prediction base model that I talked about, which is when this new reasoning training that people have heard about comes in, in order to create the model that is called DeepSeek R1. The R, through this conversation, you can ground as standing for reasoning. And the name is also similar to OpenAI o1, which is the other reasoning model that people have heard about. And we have to break down the training for R1 in more detail, because, for one, we have a paper detailing it, but also it is a far newer set of techniques for the AI community, so it is a much more rapidly evolving area of research. Maybe we should also say the big two categories of training, pre-training and post-training, these umbrella terms that people use. So what is pre-training, and what is post-training, and what are the different flavors of things underneath the post-training umbrella? Yeah, so pre-training, I'm using some of the same words to really get the message across: you're doing what is called autoregressive prediction to predict the next token in a series of documents. This is done over, standard practice is, trillions of tokens, so this is a ton of data that is mostly scraped from the web. In some of DeepSeek's earlier papers, they talk about their training data being distilled for math, and, I shouldn't use this word yet, but taken from Common Crawl. That's publicly accessible; anyone listening to this could go download data from the Common Crawl website. This is a crawler that is maintained publicly. Yes, other tech companies eventually shift to their own crawler, and DeepSeek likely has done this as well, as most frontier labs do, but this sort of data is something that people can get started with.
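As a rough illustration of the pre-training objective just described, predicting the next token over text, here is a toy sketch in PyTorch. The tiny stand-in model and random token data are placeholders, not DeepSeek's actual setup.

```python
# Toy sketch of autoregressive next-token prediction (the pre-training loss).
# Real pre-training runs this loss over trillions of tokens on thousands of GPUs;
# the tiny model and random "documents" here are placeholders for illustration only.
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch_size = 1000, 64, 8
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 128),
    torch.nn.Linear(128, vocab_size),
)  # stand-in for a Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))  # stand-in for tokenized web text
inputs, targets = tokens[:, :-1], tokens[:, 1:]               # shift by one: predict the next token

logits = model(inputs)                                        # (batch, seq_len - 1, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
print(f"next-token cross-entropy: {loss.item():.3f}")
```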
And you're just predicting text in a series of documents. This can be scaled to be very efficient, and there are a lot of numbers that are thrown around in AI training, like how many floating-point operations, or flops, are used, and then you can also look at how many hours of these GPUs are used. And it's largely one loss function taken to a very large amount of compute usage; you just set up really efficient systems, and then at the end of that, you have this base model. And post-training is where there is a lot more complexity, in terms of how the process is emerging or evolving and the different types of training losses we'll use. I think this is a lot of techniques grounded in the natural language processing literature. The oldest technique, which is still used today, is something called instruction tuning, also known as supervised fine-tuning. The acronyms will be IFT or SFT; people really go back and forth between them, and I will probably do the same. This is where you add this formatting to the model, where it knows to take a question that is like, explain the history of the Roman Empire to me, or some sort of question you'll see on Reddit or Stack Overflow, and then the model will respond in an information-dense but presentable manner. The core of that formatting is in this instruction-tuning phase. And then there are two other categories of loss functions that are being used today. One I will classify as preference fine-tuning. Preference fine-tuning is a generalized term for what came out of reinforcement learning from human feedback, which is RLHF. This reinforcement learning from human feedback is credited as the technique that helped ChatGPT break through. It is a technique to make the responses that are nicely formatted, like these Reddit answers, more in tune with what a human would like to read. This is done by collecting pairwise preferences from actual humans out in the world, to start, and now AIs are also labeling this data, and we'll get into those trade-offs. And you have this kind of contrastive loss function between a good answer and a bad answer, and the model learns to pick up these trends. There are different implementations: you have things called reward models, you could have direct alignment algorithms. There are a lot of really specific things you can do, but all of this is about fine-tuning to human preferences. And the final stage is much newer, and will link to what is done in R1 and these reasoning models. I think OpenAI's name for this, they had this new API in the fall, which they called the reinforcement fine-tuning API. This is the idea that you use the techniques of reinforcement learning, which is a whole framework of AI. There's a deep literature here; to summarize, it's often known as trial-and-error learning, or the subfield of AI where you're trying to make sequential decisions in a certain, potentially noisy, environment. There are a lot of ways we could go down that, but, fine-tuning language models where they can generate an answer, and then you check to see if the answer matches the true solution: for math or code, you have an exactly correct answer for math, you can have unit tests for code, and what we are doing is checking the language model's work, and we're giving it multiple opportunities on the same questions to see if it is right. And if you keep doing this, the models can learn to improve in verifiable domains to a great extent. It works really well.
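A toy sketch of that verifiable-reward idea, sampling several answers per question and rewarding only the ones that check out, might look like this; the generation and update functions are placeholders, not any lab's actual pipeline.

```python
# Toy sketch of reinforcement learning with verifiable rewards:
# sample multiple answers per question, check them against a known solution,
# and reinforce only the samples that are actually correct.
# `generate_answer` and `reinforce` are placeholders for a real policy model and RL update.
import random

def generate_answer(question: str) -> str:
    # stand-in for sampling from the language model being trained
    return str(random.choice([4, 5, 21, 22]))

def verify(answer: str, reference: str) -> float:
    # verifiable domains: exact match for math, unit tests for code
    return 1.0 if answer.strip() == reference.strip() else 0.0

def reinforce(question: str, answer: str, reward: float) -> None:
    # stand-in for the policy-gradient-style update on the model weights
    pass

dataset = [("What is 2 + 2?", "4"), ("What is 3 * 7?", "21")]
for question, reference in dataset:
    samples = [generate_answer(question) for _ in range(8)]   # multiple attempts per question
    rewards = [verify(answer, reference) for answer in samples]
    for answer, reward in zip(samples, rewards):
        reinforce(question, answer, reward)
    print(question, "accuracy this round:", sum(rewards) / len(rewards))
```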
It's a newer technique in the academic literature; it's been used at frontier labs in the US that don't share every detail, for multiple years. So this is the idea of using reinforcement learning with language models, and it has been taking off, especially in this DeepSeek moment. And we should say that there's a lot of exciting stuff going on, again, across the stack, but in post-training, probably this year, there's going to be a lot of interesting developments. We'll talk about it. I almost forgot to talk about the difference between DeepSeek V3 and R1 on the user experience side. So forget the technical stuff, forget all that; just for people that don't know anything about AI, they show up: what's the actual experience, what's the use case for each one when they actually type and talk to it? What is it good at, and that kind of thing? So let's start with DeepSeek V3. Again, it's what more people would have tried something like. You ask it a question, it'll start generating tokens very fast, and those tokens will look like a very human-legible answer. It'll be some sort of markdown list; it might have formatting to help draw you to the core details in the answer, and it'll generate tens to hundreds of tokens. A token is normally a word, for common words, or a subword part of a longer word, and it'll look like a very high-quality Reddit or Stack Overflow answer. These models are really getting good at doing this across a wide variety of domains. I think even things that, if you're an expert, things that are close to the fringe of knowledge, they will still be fairly good at. I think cutting-edge AI topics that I do research on, these models are capable as a study aid, and they're regularly updated. This changes with DeepSeek R1, what are called these reasoning models: when you see tokens coming from these models, to start, it will be a large chain-of-thought process, and we'll get back to chain of thought in a second, which looks like a lot of tokens where the model is explaining the problem. The model will often break down the problem and be like, okay, they asked me for this; let's break down the problem; I'm going to need to do this. And you'll see all of this generating from the model. It'll come very fast in most user experiences, these apps are very fast, so you'll see a lot of tokens, a lot of words, show up really fast. It'll keep flowing on the screen, and this is all the reasoning process. And then eventually the model will change its tone in R1, and it'll write the answer, where it summarizes its reasoning process and writes a similar answer to the first type of model. But in DeepSeek's case, which is part of why this was so popular even outside the AI community, you can see how the language model is breaking down problems, and then you get this answer. On a technical side, they train the model to do this specifically, where they have a section which is reasoning, and then it generates a special token, which is probably hidden from the user most of the time, which says, okay, I'm starting the answer. So the model is trained to do this two-stage process on its own. If you use a similar model in, say, OpenAI, OpenAI's user interface is trying to summarize this process for you nicely by kind of showing the sections that the model is doing, and it'll kind of click through, it'll say, breaking down the problem, making calculations, cleaning the result, and then the answer will come, for something like OpenAI. Maybe it's useful here to go through an example of DeepSeek R1 reasoning. Yeah, so if you're
looking at the screen here, what you'll see is a screenshot of the DeepSeek chat app, and at the top it says, thought for 157 seconds, with a drop-down arrow. Underneath that, if we were in the app and we were running it, the drop-down arrow would have the reasoning. So in this case, the question, the specific question, which, you know, I'm philosophically slash pothead inclined, so this is asking DeepSeek R1 for one truly novel insight about humans. And it reveals the reasoning, and basically the truly novel aspect was pushing the reasoning by constantly, sort of, the model asking itself: is this truly novel? So it's actually challenging itself to be more novel, more counterintuitive, less cringe, I suppose. So some of the reasoning says, and this is just snapshots: alternatively, humans have a unique meta-emotion where they feel emotions about their own emotions, like feeling guilty about being angry; this recursive emotional layering creates complex motivational drives that don't exist in other animals. The insight is that human emotions are nested. So it's reasoning through how humans feel emotions, it's reasoning about meta-emotions. It's going to have pages and pages of this; it's almost too much to actually read, but it's nice to skim as it's coming. It's a stream of, it's a James Joyce stream of consciousness. And then it goes: wait, the user wants something that's not seen anywhere else; let me dig deeper and consider the human ability to hold contradictory beliefs simultaneously; cognitive dissonance is known, but perhaps the function is to allow flexible adaptation, so on and so forth. I mean, that really captures the public imagination, that this is, I mean, intelligent, slash almost like an inkling of sentience, because you're thinking through, you're self-reflecting, you're deliberating. And the final result of that, after 157 seconds, is: humans instinctively convert selfish desires into cooperative systems by collectively pretending abstract rules, money, laws, rights, are real; these shared hallucinations act as, quote, "games" where competition is secretly redirected to benefit the group, turning conflict into society's fuel. Pretty profound. I mean, you know, this is a bit of a digression, but a lot of people have found that these reasoning models can sometimes produce much more eloquent text. That is at least an interesting example; I think, depending on how open-minded you are, you find language models interesting or not, and there's a spectrum there. Well, I mean, we'll talk about different benchmarks and so on, but some of it is just a vibe; like, that in itself is, let's say, a quote, fire tweet. Yeah, if I'm trying to produce something where people are like, oh, okay. So that's chain of thought; we'll probably return to it more. How are they able to achieve such low cost on the training and the inference? Maybe you could talk about the training first. Yeah, so there are two main techniques that they implemented that are probably the majority of their efficiency, and then there are a lot of implementation details that maybe we'll gloss over or get into later that sort of contribute to it. But those two main things are: one, they went to a mixture-of-experts model, which we'll define in a second, and then the other thing is that they implemented this new technique called MLA, latent attention. Both of these are big deals. Mixture of experts is something that's been in the literature for a handful of years, and OpenAI with GPT-4 was the first one to productize a mixture-of-experts model. And what this means is, when you
look at the common models around that most people have been able to interact with that are open, think Llama. Llama is a dense model, i.e. every single parameter or neuron is activated as you're going through the model for every single token you generate. Now, with a mixture-of-experts model, you don't do that. How does a human actually work? It's like, oh, well, my visual cortex is active when I'm thinking about vision tasks, my amygdala is active when I'm scared. These different aspects of your brain are focused on different things. A mixture-of-experts model attempts to approximate this to some extent. It's nowhere close to what a brain architecture actually is, but different portions of the model activate. You'll have a set number of experts in the model and a set number that are activated each time, and this dramatically reduces both your training and inference cost. Because now, if you think about the parameter count as the total embedding space for all of this knowledge that you're compressing down during training, when you're embedding this data in, instead of having to activate every single parameter every single time you're training or running inference, now you can just activate a subset, and the model will learn which expert to route to for different tasks. And so this is a humongous innovation in terms of, hey, I can continue to grow the total embedding space of parameters. And so DeepSeek's model is 600-something billion parameters, relative to Llama 405B, which is 405 billion parameters, or relative to Llama 70B, which is 70 billion parameters. So this model technically has more embedding space for information, to compress all of the world's knowledge that's on the internet down. But at the same time, it is only activating around 37 billion of the parameters, so only 37 billion of these parameters actually need to be computed every single time you're training on data or inferencing data out of it. Versus, again, the Llama models: 70 billion parameters must be activated, or 405 billion parameters must be activated. So you've dramatically reduced your compute cost when you're doing training and inference with this mixture-of-experts architecture. Should we break down where it actually applies and go into the Transformer? Is that useful? Let's go into the Transformer. The Transformer is a thing that is talked about a lot, and we will not cover every detail. Essentially, the Transformer is built on repeated blocks of this attention mechanism and then a traditional dense, fully connected multi-layer perceptron, whatever word you want to use for your normal neural network, and you alternate these blocks. There are other details, and where mixture of experts is applied is at this dense part: the dense layers hold most of the weights, if you count them, in a Transformer model, so you can get really big gains from mixture of experts in parameter efficiency, at training and inference, because you get this efficiency by not activating all of these parameters. We should also say that a Transformer is a giant neural network. Yeah. And for 15 years now there's been what's called the deep learning revolution; networks have gotten larger and larger. At a certain point, the scaling laws appeared, where people realized, and this is a scaling laws shirt, by the way, representing scaling laws, where it became more and more formalized that bigger is better across multiple dimensions of what bigger means.
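For reference, scaling laws of this kind are usually written as power laws in model size and data; a commonly cited Chinchilla-style form, given here for illustration rather than anything stated in the conversation, is:

```latex
% Chinchilla-style scaling law: loss as a function of parameters N and training tokens D.
% E is the irreducible loss; A, B, alpha, beta are fitted constants; C is training compute in FLOPs.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad C \approx 6\,N\,D
```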
But these are all sorts of neural networks we're talking about, and we're talking about different architectures for how to construct these neural networks such that the training and the inference on them is super efficient. Yeah. Every different type of model has a different scaling law for it, which is effectively, for how much compute you put in, the architecture will get to different levels of performance at test tasks. And mixture of experts is one of the ones where, at training time, even if you don't consider the inference benefits, which are also big, your efficiency with your GPUs is dramatically improved by using this architecture, if it is well implemented. So you can get effectively the same performance model, in evaluation scores, with numbers like 30% less compute. I think there's going to be a wide variation depending on your implementation details and stuff, but it is just important to realize that this type of technical innovation is something that gives huge gains, and I expect most companies that are serving their models to move to this mixture-of-experts implementation. Historically, the reason why not everyone might do it is because it's an implementation complexity, especially when doing these big models. So this is one of the things DeepSeek gets credit for: they do this extremely well. They do mixture of experts extremely well. This architecture, what is called DeepSeekMoE, MoE is the shortened version of mixture of experts, is multiple papers old. This part of their training infrastructure is not new to these models alone. And the same goes for what Dylan mentioned with multi-head latent attention. This is all about reducing memory usage during inference, and the same thing during training, by using some fancy low-rank approximation math. If you get into the details with this latent attention, it's one of those things where I look at it and it's like, okay, they're doing really complex implementations, because there are other parts of language models, such as embeddings, that are used to extend the context length. The common one that DeepSeek used is rotary positional embeddings, which is called RoPE. And if you want to use RoPE with a normal MoE, it's kind of a sequential thing: you take two of the attention matrices and you rotate them by a complex-valued rotation, which is a matrix multiplication. With DeepSeek's MLA, with this new attention architecture, they need to do some clever things, because they're not set up the same, and it just makes the implementation complexity much higher. So they're managing all of these things, and these are probably the sorts of things that OpenAI, these closed labs, are doing. We don't know if they're doing the exact same techniques, but they actually shared them with the world, which is really nice; like, this is the cutting edge of efficient language model training. And some of this requires low-level engineering; it is just a giant mess and trickery. So, as I understand it, they went below CUDA, so they go super low-level in programming the GPUs. Effectively, NVIDIA builds this library called NCCL, pronounced "nickel", in which, when you're training a model, you have all these communications between every single layer of the model, and you may have over a hundred layers. What does NCCL stand for? It's the NVIDIA Collective Communications Library. Nice. And so, when you're training a model, you're going to have all these all-reduces and all-gathers; between each layer, between the multi-layer perceptron, or feed-forward network, and the attention mechanism,
you'll have, basically, the model synchronized, or you'll have an all-reduce and an all-gather, and this is a communication between all the GPUs in the network, whether it's in training or inference. So NVIDIA has a standard library. This is one of the reasons why it's really difficult to use anyone else's hardware for training: because no one's really built a standard communications library. And NVIDIA's done this at a sort of higher level. DeepSeek, because they have certain limitations around the GPUs that they have access to, the interconnects are limited to some extent by the restrictions of the GPUs that were shipped into China legally, not the ones that are smuggled but legally shipped in, that they used to train this model, they had to figure out how to get efficiencies. And one of those things is that, instead of just calling the NVIDIA library NCCL, they instead created their own: they scheduled their own communications, which some of the labs do. Meta talked about, in Llama 3, how they made their own custom version of NCCL. They didn't talk about the implementation details; this is some of what they did, probably not as well, maybe not as well as DeepSeek, because for DeepSeek, you know, necessity is the mother of invention, and they had to do this, whereas in the case of, you know, OpenAI, they have people that do this sort of stuff, Anthropic, etc. But DeepSeek certainly did it publicly, and they may have done it even better, because they were gimped on a certain aspect of the chips that they have access to. And so they scheduled communications by scheduling specific SMs. SMs you could think of as, like, the cores on a GPU; there are hundreds of cores, or, you know, a bit over a hundred of these SMs, on a GPU, and they were specifically scheduling: hey, which ones are running the model, which ones are doing all-reduce, which ones are doing all-gather, and they would flip back and forth between them. And this requires extremely low-level programming. This is what NCCL does automatically, or other NVIDIA libraries handle this automatically, usually. Yeah, exactly. And so technically they're using PTX, which is, sort of, you could think of it as an assembly-type language; it's not exactly that, or an instruction set, like coding directly to assembly or an instruction set; it's not exactly that, but it's still technically part of CUDA. But it's like: do I want to write in Python, you know, the PyTorch equivalent, and call NVIDIA libraries? Do I want to go down to the C level and code even lower level? Or do I want to go all the way down to the assembly or ISA level? And there are cases where you go all the way down there at the very big labs, but most companies just do not do that, because it's a waste of time, and the efficiency gains you get are not worth it. But DeepSeek's implementation is so complex, especially with their mixture of experts. People have done mixture of experts, but they're generally 8 or 16 experts, and they activate two. So one of the words we like to use is sparsity factor, or usage. So you might have one fourth of your model activate, and that's what Mistral's Mixtral model did, their model that really catapulted them to, like, oh my God, they're really, really good. OpenAI has also had models that are MoE, and so have all the other labs that are major closed labs.
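To ground the all-reduce and all-gather collectives mentioned above, here is a minimal sketch of a standard all-reduce call through PyTorch's distributed API, which uses NCCL as its GPU backend. The launch setup and tensor are illustrative, and this is the ordinary library path, not DeepSeek's custom SM-level scheduling.

```python
# Minimal sketch of the all-reduce collective that NCCL provides:
# every GPU contributes a tensor (e.g. its local gradients) and every GPU
# receives the element-wise sum. Launch with: torchrun --nproc_per_node=8 this_file.py
# Process-group settings here are illustrative, not DeepSeek's custom scheduling.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")             # NCCL handles GPU-to-GPU communication
rank = dist.get_rank()
torch.cuda.set_device(rank)

local_grads = torch.ones(4, device="cuda") * rank   # stand-in for this GPU's gradients
dist.all_reduce(local_grads, op=dist.ReduceOp.SUM)  # after this, every rank holds the sum
print(f"rank {rank} sees {local_grads.tolist()}")

dist.destroy_process_group()
```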
But what DeepSeek did, that maybe only the leading labs have just started recently doing, is have such a high sparsity factor. It's not one fourth of the model, two out of eight experts, activating every time you go through the model; it's 8 out of 256. And there are different implementations of mixture of experts where you can have some of these experts that are always activated, which just looks like a small neural network, and all the tokens go through that, and then they also go through some that are selected by this routing mechanism. And one of the innovations in DeepSeek's architecture is that they changed the routing mechanism in mixture-of-experts models. There's something called an auxiliary loss, which effectively means, during training, you want to make sure that all of these experts are used across the tasks that the model sees. Where there can be failures in mixture of experts is that, when you're doing this training, the one objective is token prediction accuracy, and if you just let training go with a mixture-of-experts model on its own, it can be that the model learns to only use a subset of the experts. And in the MoE literature, there's something called the auxiliary loss, which helps balance them. But if you think about the loss functions of deep learning, and this even connects to the bitter lesson, you want to have the minimum inductive bias in your model, to let the model learn maximally, and this auxiliary loss, this balancing across experts, could be seen as in tension with the prediction accuracy of the tokens. So, we don't know the exact extent of the DeepSeek change, but it is that, instead of doing an auxiliary loss, they have an extra parameter in their routing, which, after the batches, they update to make sure that the next batches all have a similar usage of experts. And this type of change can be big, it can be small, but they add up over time, and this is the sort of thing that just points to them innovating. And I'm sure all the labs that are training big models are looking at these sorts of things, which is getting away from the auxiliary loss; some of them might already use it, but you just keep accumulating gains. And we'll talk about the philosophy of training and how you organize these organizations, and a lot of it is just compounding small improvements over time in your data, in your architecture, in your post-training, and how they integrate with each other. DeepSeek does the same thing, and some of them are shared; we have to take them at face value that they share their most important details. I mean, the architecture and the weights are out there, so we're seeing what they're doing, and it adds up. Going back to the efficiency and complexity point: it's 32 versus 4 for, like, Mixtral and other MoE models that have been publicly released. So this ratio is extremely high, and sort of what Nathan was getting at there was, when you have such a different level of sparsity, you can't just have every GPU hold the entire model; the model's too big, there's too much complexity there, so you have to split up the model with different types of parallelism. And so you might have different experts on different GPU nodes, but now what happens when this set of data that you get, hey, all of it looks one way, and all of it should route to one part of the model? When all of it routes to one part of the model, you can have this overloading of a certain set of the GPU resources, or a certain set of the GPUs.
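Pulling the routing ideas together, here is a rough sketch of top-k expert selection plus a bias-style balancing adjustment applied between batches instead of an auxiliary loss. The shapes, the update rule, and the step size are illustrative guesses, not DeepSeek's exact implementation.

```python
# Rough sketch of mixture-of-experts routing with a bias-based balancing adjustment.
# A router scores every expert per token, only the top-k experts run, and a per-expert
# bias is nudged between batches so that over- and under-used experts even out.
# The shapes, k value, and update size are illustrative, not DeepSeek's exact recipe.
import torch

num_experts, k, hidden = 256, 8, 1024
router = torch.nn.Linear(hidden, num_experts)
expert_bias = torch.zeros(num_experts)          # adjusted between batches, not by gradients

def route(tokens: torch.Tensor) -> torch.Tensor:
    scores = router(tokens) + expert_bias       # (num_tokens, num_experts)
    return scores.topk(k, dim=-1).indices       # which k experts each token is sent to

def rebalance(chosen: torch.Tensor, step: float = 1e-3) -> None:
    # count how often each expert was picked; push the bias down for overused experts
    # and up for underused ones so the next batch spreads load more evenly
    counts = torch.bincount(chosen.flatten(), minlength=num_experts).float()
    target = counts.mean()
    expert_bias.add_(step * torch.sign(target - counts))

tokens = torch.randn(512, hidden)               # stand-in for one batch of token activations
chosen = route(tokens)
rebalance(chosen)
print("load on first 8 experts:", torch.bincount(chosen.flatten(), minlength=num_experts)[:8].tolist())
```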
And then the rest of the training network sits idle, because all of the tokens are just routing to that. So this is one of the big complexities with running a very sparse mixture-of-experts model, i.e. this 32 ratio versus this 4 ratio: you end up with so many of the experts just sitting there idle. So how do I load-balance between them? How do I schedule the communications between them? This is a lot of the extremely low-level, detailed work that they figured out, in the public, first, and potentially second or third in the world, and maybe even first in some cases. What lesson, in the direction of the bitter lesson, do you take from all of this? Where is this going? Is this going to be the direction where a lot of the gain is going to be, which is this kind of low-level optimization, or is this a short-term thing where the biggest gains will be more on the algorithmic, high-level side, like post-training? Is this a short-term leap because they figured out a hack, because constraints, necessity is the mother of invention, or are there still a lot of gains? I think we should summarize what the bitter lesson actually is about. The bitter lesson, essentially, if you paraphrase it, is that the types of training that will win out in deep learning as we go are those methods which are scalable in learning and search, is what it calls out. The scale word gets a lot of attention in this. The interpretation that I use is that it's effectively about avoiding adding human priors to your learning process, and if you read the original essay, this is what it talks about: how researchers will try to come up with clever solutions to their specific problem that might get them small gains in the short term, while simply enabling these deep learning systems to work efficiently, and on these bigger problems, in the long term might be more likely to scale and continue to drive success. And therefore, we were talking about relatively small implementation changes to the mixture-of-experts model, and therefore it's like, okay, we will need a few more years to know if one of these was actually really crucial to the bitter lesson, but the bitter lesson is really this long-term arc of how simplicity can often win. And there are a lot of sayings in the industry, like, the models just want to learn; you have to give them the simple loss landscape where you put compute through the model and they will learn, and get barriers out of the way. That's where the power of something like NCCL comes in, where standardized code can be used by a lot of people to create sort of simple innovations that can scale. Which is why the hacks, I imagine the code base for DeepSeek is probably a giant mess. I'm sure they have, DeepSeek definitely has, code bases that are extremely messy, where they're testing these new ideas. Multi-head latent attention probably could start in something like a Jupyter notebook, or somebody tries something on a few GPUs, and that is really messy. But the stuff that trains DeepSeek V3 and DeepSeek R1, those libraries, if you were to present them to us, I would guess are extremely high-quality code, high-quality, readable code. I think there is one aspect to note, though, which is that there is the general ability for that to transfer across different types of runs. You may make really, really high-quality code for one specific model architecture at one size, and then that is not
transferable to, hey, when I make this architecture tweak, everything's broken again. That's something that could be the case with their specific low-level coding of, like, scheduling SMs: it's specific to this model architecture and size. Whereas NVIDIA's collectives library is more like, hey, it'll work for anything. You want to do an all-reduce? Great, I don't care what your model architecture is, it'll work. And you're giving up a lot of performance when you do that, in many cases, but it's worth it for them to do the specific optimization for the specific run, given the constraints that they have regarding compute. I wonder how stressful it is to, you know, with these frontier models, initiate training, to have the code, to push the button where you're now spending a large amount of money and time to train this. I mean, there must be a lot of innovation on the debugging stage, of making sure there are no issues, that you're monitoring and visualizing every aspect of the training, all that kind of stuff. When people are training, they have all these various dashboards, but the most simple one is your loss, and it continues to go down. But in reality, especially with more complicated stuff like MoE, the biggest problem with it, or FP8 training, which is another innovation, going to a lower-precision number format, i.e. less accurate, is that you end up with loss spikes. And no one knows why the loss spike happened. Some of them, a lot of them, are data. I'll give an AI2 example of what blew up our earlier models: a subreddit called microwave gang. We love to shout this out. It's a real thing, you can pull up microwave gang. Essentially, it's a subreddit where everybody makes posts that are just the letter M, so there are extremely long sequences of the letter M, and then the comments are like, beep beep, because that's when the microwave ends. But if you pass this into a model that's trained to produce normal text, it's extremely high loss, because normally you see an M, you don't predict M's for a long time. So this is something that caused loss spikes for us. But this is old, this is not recent, and when you have more mature data systems, that's not the thing that causes the loss spike. And what Dylan is saying is true, but there are levels to this sort of idea. With regards to the stress: these people, you'll go out to dinner with a friend that works at one of these labs, and they'll just be looking at their phone every ten minutes, and it's one thing if they're texting, but they're just like: is the loss, are the tokens per second, has the loss not blown up? They're just watching this, and the heart rate goes up if there's a spike. And some level of spikes is normal; it'll recover and be back. Sometimes, a lot of the old strategy was, you just stop the run, restart from an older version, and then change the data mix, and then it keeps going. There are even different types of spikes, so Dirk Groeneveld at AI2 has a theory that it's, like, fast spikes and slow spikes, where there are sometimes where you're looking at the loss, and there are other parameters, and you can see it start to creep up and then blow up, and that's really hard to recover from, so you have to go back much further. So you have the stressful period where it's, like, flat or might start going up, and you're like, what do I do?
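A toy version of the kind of loss-spike check being watched on those dashboards might look like the following; the window size and threshold are made up, and the decision about what to do with a flagged batch, as described next, is left to the caller.

```python
# Toy loss-spike monitor: compare each new loss value against the recent running average
# and flag batches that jump far above it. Window size and threshold are made up.
from collections import deque

recent = deque(maxlen=100)               # rolling window of recent loss values

def is_spike(loss: float, threshold: float = 1.5) -> bool:
    if len(recent) < recent.maxlen:      # still warming up the baseline
        recent.append(loss)
        return False
    baseline = sum(recent) / len(recent)
    spike = loss > threshold * baseline
    if not spike:
        recent.append(loss)              # only track "healthy" losses in the baseline
    return spike

losses = [2.1, 2.0, 1.9] * 40 + [9.7]    # a flat run, then one bad batch
for step, loss in enumerate(losses):
    if is_spike(loss):
        print(f"step {step}: loss {loss} looks like a spike, consider skipping this update")
```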
Whereas there are also loss spikes where it looks good and then there's one spiky data point, and what you can do is you just skip those: you see that there's a spike, you're like, okay, I can ignore this data, don't update the model, and do the next one, and it'll recover quickly. But these become trickier implementations, so as you get more complex in your architecture and you scale up to more GPUs, you have more potential for your loss blowing up. So there's a distribution. The whole idea of grokking also comes in: just because it slowed down in improving the loss doesn't mean it's not learning, because all of a sudden it could just spike down in loss again, because it truly learned something, and it took some time for it to learn that; it's not a gradual process. And that's what humans are like, that's what models are like. So it's really a stressful task, as you mentioned, and the whole time the dollar count is going up. Every company has failed runs; you need failed runs to push the envelope on your infrastructure. So a lot of news cycles are made of, X company had Y failed run. Every company that's trying to push the frontier of AI has these, so yes, it's noteworthy, because it's a lot of money and it can be a week-to-a-month setback, but it is part of the process. But how do you, if you're DeepSeek, how do you get to a place where there's a successful combination of hyperparameters? A lot of small failed runs. And so, rapid iteration through failed runs until successful ones? You just, and then you build up an intuition, like, this mixture of experts works, and then this implementation of MLA works. Key hyperparameters, like learning rate and regularization and things like this, and you find the regime that works for your code base. From talking to people at frontier labs, there's a story you can tell where training language models is kind of a path that you need to follow: you need to unlock the ability to train a certain type of model or a certain scale, and then your code base and your internal know-how of what type of hyperparameters work for it is kind of known. And you look at the DeepSeek papers and models: they've scaled up, they've added complexity, and it's just continuing to build the capabilities that they have. There's the concept of a YOLO run; YOLO, you only live once. And what it is, is, there's all this experimentation you do at the small scale, research ablations: you have your Jupyter notebook, where you're experimenting with MLA on, like, three GPUs or whatever, and you're doing all these different things, like, hey, do I do four active experts, 128 experts, do I arrange the experts this way, all these different model architecture things, and you're testing at a very small scale, a couple of researchers, a few GPUs, tens of GPUs, hundreds of GPUs, whatever it is. And then all of a sudden you're like, okay, guys, no more screwing around; everyone take all the resources we have, let's pick what we think will work, and just go for it: YOLO. And this is where that sort of stress comes in: well, I know it works here, but some things that work here don't work there, and some things that work at this scale don't work down at that scale. So it's really, truly a YOLO run.
And there is this discussion of, like, certain researchers just have this methodical nature: they can find the whole search space and figure out all the ablations of different research and really see what is best. And there are certain researchers who just kind of, like, you...