State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, GPUs, AGI | Lex Fridman Podcast #490
EV7WhVT270Q • 2026-01-31
The following is a conversation all about the state of the art in artificial intelligence, including some of the exciting technical breakthroughs and developments in AI that happened over the past year, and some of the interesting things we think might happen this upcoming year. At times it does get super technical, but we do try to make sure that it remains accessible to folks outside the field without ever dumbing it down. It is a great honor and pleasure to be able to do this kind of episode with two of my favorite people in the AI community, Sebastian Raschka and Nathan Lambert. They are both widely respected machine learning researchers and engineers who also happen to be great communicators, educators, writers, and X posters. Sebastian is the author of two books I highly recommend for beginners and experts alike: first, Build a Large Language Model (From Scratch), and second, Build a Reasoning Model (From Scratch). I truly believe that in the machine learning and computer science world, the best way to learn and understand something is to build it yourself from scratch. Nathan is the post-training lead at the Allen Institute for AI and author of the definitive book on reinforcement learning from human feedback. Both of them have great X accounts and great Substacks, Sebastian has courses on YouTube, Nathan has a podcast, and everyone should absolutely follow all of those. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description, where you can also find links to contact me, ask questions, give feedback, and so on. And now, dear friends, here's Sebastian Raschka and Nathan Lambert. So, I think one useful lens to look at all of this through is the so-called DeepSeek moment.
This happened about a year ago, in January 2025, when the open-weight Chinese company DeepSeek released DeepSeek R1, which, I think it's fair to say, surprised everyone with near or at state-of-the-art performance with allegedly much less compute, for much cheaper. From then to today, the AI competition has gotten insane, both on the research level and the product level; it's just been accelerating. Let's discuss all this today, and maybe let's start with some spicy questions if we can. Who's winning at the international level? Would you say it's the set of companies in China or the set of companies in the United States? And Sebastian, Nathan, it's good to see you guys. So Sebastian, who do you think is winning? >> So "winning" is a very broad term. You mentioned the DeepSeek moment, and I do think DeepSeek is definitely winning the hearts of the people who work on open-weight models, because they share theirs as open models. Winning also has multiple time scales to it: there's today, there's next year, there's ten years from now. One thing I know for sure is that I don't think, nowadays in 2026, there will be any company that has access to a technology no other company has access to. And that is mainly because researchers are frequently changing jobs, changing labs; they rotate. So I don't think there will be a clear winner in terms of technology access. However, I do think the differentiating factor will be budget and hardware constraints. The ideas won't be proprietary, but the resources needed to implement them will be. So I don't currently see a winner-take-all scenario; I can't see that at the moment. >> Nathan, what do you think? >> You see the labs put different energy into what they're trying to do. And to demarcate the point in time when we're recording this:
The hype over Anthropic's Claude Opus 4.5 model has been absolutely insane. I've used it and built stuff with it in the last few weeks, and it's almost gotten to the point where it feels like a bit of a meme in terms of the hype. And it's kind of funny, because this is very organic. If we go back a few months, and we can get the release date in the notes, Gemini 3 from Google got released, and it seemed like the marketing and just the wow factor of that release was super high. But then at the end of November, Claude Opus 4.5 was released, and the hype has been growing, even though Gemini 3 came before it. And it kind of feels like people don't really talk about it as much, even though when it came out, everybody was like, "This is Gemini's moment to retake Google's structural advantages in AI." And Gemini 3 is a fantastic model, and I still use it; its differentiation is just lower. And I agree with what Sebastian is saying: the idea space is very fluid. But culturally, Anthropic is known for betting very hard on code, which is why the Claude Code thing is working out for them right now. So I think that even if the ideas flow pretty freely, so much of this is bottlenecked by human effort and the culture of organizations, and Anthropic at least seems to be presenting as the least chaotic, which is a bit of an advantage if they can keep doing that for a while. But on the other side of things, there's a lot of ominous technology from China, where there are many more labs than DeepSeek. DeepSeek kicked off a movement within China, kind of similar to how ChatGPT kicked off a movement in the US where everything had a chatbot. There are now tons of tech companies in China releasing very strong frontier open-weight models, to the point where I would say that DeepSeek is kind of losing its crown as the preeminent open model maker in China.
And the likes of Z.AI with their GLM models, MiniMax's models, and especially Moonshot AI's Kimi in the last few months, have shone more brightly. The new DeepSeek models are still very strong, but this could be looked back on as a big narrative point: in 2025 DeepSeek came, and it provided this platform for way more Chinese companies to release these fantastic models and operate in this new way. So these models from the Chinese companies are open weights, and depending on the trajectory of the business models the American companies are pursuing, those could be at risk. Currently, a lot of people are paying for AI software in the US, while historically, in China and other parts of the world, people don't pay a lot for software. >> So some of these models, like DeepSeek's, have the love of the people because they are open weight. How long do you think the Chinese companies will keep releasing open-weight models? >> I would say for a few years. In the US there's not a clear business model for it. I have been writing about open models for a while, and these Chinese companies have realized it. So I get inbound from some of them, and they're smart and recognize the same constraints, which is that a lot of US tech companies and other IT companies won't pay for an API subscription to Chinese companies because of security concerns. This has been a long-standing habit in tech, and the people at these companies then see open-weight models as a way to build influence and take part in a huge, growing AI expenditure market in the US. They're very realistic about this, and it's working for them, and I think the government will see that this is building a lot of influence internationally in terms of uptake of the technology. So there are going to be a lot of incentives to keep it going, but building these models and doing the research is very expensive.
So at some point I expect consolidation, but I don't expect that to be the story of 2026; there will be more open model builders throughout 2026 than there were in 2025, and a lot of the notable ones will be in China. You were going to say something? >> Yes. You mentioned DeepSeek losing its crown. I do think to some extent yes, but we also have to consider that they are still, I would say, slightly ahead, and it's not that DeepSeek got worse. It's just that the others are using the ideas from DeepSeek. For example, you mentioned Kimi: same architecture, and they're training it. And then again we have this leapfrogging, where one might be a bit better at some point in time simply because they have the more recent model. And I think this comes back to the fact that there won't be a clear winner. It will just be like that: one company releases something, the other one follows, and the most recent model is probably always the best model. >> Yeah. We'll also see that the Chinese companies have different incentives. DeepSeek is very secretive, whereas some of these startups, like the MiniMaxes and Z.AIs of the world, those two have literally filed IPO paperwork, and they're trying to get Western mindshare and doing a lot of outreach there. So I don't know if these incentives will change the model development, because DeepSeek, famously, is built by a hedge fund, High-Flyer Capital, and we don't know exactly what they use the models for or whether they care about this. >> They're secretive in terms of communication, but they're not secret in terms of the technical reports that describe how their models work. They're still open on that front. And we should also say, on the Opus 4.5 hype, there's a layer of something being the darling of the X/Twitter echo chamber versus the actual number of people using the model.
I think it's probably fair to say that ChatGPT and Gemini are focused on the broad user base that just wants to solve problems in their daily lives, and that user base is gigantic. So the hype about the coding may not represent the actual use. >> I would say a lot of the usage patterns are, like you said, name recognition and brand, but also almost muscle memory: ChatGPT has been around for a long time, people just got used to using it, and it's almost like a flywheel, where they recommend it to other users, and so on. One interesting point is also the customization of LLMs. For example, ChatGPT has a memory feature, so you may have a subscription and use it for personal stuff, but I don't know if you want to use that same thing at work, because that's a boundary between private and work. If you're working at a company, they might not allow that, or you may not want it. And I think that's also an interesting point: you might have multiple subscriptions. One is just clean code; it has none of your personal images or hobby projects in there; it's just the work thing. And the other one is your personal thing. So those are two different use cases, and it doesn't mean you only have to have one. I think the future is also multiple ones. >> What model do you think won 2025, and what model do you think is going to win 2026? >> In the context of consumer chatbots, the question is whether you're willing to bet on Gemini over ChatGPT, which in my gut feels like a bit of a risky bet, because OpenAI has been the incumbent, and there are so many benefits to that in tech. But I think the momentum in 2025 was on Gemini's side, though they were starting from such a low point. RIP Bard and those earlier attempts at getting started. Huge credit to them for powering through the organizational chaos to make that happen.
But also, it's hard to bet against OpenAI, because they always come off as so chaotic, but they're very good at landing things. And personally, I have very mixed reviews of GPT-5, but it had to have saved them so much money, with the headline feature being a router, so that most users are no longer incurring as much in GPU costs. So I think it's very hard to dissociate the things that I like in models from the things that will actually be a differentiator for the general public. >> What do you think about 2026? Who's going to win? >> I'll say something even though it's risky. I think Gemini will continue to make progress on ChatGPT. When both of these are operating at such extreme scales, Google has the ability to separate research and product a bit better, whereas you hear so much about OpenAI being chaotic operationally and chasing the high-impact thing, which is a very startup culture. And then on the software and enterprise side, I think Anthropic will have continued success, as they've again and again been set up for that. Obviously Google Cloud has a lot of offerings, but I think the Gemini name brand is important for them to build, and Google Cloud will continue to do well. That's a more complex thing to explain in the ecosystem, because it's competing with the likes of Azure and AWS rather than on the model provider side. >> So on infrastructure, you think TPUs give an advantage? >> Largely because the margin on Nvidia chips is insane, and Google can develop everything from top to bottom to fit their stack and not have to pay that margin, and they've had a head start in building data centers. So for all of these things that have both long lead times and very high margins on high costs, Google just has a historical advantage there.
And if there's going to be a new paradigm, it's most likely to come from OpenAI, where their research division again and again has shown this ability to land a new research idea or product. Deep Research, Sora, o1 and thinking models: all of these definitional things have come from OpenAI, and that's got to be one of their top traits as an organization. So it's kind of hard to bet against that. But I think a lot of this year will be about scale and optimizing what could be described as low-hanging fruit in models. >> And clearly there's a trade-off between intelligence and speed. This is what GPT-5 was trying to solve behind the scenes: do people, the broad public, actually want intelligence, or do they want speed? >> I think it's a nice variety, actually, or at least the option to have a toggle there. For my personal usage, most of the time when I look something up, I use ChatGPT to ask a quick question; I want the information fast. For most daily tasks I use the quick model nowadays, and I think the auto mode is pretty good, where you don't have to specifically say thinking or non-thinking. Then again, I also sometimes want the Pro mode. Very often, when I have something written, I put it into ChatGPT and say, hey, do a very thorough check: are all my references correct, are all my thoughts correct, did I make any formatting mistakes, are the figure numbers wrong, or something like that? And I don't need that right away. It's like, okay, I finish my stuff, maybe have dinner, let it run, come back, and go through it. And see, this is where I think it's important to have this option. I would go crazy if for each query I had to wait 30 minutes or 10 minutes. >> That's me. >> Yeah. >> I'm sitting over here losing my mind that you use the router and the non-thinking model. I'm like, "Oh, how do you live with that?"
That's my reaction. I've been heavily on ChatGPT for a while, and I've never touched 5 non-thinking. I find its tone off, and it has a higher likelihood of errors. Some of this is from back when OpenAI released o3, which was the first model to do this deep search, find many sources, and integrate them for you. So I became habituated to that. I will only use GPT-5.2 Thinking or Pro when I'm running any sort of information query for work, whether that's a paper or some code reference that I found, and I will regularly have like five Pro queries going simultaneously, each looking for one specific paper or feedback on an equation or something. >> I have a funny example where I just needed an answer as fast as possible, for this podcast, before I was going on a trip. I have a local GPU running at home, and I wanted to run a long RL experiment. Usually I also unplug things, because if you're not at home, you don't want to have things plugged in, and I had accidentally unplugged the GPU. My wife was already in the car, and I was like, "Oh, dang." And then basically I wanted, as fast as possible, a bash script that runs my different experiments and the evaluation. And, you know, I did learn how to use the bash terminal, but in that moment I just needed it in 10 seconds: give me the command. >> This is a hilarious situation, but yeah, so what did you use? >> So I used the non-thinking, fastest model. It gave me the bash command to chain the different scripts together, and then there's the tee thing, where you want to route the output to a log file. Off the top of my head, in a hurry, I couldn't; I could have thought about it myself. >> By the way, I don't know if that's a representative case: wife waiting in the car, you have to run, plug in the GPU, you have to generate a bash script. It sounds like a movie, like Mission Impossible.
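For readers curious about the bash pattern Sebastian is describing, a minimal sketch might look like the following. The script names here are hypothetical stand-ins for his experiment and evaluation scripts; the two key pieces are `&&` to chain the steps and `tee` to route a copy of the output into a log file.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-ins for real experiment/evaluation scripts.
printf '#!/bin/sh\necho "experiment done"\n' > run_experiment.sh
printf '#!/bin/sh\necho "evaluation done"\n' > run_eval.sh
chmod +x run_experiment.sh run_eval.sh

# Chain the scripts with && so evaluation only runs if the
# experiment succeeds, and tee everything (stdout and stderr)
# into run.log so it can be inspected later.
{ ./run_experiment.sh && ./run_eval.sh; } 2>&1 | tee run.log
```

With `tee`, the output still streams to the terminal while a copy lands in `run.log`, which is handy when the machine will be left running unattended.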
>> I use Gemini for that. So I use Thinking for all the information stuff, and then Gemini for fast things, or stuff that I could sometimes Google. It's good at explaining things, I trust that it has this kind of background knowledge, it's simple, and the Gemini app has gotten a lot better; it's good for that sort of thing. And then for code and any sort of philosophical discussion, I use Claude Opus 4.5, always with extended thinking. Extended thinking and inference-time scaling are just ways to make the models marginally smarter, and I will always err on that side while the progress is this fast, because you don't know when that'll unlock a new use case. And then I sometimes use Grok for real-time information, or for finding something on AI Twitter that I know I saw and need to dig up. Although, when Grok 4 came out, Grok 4 Heavy, which was their pro variant, was actually very good, and I was pretty impressed with it, but I kind of lost track of it out of muscle memory, with the ChatGPT app being the one I have open. So I use many different things. >> Yeah, I actually do use Grok 4 Heavy for debugging, for hardcore debugging that the other ones can't solve. I find it's the best at that. And it's interesting, because you say ChatGPT is the best interface for you for that same reason, but this could be just momentum; Gemini is the better interface for me. I think it's because I fell in love with its needle-in-the-haystack performance: if I ever put in something that has a lot of context but I'm looking for very specific kinds of information, I want to make sure it tracks all of it, and I find that Gemini, for me, has been the best. So it's funny with some of these models: if they win your heart over for one particular feature, on one particular day, >> for that particular query, that prompt, >> you're like, "This model is better." And so you'll just stick with it for a bit, until it does something really dumb.
There's like a threshold effect: it does some smart thing and you fall in love with it, and then it does some dumb thing and you're like, you know what, I'm going to switch and try Claude and try GPT and all that kind of stuff. >> This is exactly it: you use it until it breaks, until you have a problem, and then you change the LLM. And I think it's the same with how we use anything, like our favorite text editor, operating systems, or the browser. I mean, there are so many browser options: Safari, Firefox, Chrome. They're all relatively similar, but then there are edge cases, maybe extensions you want to use, and then you switch. But I don't think there is anyone who types the same website into different browsers and compares them. You only do that when the website doesn't render, if something breaks. So that's a good point: you use it until it breaks, and then you explore other options. >> On the long-context thing, I was also a Gemini user for this, but the GPT-5.2 release blog had crazy long-context scores, to the point where a lot of people were asking whether they had just figured out some algorithmic change. It went from something like 30% to something like 70% in this minor model update. So it's also very hard to keep track of all of these things. But now I look more favorably at GPT-5.2's long context. It's just this never-ending battle of, how do I actually get to testing this? >> It's interesting that none of us talked about the Chinese models from a user perspective. What does that say? Does that mean the Chinese models are not as good, or does it mean we're just very biased and US-focused? >> I do think that's currently the discrepancy between just the model and the platform. The open models are more known for the open weights, not yet for their platform. >> There are also a lot of companies that are willing to sell you the open model inference at a very low cost.
Like OpenRouter makes it easy to look at multi-model things, and you could run DeepSeek on Perplexity. I think all of us sitting here use OpenAI's GPT-5 Pro consistently; we're all willing to pay for the marginal intelligence gain, and in terms of the outputs, these models from the US are better. I think the question is whether they'll stay better this year and in the years going forward, but as long as they're better, I'm going to pay to use them. There's also analysis showing that the way the Chinese models are served, which you could argue is due to export controls or not, involves fewer GPUs per replica, which makes them slower and gives them different errors. And speed and intelligence: if those are in your favor as a user, I think a lot of US users will go for that. I think that's one thing that will spur these Chinese companies to compete in other ways, whether that's free or substantially lower-cost offerings, or it'll breed creativity in terms of products, which is good for the ecosystem. But I just think the simple thing is that the US models are currently better and we use them. I try the Chinese models, I try these other open models, and I'm like, fun, but I don't go back to them. >> We didn't really mention programming. That's another use case a lot of people deeply care about. I use basically half and half Cursor and Claude Code, because I find them to be fundamentally different experiences, and both useful. What do you guys use? You program quite a bit. What's the current vibe? >> So I use the Codex plugin for VS Code. It's very convenient: it's just a plugin, and then it's a chat interface that has access to your repository. I know that Claude Code is, I think, a bit different. It is a bit more agentic. It touches more things. It does a whole project for you.
I'm not quite there yet, where I'm comfortable with that, because maybe I'm a control freak, but I still would like to see a bit of what's going on. And Codex is right now, for me, the sweet spot, where it is helping me but not taking over completely. >> I should mention that one of the reasons I do use Claude Code is to build the skill of programming with English. The experience is fundamentally different. As opposed to micromanaging the details of the code-generation process, looking at the diff, which you can in Cursor, if that's the IDE you use, and then changing, altering, reading, and understanding the code deeply as you progress, you're instead thinking in this design space and guiding it at a macro level, which I think is another way of thinking about the programming process. Also, we should say that Claude Code just seems to be, somehow, a better utilization of Claude Opus 4.5. >> It's a good side-by-side for people to do. You can have Claude Code open, you can have Cursor open, you can have VS Code open, and you can select the same model on all of them and ask questions. It's very interesting: Claude Code is way better in that domain. It's remarkable. >> All right, we should say that both of you are legit on multiple fronts: researchers, programmers, educators, X posters, and on the book front, too. So Nathan, at some point soon, hopefully, has an RLHF book coming out. >> It's available for pre-order, and there's a full digital preprint. I'm just making it pretty and better organized for the physical version, which is a lot of why I do it, because it's fun to create things that you think are excellent in physical form when so much of our life is digital. >> I should say, going to Perplexity here: Sebastian Raschka is a machine learning researcher and author known for several influential books.
A couple of them I wanted to mention, which I highly recommend: Build a Large Language Model (From Scratch), and the new one, Build a Reasoning Model (From Scratch). So, I'm really excited about that. Building stuff from scratch is one of the most powerful ways of learning. >> Honestly, building an LLM from scratch is a lot of fun. It's also a lot to learn. And like you said, it's probably the best way to learn how something really works, because you can look at figures, but figures can have mistakes. You can look at concepts and explanations, but you might misunderstand them. But if you see that there is code and the code works, you know it's correct. There's no misunderstanding; it's precise, otherwise it wouldn't work. And I think that's the beauty behind coding: it doesn't lie. It's math, basically. Though even with math, you can have mistakes in a book and never notice, because you're not running the math when you're reading the book; you can't verify it. With code, what's nice is you can verify it. >> Yeah, I agree with you about the LLM from scratch book. It's nice to tune out everything else, the internet and so on, and just focus on the book. But, you know, I read several history books with an LLM, and it's just less lonely somehow; it's more fun. Like, for example, on the programming front, I think it's genuinely more fun to program with an LLM. >> And I think it's genuinely more fun to read with an LLM, >> but you're right that this distraction should be minimized. So you use the LLM to basically enrich the experience, maybe add more context. The rate of aha moments for me, on a small scale, is really high with LLMs. >> 100%. I also want to correct myself: I'm not suggesting not to use LLMs.
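To make the "code verifies itself" point concrete, here is a from-scratch sketch of scaled dot-product attention, the core operation of the LLMs discussed throughout, in plain Python. This is an illustrative toy of our own, not code from either book, and the function names are our own choices.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating, for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Each query is compared against every key; the resulting weights
    (which sum to 1) mix the value vectors into one output per query.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# One query attending over two key/value pairs.
out = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]])
print(out)
```

Because the values here are one-hot, each output row is exactly the attention weights and must sum to 1, so running the code checks the math in a way reading alone cannot, which is precisely the point being made above.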
I suggest doing it in multiple passes: one pass just offline, in focus mode, and then after that, I mean, I also take notes, but I try to resist the urge to immediately look things up. I do a second pass. It's just more structured for me this way. Sometimes things are answered later in the chapter, and sometimes it just helps to let it sink in and think about it. Other people have different preferences. I would highly recommend using LLMs when reading books; for me, it's just not the first thing to do. It's the second pass. >> My recommendation is the opposite: I like to use the LLM at the beginning, >> to lay out the full context of what this world is that I'm now stepping into. But I try to avoid clicking out of the LLM into the world of Twitter and blogs, because then you're down a rabbit hole. You're reading somebody's opinion, there's a flame war about a particular topic, and all of a sudden you're in the realm of the internet and Reddit and so on. But if you're purely letting the LLM give you the context of why this matters and what the big-picture ideas are... Sometimes books themselves are good at doing that, but not always.
>> This is why I like the ChatGPT app: it gives the AI a home on your computer where you can focus on it, rather than it being just another tab in my mess of internet options. And I think Claude Code in particular does a good job of making that a joy, where it seems very engaging as a product, designed to be an interface through which your AI goes out into the world. Something very intangible between it and Codex is that Claude Code just feels warm and engaging, whereas Codex from OpenAI can often be as good, but it feels a little rougher around the edges. Claude Code makes it fun to build things, particularly from scratch, where you don't have to care about the details, but you trust that it'll make something. Obviously this is good for websites, refreshing tooling, and stuff like that, which I use it for, or data analysis. On my blog, we scrape Hugging Face; we keep the download numbers for every dataset and model over time now, so we have them. And Claude was just like, yeah, I've made use of that data, no problem. And I was like, that would have taken me days. And then I have enough situational awareness to be like, okay, these trends obviously make sense, and you can check things. That's just a wonderful interface, where you can have an intermediary and not have to do the kind of awful low-level work you would otherwise have to do to maintain different web projects. >> All right, so we just talked about a bunch of the closed-weight models. Let's talk about the open ones. So tell me about the landscape of open models. Which ones stand out to you, and why? We already mentioned DeepSeek. >> Do you want to see how many we can name off the top of our heads? >> Yeah, yeah. Without looking at notes. >> DeepSeek, Kimi, MiniMax, Z.AI, Ant Group's Ling. Are we just going Chinese? Let's throw in Mistral AI, Gemma.
>> Yeah, GPT-OSS, the open-weight model by OpenAI. Actually, Nvidia had a really cool one, Nemotron 3. There's a lot of stuff, especially at the end of the year. Qwen may be the one. >> Oh, yeah, Qwen was the obvious name I was trying to get to. You can get at least 10 Chinese and at least 10 Western ones. I mean, OpenAI released their first open model since GPT-2. When I was writing about OpenAI's open model release, they were all like, "Don't forget about GPT-2," which I thought was really funny, because it's just such a different time. But GPT-OSS is actually a very strong model and does some things that the other models don't do very well. And, selfishly, I'll promote a bunch of Western companies; both the US and Europe have these fully open models. I work at the Allen Institute for AI, which releases data and code and all of this, and now we have actual competition from people trying to release everything so that others can train these models. So there's the Institute of Foundation Models, or LLM360, which had their K2 models of various types. Apertus is a Swiss research consortium. Hugging Face has SmolLM, which is very popular. Nvidia's Nemotron has started releasing data as well. And then there's Stanford's Marin community project, which is making it so there's a pipeline for people to open a GitHub issue, implement a new idea, and then have it run in a stable language modeling stack. That list was way smaller in 2024; I think it was just Ai2. So it's a great thing for more people to get involved and to understand language models, and it doesn't really have an analog at a Chinese company.
While I'm talking, I'll say that the Chinese open language models tend to be much bigger, which gives them higher peak performance, while a lot of the models we like, whether Gemma or Nemotron, have tended to be smaller models from the US, which is starting to change in the US and Europe. Mistral Large 3 came out in December, a giant model very similar to the DeepSeek architecture, and then the startup Arcee AI and NVIDIA's Nemotron have both teased models way bigger than 100 billion parameters, in the 400-billion-parameter range, coming in this Q1 2026 timeline. So I think this balance is set to change this year in terms of what people use the Chinese versus US open models for, which I'm personally going to be very excited to watch. >> First of all, huge props for being able to name so many of these. Did you actually name Llama? >> Um, no. >> I feel like this was not on purpose. >> RIP Llama. >> Mhm. >> All right. What are some interesting models that stand out? You mentioned Qwen 3 is obviously a standout. >> I would say the year is almost bookended on one end by DeepSeek V3 and R1, and on the other, in December, by DeepSeek V3.2, because what I like about those is they always have an interesting architecture tweak that others don't have. Otherwise, if you want the familiar but with really good performance, Qwen 3, and, as Nathan said, gpt-oss. What's interesting about gpt-oss is that it's kind of the first open-weight model that was really trained with tool use in mind, which I do think is a bit of a paradigm shift the ecosystem was not quite ready for.
By tool use I mean that the LLM is able to do a web search or call a Python interpreter. It's a standout because it's a huge unlock: one of the most common complaints about LLMs is hallucination, and in my opinion one of the best ways to address hallucination is to not always try to remember information or make things up. For math, why not use a calculator app or Python? >> If I asked the LLM who won the soccer World Cup in 1998, instead of trying to recall it from memory, it could go do a search, usually still a Google search. So gpt-oss would do a tool call to Google, maybe find the FIFA website, and find that, okay, it was France. It would get you that information reliably instead of just trying to memorize it. So it's a huge unlock that right now is not fully utilized by the open-source, open-weight ecosystem. A lot of people don't use tool-call modes, and I think that's first of all a trust thing: you don't want to run this on your computer where it has access to tools that could wipe your hard drive or whatever, so you may want to containerize it. But I do think that ability is a really important step for the upcoming years. >> A few quick things. First of all, thank you for defining what you mean by tool use. I think that's a great thing to do in general for the concepts we're talking about, even things as well-established as mixture of experts: you have to say what it means and build up an intuition for people, how it's actually utilized, what the different flavors are. So what does it mean that there's such an explosion of open models? What's your intuition? >> If you're releasing an open model, you want people to use it, first and foremost. And then after that come things like transparency and trust.
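The tool-use loop Sebastian describes earlier, where the model emits a structured call, a runtime executes it, and the result goes back into the context, can be sketched in a few lines of Python. Everything here is hypothetical scaffolding: the `calculator` tool, the `TOOLS` registry, and the dict-shaped model output are illustrative, not any real model's or runtime's API.

```python
def calculator(expression: str) -> str:
    """A 'calculator' tool: safer than asking the LLM to do arithmetic itself."""
    # eval() is used only for this sketch; a real runtime would sandbox this.
    return str(eval(expression, {"__builtins__": {}}))

# Hypothetical tool registry the runtime exposes to the model.
TOOLS = {"calculator": calculator}

def run_turn(model_output: dict) -> str:
    """Dispatch one model turn: either plain text or a tool call."""
    if model_output.get("tool"):
        name = model_output["tool"]
        result = TOOLS[name](model_output["arguments"])
        # In a real loop, this result would be appended to the conversation
        # and the model called again; here we just surface it.
        return f"[tool:{name}] {result}"
    return model_output["text"]

# The model, asked for some arithmetic, emits a tool call rather than guessing:
print(run_turn({"tool": "calculator", "arguments": "1998 * 12"}))
# -> [tool:calculator] 23976
```

The trust concern from the conversation shows up directly here: `TOOLS` is the whole attack surface, which is why people containerize tool execution.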
I think when you look at China, the biggest reason is that they want people around the world to use these models. Outside of the US, a lot of people will not pay for software, but they might have computing resources where you can put a model and run it. There can also be data that you don't want to send to the cloud. So the number one thing is getting people to use your AI who might not be able to without access to the model. >> I guess we should state explicitly: these Chinese models, and open-weight models in general, are often run locally. So it's not like you're sending your data to China, or to Silicon Valley, or to whoever developed the model. >> A lot of American startups make money by hosting these models from China and selling tokens, which means somebody will call the model to do some piece of work.
I think the other reason is that US companies like OpenAI are so GPU-deprived. They're at the limits of their GPUs; whenever they make a release, they're always talking about how their GPUs are hurting. In one of these gpt-oss release sessions, Sam Altman said something like, we're releasing this because then you can use your GPUs and we don't have to use ours, and OpenAI still gets distribution out of it, which is another very real thing, because it doesn't cost them anything. >> And for the user, there are people who just use the model locally the way they would use ChatGPT, but for companies it's also a huge unlock to have these models, because you can customize them, you can train them, you can add post-training, add more data, specialize them into, say, law or medical models, whatever you have. And the appeal of the open-weight models from China, you mentioned Llama, is that their licenses are even friendlier. They are mostly unrestricted open-source licenses, whereas if you use something like Llama or Gemma, there are some strings attached. I think there's an upper limit on how many users you can have, and if you exceed so-and-so many million users, you have to go back to, say, Meta, something like that. It is a free model, but there are strings attached, and people like things where no strings are attached. So besides performance, that's one of the reasons the open-weight models from China are so popular: you can just use them. There's no catch in that sense. >> Yeah, the ecosystem has gotten better on that front, but mostly downstream of these new providers offering such open licenses. It was funny when you pulled up Perplexity.
It said Kimi K2 Thinking, hosted in the US. I've never seen that before, but it's an exact example of what we're talking about: people are sensitive to this. Kimi K2 and Kimi K2 Thinking are models that are very popular; people say they're very good at creative writing and also at some software tasks. There are these little quirks people pick up on with different models that they like. >> What are some interesting ideas some of these models have explored that are particularly interesting to you? >> Maybe we can go chronologically. There was of course DeepSeek R1, which came out in January, if we just focus on 2025. It was based on DeepSeek V3, which came out the year before, in December 2024. There are multiple things on the architecture side. What's fascinating, and this is what I do in my from-scratch coding projects, is that you can still start with GPT-2 and add things to that model to turn it into these other models. It's all still the same lineage; there's a very close relationship between them. Off the top of my head, what was notable in DeepSeek was the mixture of experts, though they didn't invent mixture of experts; we can talk more about what that means. But just to list these things before diving into detail: mixture of experts, and then multi-head latent attention, which is a tweak to the attention mechanism. I would say in 2025 the main distinguishing factor between these open-weight models was different tweaks to make inference cheaper, to shrink the KV cache size, and we can define the KV cache in a few moments, so that long context becomes more economical. So what are the tweaks we can do? Most of them focused on the attention mechanism. There is multi-head latent attention in DeepSeek.
There is grouped-query attention, which is still very popular. It wasn't invented by any of these models; it goes back a few years, but that would be the other option. And there's sliding-window attention, which I think Olmo uses, if I remember correctly. So there are these different tweaks that make the models different. Otherwise, I once put them all together in an article where I just compared them, and they are surprisingly similar. It's just different numbers: how many repetitions of the transformer block you have in the center, little knobs that people tune. But what's so nice about it is that it works no matter what. You can tweak things, you can move the normalization layers around, you get some performance gains. And Olmo is always very good with ablation studies showing what moving something around actually does to the model, whether it makes it better or worse. There are so many ways you can implement a transformer and still make it work. The big ideas that are still prevalent are mixture of experts, multi-head latent attention, sliding-window attention, and grouped-query attention. Then at the end of the year we saw a focus on making the attention mechanism scale linearly for inference. There was Qwen3-Next, for example, which added a gated DeltaNet. It's inspired by state-space models, where you have a fixed-size state that you keep updating, and it essentially makes attention cheaper, or replaces attention with a cheaper operation. >> Maybe it's useful to step back and talk about the transformer architecture in general. >> Yeah. Maybe we should start with the GPT-2 architecture, the transformer that was derived from the "Attention Is All You Need" paper. >> Mhm. >> So the "Attention Is All You Need" paper had a transformer architecture with two parts, an encoder and a decoder, and GPT focuses just on the decoder part.
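The KV-cache motivation behind tweaks like grouped-query attention can be made concrete with back-of-the-envelope arithmetic: the cache stores one key and one value vector per layer per token, so sharing K/V heads across query heads shrinks it proportionally. The model dimensions below are illustrative (loosely 7B-class), not numbers from the episode.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_val=2):
    """Size of the KV cache for one sequence.

    Factor of 2 covers keys AND values; bytes_per_val=2 assumes fp16/bf16.
    """
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_val

layers, head_dim, context = 32, 128, 32_768  # illustrative sizes

# Standard multi-head attention: one KV head per query head (32 of them).
mha = kv_cache_bytes(layers, kv_heads=32, head_dim=head_dim, context_len=context)
# Grouped-query attention: 8 KV heads shared by the 32 query heads.
gqa = kv_cache_bytes(layers, kv_heads=8, head_dim=head_dim, context_len=context)

print(f"MHA cache: {mha / 1e9:.1f} GB")  # -> MHA cache: 17.2 GB
print(f"GQA cache: {gqa / 1e9:.1f} GB")  # -> GQA cache: 4.3 GB
```

Multi-head latent attention goes further by caching a compressed latent instead of full K/V vectors, but the accounting idea is the same: fewer bytes per token means cheaper long context.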
It is essentially still a neural network, with this attention mechanism inside, and you predict one token at a time. You pass the input through an embedding layer, then the transformer block; the transformer block has attention modules and a fully connected layer, with some normalization layers in between, but it's essentially neural network layers plus this attention mechanism. Moving from GPT-2 to gpt-oss, there is for example the mixture-of-experts layer. It wasn't invented by gpt-oss, it's a few years old, but it is essentially a tweak that makes the model larger without consuming more compute in each forward pass. There's this fully connected layer, and if listeners are familiar with multi-layer perceptrons, you can think of a mini multi-layer perceptron, a fully connected neural network, inside the transformer. It's very expensive precisely because it's fully connected: if you have a thousand inputs and a thousand outputs, that's a million connections, and it's a very expensive part of the transformer. The idea is to expand that into multiple feed-forward networks. Instead of having one, let's say you have 256. That alone would make it way more expensive, because now you have 256 of them, but you don't use all of them at the same time. You now have a router that says, okay, based on this input token, it would be useful to use this particular fully connected network, and in that context, it's called an expert. So mixture of experts means you have multiple experts, and depending on what your input is, say it's more math-heavy, it will use different experts than, say, translating text from English to Spanish, where it would consult different ones. It's not as clear-cut as saying this one is only an expert for math and that one for Spanish; it's a bit fuzzier. But the idea is essentially that you pack more knowledge into the network without using all of it all the time.
That would be very wasteful. So during token generation you're more selective: there's a router that selects which tokens go to which expert. It adds complexity and it's harder to train; a lot can go wrong, like router collapse. I think that's why Olmo 3 still uses a dense design. And that's more jargon: there's a distinction between dense and sparse models. Mixture of experts is considered sparse, because you have a lot of experts but only a few of them are active. Dense is the opposite, where you only have one fully connected module and it's always utilized. So maybe this is a good place to also talk about the KV cache. >> Actually, before that, zooming out: fundamentally, how many new ideas have been implemented from GPT-2 to today? How different really are these architectures? >> Take gpt-oss: there's the mixture of experts, and the attention mechanism, which is grouped-query attention, a slight tweak from multi-head attention. So there we have two. I think they replaced LayerNorm with RMSNorm, but that's just a different normalization layer; not a big change, just a tweak. Then the nonlinear activation function: for people familiar with deep neural networks, it's like swapping sigmoid for ReLU. It doesn't change the network fundamentally; it's a little tweak. And that's about it. I would say it's not really fundamentally different. It's still the same architecture, so you can go from one to the other just by adding these changes. >> So fundamentally it's still the same architecture. >> Yep.
For example, you mentioned my book earlier. That's a GPT-2 model in the book, because it's simple and very small, about 124 million parameters. But in the bonus materials I do have Qwen 3 from scratch, Gemma 3 from scratch, and other from-scratch models, and I always started with my GPT-2 model and just tweaked it, well, added different components, and you get from one to the other. It's kind of like a lineage in a sense. >> Yeah. Can you build up an intuition for people? When you zoom out, there's so much rapid advancement in the AI world, and at the same time, fundamentally, the architectures have not changed. So where is all the turbulence, the turmoil of advancement, happening? Where are the gains to be had? >> There are different stages in which you develop, or train, the network. Back in the day, with GPT-2, it was just pre-training. Now you have pre-training, mid-training, and post-training, and I think right now we are in the post-training-focused stage. Pre-training still gives you advantages if you scale it up with better, higher-quality data. But then we have capability unlocks that were not there with GPT-2. For example, ChatGPT is basically a GPT-3 model, and GPT-3 is the same as GPT-2 in terms of architecture. What was new was adding supervised fine-tuning and reinforcement learning from human feedback. So it's more on the algorithmic side than the architecture. >> I would say the systems also change a lot. If you listen to NVIDIA's announcements, they talk about things like: you can now do FP8, you can now do FP4. What's happening is these labs are figuring out how to utilize more compute to put into one model, which lets them train faster, which lets them put more data in. And then you can find better configurations faster by doing this.
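The FP8/FP4 point reduces to simple arithmetic: fewer bytes per parameter means less memory and less communication volume for the same model, which is where training-throughput gains come from. A minimal sketch, where the 20-billion-parameter count is an illustrative assumption, not a figure from the episode:

```python
def weight_bytes(params_billion, bytes_per_param):
    """Total bytes needed to store the model weights at a given precision."""
    return params_billion * 1e9 * bytes_per_param

params_b = 20  # illustrative: a 20B-parameter model

bf16 = weight_bytes(params_b, 2)  # 16-bit formats: 2 bytes per parameter
fp8 = weight_bytes(params_b, 1)   # 8-bit formats:  1 byte per parameter

print(f"BF16 weights: {bf16 / 1e9:.0f} GB")  # -> BF16 weights: 40 GB
print(f"FP8  weights: {fp8 / 1e9:.0f} GB")   # -> FP8  weights: 20 GB
```

Halving the bytes moved per parameter halves what must be read, written, and communicated between GPUs each step, which is the mechanism behind the tokens-per-second improvements discussed next; actual speedups are smaller than 2x because not every part of training runs at the reduced precision.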
So tokens per second per GPU is essentially the metric you look at when you're doing large-scale training, and you can go from, say, 10K to 13K by turning on FP8 training, which means you're using less memory per parameter in the model, and by storing less information you do less communication and can train faster. All of these systems things underpin way faster experimentation on data and algorithms. It's a loop that keeps going, which is kind of hard to describe when you look at the architectures and they're exactly the same, but the codebase used to train these models is vastly different. >> The GPUs are different too, but you could probably train gpt-oss-20b way faster in wall-clock time than GPT-2 was trained at the time. >> Yeah, like you said, they had, for example, in the mixture of experts this NVFP4 optimization, where you get more throughput. For speed, this is true, but it doesn't giv