State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, GPUs, AGI | Lex Fridman Podcast #490
EV7WhVT270Q • 2026-01-31
Kind: captions
Language: en
The following is a conversation all
about the state-of-the-art in artificial
intelligence, including some of the
exciting technical breakthroughs and
developments in AI that happened over
the past year and some of the
interesting things we think might happen
this upcoming year. At times it does get
super technical, but we do try to make
sure that it remains accessible to folks
outside the field without ever dumbing
it down. It is a great honor and
pleasure to be able to do this kind of
episode with two of my favorite people
in the AI community, Sebastian Rashka
and Nathan Lambert. They are both widely
respected machine learning researchers
and engineers who also happen to be
great communicators, educators, writers,
and X posters.
Sebastian is the author of two books I
highly recommend for beginners and
experts alike: Build a Large Language
Model (From Scratch) and Build a
Reasoning Model (From Scratch). I truly
believe in the machine learning computer
science world the best way to learn and
understand something is to build it
yourself from scratch.
Nathan is the post-training lead at the
Allen Institute for AI and author of the
definitive book on reinforcement
learning from human feedback.
Both of them have great X accounts and
great Substacks, Sebastian has courses
on YouTube, Nathan has a podcast, and
everyone should absolutely follow all of
those. This is the Lex Fridman podcast.
To support it, please check out our
sponsors in the description where you
can also find links to contact me, ask
questions, get feedback, and so on. And
now, dear friends, here's Sebastian
Raschka and Nathan Lambert. So, I think
uh one useful lens to look at all of
this through is the so-called
DeepSeek moment. This happened about a
year ago, in January 2025, when the
open-weight Chinese company DeepSeek
released DeepSeek R1, which I think
it's fair to say surprised everyone with
near or at state-of-the-art
performance, with allegedly much less
compute, for much cheaper. From then
to today, the AI competition has gotten
insane, both at the research level and
the product level; it's just been
accelerating. Let's discuss all this
today and maybe let's start with some
spicy questions if we can. Uh who's
winning at the international level?
Would you say it's the set of companies
in China or the set of companies in the
United States? And Sebastian, Nathan,
it's good to see you guys. Uh so
Sebastian, who do you think is winning?
>> Um, so winning is a very broad, you
know, term. I would say, you mentioned
the DeepSeek moment, and I do think
DeepSeek is definitely winning the
hearts of the people who work on
open-weight models, because they share
these as open models. Winning, I think,
has multiple time scales to it: we have
today, we have next year, we have in 10
years. One thing I know for sure is that
I don't think, nowadays, in 2026, that
there will be any company who is, let's
say, having access to a technology that
no other company has access to. And that
is mainly because researchers are
frequently changing jobs, changing labs,
they rotate. So I don't think there will
be a clear winner in terms of technology
access. However, I do think the
differentiating factor will be
budget and hardware constraints. So I
don't think the ideas will be
proprietary, but the resources
that are needed to implement them will
be. And so I don't currently see a
winner-takes-all scenario; I
can't see that at the moment.
>> Uh, Nathan, what do you think? You see
the labs put different energy into what
they're trying to do and I think to
demarcate the point in time when we're
recording this. Um, the hype over
Anthropic's Claude Opus 4.5 model has been
absolutely insane which is just I mean
I've used it and built stuff in the last
few weeks and it's it's almost gotten to
the point where it feels like a bit of a
meme in terms of the hype. And it's kind
of funny because this is very organic
and then, if we go back a few months,
we can get the release date in the notes:
Gemini 3 from Google got released, and
it seemed like the
marketing and just like wow factor of
that release was super high. But then at
the end of November, Claude Opus 4.5 was
released and the hype has been growing.
But Gemini 3 was before this. And it
kind of feels like people don't really
talk about it as much. Even though when
it came out, everybody was like, "This
is um Gemini's moment to retake kind of
Google's structural advantages in AI."
And Gemini 3 is a fantastic model and I
still use it. It's just that its
differentiation is lower. And I agree
with Sebastian what you're saying with
all these like the idea space is very
fluid, but culturally Anthropic is
known for betting very hard on code,
and the Claude Code thing is working out
for them right now. So I think that even
if the ideas flow pretty freely, so much
of this is bottlenecked by human effort
and the culture of organizations,
where Anthropic seems to at least be
presenting as the least chaotic, which is
a bit of an advantage, if they can keep
doing that for a while. But on the other
side of things, there's a lot of ominous
technology from China, where there are
way more labs than DeepSeek. So DeepSeek
kicked off a movement within China,
I'd say kind of similar to how ChatGPT
kicked off a movement in the US where
everything had a chatbot. There's now
tons of tech companies in China that are
releasing very strong frontier
openweight models to the point where I
would say that Deep Seek is kind of
losing its crown as the preeminent open
model maker in China, and the likes of
Z.ai with their GLM models, MiniMax's
models, and Moonshot's Kimi, especially
in the last few months, have shone more
brightly. The new DeepSeek models are
still very strong, but it could look
back as a big narrative point, where, in
2025, DeepSeek came and provided this
platform for way more Chinese companies
to release these fantastic
models and have this new type of
operation. So these models from these
Chinese companies are open weights and
depending on the trajectory of business
models that these American companies are
doing could be at risk. But currently a
lot of people are paying for AI software
in the US and historically in China and
other parts of the world people don't
pay a lot for software.
>> So some of these models, like DeepSeek,
have the love of the people because they
are open weight. How long do you think
the Chinese companies keep releasing
open weight models?
>> I would say for a few years I think that
like in the US there's not a clear
business model for it. I have been
writing about open models for a while
and these Chinese companies have
realized it. So I get inbound from some
of them and they're smart and realize
the same constraints which is that a lot
of US tech companies and other IT
companies won't pay for an API
subscription to Chinese companies for
security concerns. This has been a
long-standing
um habit in tech and the people at these
companies then see openweight models as
an ability to influence and take part of
a huge growing AI expenditure market in
the US. And they're very realistic about
this and it's working for them and I
think that the government will see that
that is building a lot of influence
internationally in terms of uptake of
the technology. So there's going to be a
lot of incentives to keep it going but
building these models and doing the
research is very expensive. So at some
point I expect consolidation but I don't
expect that to be a story of 2026 where
there will be more open model builders
throughout 2026 than there were in 2025
and a lot of the notable ones will be in
China. You were going to say something?
>> Um, yes. You mentioned DeepSeek losing
its crown. I do think to some extent
yes, but we also have to consider that
they are still, I would say, slightly
ahead, and it's not that
DeepSeek got worse; it's just that the
other ones are using the ideas from
DeepSeek. For example, you mentioned
Kimi: same architecture, and they're
training it. And then again, we have
this leapfrogging, where they might be,
at some point in time, a bit better,
because they have the more recent model.
And I think this comes back to the fact
that there won't be a clear winner. It
will just be like that: one person
releases something, the other one comes
in. And the most recent
model is probably always the best model.
>> Yeah. We'll also see the Chinese
companies have different incentives. So
like DeepSeek is very secretive, where
some of these startups, the
MiniMaxes and Z.ai of the world, those
two literally have filed IPO paperwork,
and they're trying to get Western
mindshare and do a lot of outreach there.
So I don't know if these incentives will
kind of change the model development, cuz
DeepSeek famously is built by a hedge
fund, High-Flyer Capital, and we don't
know exactly what they're like. We don't
know what they use the models for, or if
they care about this.
>> They're secretive in terms of
communication, but they're not secret in
terms of the technical reports that
describe how their models work. They're
still open on that front. And we should
also say, on the Opus 4.5 hype, there's
the layer of something
being the darling of the X (or Twitter)
echo chamber, versus the actual
number of people using the
model. I think it's probably fair to say
that ChatGPT and Gemini are focused on the
broad user base that just want to solve
problems in their daily lives and that
user base is gigantic. So the hype about
the coding may not represent the
actual use. I would say also a lot of
the usage patterns are, like you said,
name recognition, brand, and stuff,
but also muscle memory, almost, where,
you know, ChatGPT has been around for
a long time, people just got used to
using it, and it's almost
like a flywheel: they recommend it to
other users. One
interesting point is also the
customization of LLMs. For example, ChatGPT
has a memory feature, right? And so you
may have a subscription and you use it
for personal stuff but I don't know if
you want to use that same thing at work,
you know, because that's a boundary
between private and work. If you're
working at a company, they might not
allow that or you may not want that. And
I think that's also an interesting point
where you might have multiple
subscriptions. One is just clean:
it keeps nothing of your
personal images or hobby
projects in there. It's just the
work thing, and then the other one is
your personal thing. So I think that's
also something where there are two
different use cases, and it doesn't mean
you only have to have one. I think the
future is also multiple ones.
>> What model do you think won 2025 and
what model do you think is going to win
26? I think, in the context of consumer
chat bots, it's a question of: are you
willing to bet on Gemini over ChatGPT?
Which, I would say, in my gut feels like
a bit of a risky bet, because OpenAI has
been the incumbent, and there are so many
benefits to that in tech. But I think the
momentum, if you look at 2025, was on
Gemini's side, though they were starting
from such a low point. I think of RIP
Bard and these earlier attempts at
getting started.
Huge credit to them for powering
through the organizational chaos to make
that happen. But also, it's hard to bet
against OpenAI because they always come
off as so chaotic, but they're very good
at landing things. And I think like
personally, I have very mixed reviews of
GPT-5, but it had to have saved them so
much money, with the headline feature
being a router, where most users no
longer incur as much GPU
cost. So I think it's very hard
to dissociate the things that I like out
of models versus the things that are
going to actually be a general public
differentiator.
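The router idea mentioned here can be sketched in a few lines. The heuristics and backend names below are invented for illustration; this is not OpenAI's actual routing logic, just the general shape of dispatching cheap queries to a fast model and hard ones to a reasoning model:

```python
# Hypothetical sketch of a query router in front of two model tiers.
# HARD_HINTS and the backend names are illustrative assumptions.

HARD_HINTS = ("prove", "debug", "derive", "step by step")

def route(query: str) -> str:
    """Pick a backend: long or 'hard' queries get the reasoning tier."""
    q = query.lower()
    if len(q.split()) > 40 or any(hint in q for hint in HARD_HINTS):
        return "reasoning-model"  # slower, pricier, more GPU time
    return "fast-model"           # cheap default for most traffic

print(route("capital of France?"))                      # fast-model
print(route("debug this race condition step by step"))  # reasoning-model
```

The cost saving comes from the default branch: if most traffic is simple, most requests never touch the expensive tier.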
>> What do you think about 2026? Who's
going to win?
>> I'll say something even though it's
risky. I will say that I think Gemini
will continue to gain ground on ChatGPT.
I think, with Google's scale, when both of
these are operating at such extreme
scales, Google has the ability
to separate research and product a
bit better, whereas you hear so much about
OpenAI being chaotic operationally and
chasing the high-impact thing, which is a
very startup culture. And then, on the
software and enterprise side, I think
Anthropic will have continued success,
as they've again and again been set up
for that and obviously Google's cloud
has a lot of offerings but I think this
kind of like Gemini name brand is
important for them to build and and
Google's cloud will continue to do well,
but that's kind of a more complex thing
to explain in the ecosystem because
that's competing with the likes of Azure
and AWS rather than on the model
provider side. So, infrastructure: you
think TPUs give an advantage?
>> largely because the margin on Nvidia
chips is insane and Google can develop
everything from top to bottom to fit
their stack and not have to pay this
margin and they've had a head start in
building data centers. So all of these
things that have both high lead times
and very high margins on high costs,
Google has just kind of a historical
advantage there. And if there's going to
be a new paradigm, it's most likely to
come from OpenAI, where their
research division, again and again,
has shown this ability to land a
new research idea or a product. I think,
like, Deep Research, Sora, o1 thinking
models; all of these definitional
things have come from OpenAI and that's
got to be one of their top traits as an
organization. So it's kind of hard to
bet against that. But I think a lot of
this year will be about scale and
optimizing what could be described as
low-hanging fruit in models.
>> And clearly there's a trade-off between
intelligence and speed. This was what
GPT-5
was trying to solve behind the scenes:
do people, the broad public, actually
want intelligence, or do they
want speed? I think it's a nice variety,
actually, or the option to have a toggle
there. I mean, first, for my personal
usage: most of the time, when I look
something up, I use ChatGPT to ask a
quick question and get the information.
I want it fast. For, you know, most
daily tasks, I use the quick model
nowadays. I think the auto mode is
pretty good, where you don't have to
specifically say thinking or, you know,
non-thinking and stuff. Then again, I
also sometimes want the pro mode. Very
often, what I do is, when I have
something written, I put it into ChatGPT
and say, hey, do a very thorough check:
are all my references correct, are all
my thoughts correct? Uh, did I make any
formatting mistakes? Are the figure
numbers wrong, or something like that?
And I don't need that right away. It's
something, okay, I finish my stuff,
maybe have dinner, let it run, come back
and go through this. And I think, see,
this is where I think it's important to
have this option. I would go crazy if
for each query I would have to wait 30
minutes or 10 minutes.
>> That's me.
>> Yeah.
>> Um, I'm, like, sitting over here losing
my mind that you use the router and the
non-thinking model. I'm like, "Oh, how
do you live with
that?" That's my reaction.
I've been heavily on ChatGPT for a while.
Um, never touched 5 non-thinking. I
find its tone, and then its propensity
for errors; it just has a higher
likelihood of errors. Some of this is
from back when OpenAI released o3, which
was the first model to do this deep
search and find many sources and
integrate them for you. So I became
habituated with that. So I will only
use GPT-5.2 Thinking or Pro when I'm
running any sort of information query
for work, whether that's a paper or some
code reference that I found, and
I will regularly have, like,
five Pro queries going simultaneously,
each looking for one specific paper or
feedback on an equation or something. I
have a funny example, where I just needed
an answer as fast as possible, for this
podcast, before I was going on a trip.
Um, I have, like, a local GPU running at
home, and I wanted to run a long RL
experiment. And usually I also unplug
things, because you never know; if you're
not at home, you don't want to have
things plugged in. And I accidentally
unplugged the GPU. My wife
was already in the car, and I was like,
"Oh, dang." And then basically I wanted,
as fast as possible, a bash script that
runs my different experiments and
the evaluation. And, I know,
I learned how to use the bash
terminal, but in that
moment I just needed, like, 10 seconds:
give me the command.
>> This is a hilarious situation. But yeah,
so what did you use? So, I did the
non-thinking, fastest model. It gave me
the bash command to chain different
scripts to each other, and then the thing
is, like, you have the tee thing, where
you want to route this to a log file. Off
the top of my head, I was just in a hurry.
I could have thought about it myself.
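For reference, the pattern described here (chaining scripts and mirroring everything to a log file with tee) can be sketched like this; the echoed steps are placeholders, not Sebastian's actual scripts:

```shell
#!/bin/sh
# Minimal sketch: run experiment steps back to back and duplicate all
# output (stdout and stderr) into a log file via tee.
# The echo lines are placeholders for real commands like `sh train.sh`.
LOG="experiments.log"

{
  echo "[1/2] training run..."   # placeholder for: sh train_rl.sh
  echo "[2/2] evaluation..."     # placeholder for: sh evaluate.sh
} 2>&1 | tee "$LOG"
```

The `2>&1` folds stderr into stdout before the pipe, so errors land in the log too, and `tee` still prints everything live to the terminal.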
>> By the way, I don't know if that's a
representative case: wife waiting in the
car, you have to run, you know, plug in
the GPU, you have to generate a bash
script. It sounds like a movie, like
Mission Impossible.
>> I use Gemini for that. So I use Thinking
for all the information stuff, and then
Gemini for fast things, or stuff that I
could sometimes Google; it's
good at explaining things, and I
trust that it has this kind of
background knowledge, and it's simple,
and the Gemini app has gotten a lot
better, and it's good for that sort of
thing,
and then, for code and any sort of
philosophical discussion, I use Claude
Opus 4.5, always with extended
thinking. Extended thinking and inference-time
scaling are just a way to make the
models marginally smarter, and I will
always edge on that side when the
progress is very high, because you don't
know when that'll unlock a new use case.
And then I sometimes use Grok for
real-time information, or finding
something on AI Twitter that I know I
saw and need to dig up. Although, when
Grok 4 came out, Grok 4 Heavy, which
was, like, their pro variant, was
actually very good, and I was pretty
impressed with it, and I just kind of,
like, muscle memory, lost track of it,
with having the ChatGPT app open. So I
use many different
things. Yeah, I actually do use Grok 4
Heavy for debugging, for, like, hardcore
debugging that the other ones can't
solve. I find that it's the best at that.
And it's interesting, because you say
ChatGPT is the best interface, for that
same reason, but this could be just
momentum. Uh, Gemini
is the better interface for me, I think,
because I fell in love with its
needle-in-the-haystack ability. If I ever
put in something that has a lot of
context, but I'm looking for very
specific kinds of information, and want
to make sure it tracks all of it, I find
that, for me, Gemini
has been the best. So, it's funny
with some of these models: if they win
your heart over for one particular
feature on one particular day,
>> for that particular query, that prompt,
you're like, "This model is better." And
so you'll just stick with it for a bit,
until it does something really dumb.
There's, like, a threshold effect: some
smart thing, and then you fall in love
with it, and then it does some dumb
thing, and you're like, you know what,
I'm going to switch and try Claude and
try GPT and all that kind of stuff.
>> This is exactly it: you use it until it
breaks, until you have a problem, and
then you change the LLM. And I think
it's the same as how we use anything,
like our favorite text editor, operating
systems, or the browser. I mean, there
are so many browser options, Safari,
Firefox, Chrome, all relatively similar,
but then there are edge cases, maybe
extensions you want to use, and then you
switch. But I don't think there is
anyone who types the same thing, like a
website, into different browsers and
compares them. You only do that when the
website doesn't render, if something
breaks, I think. So that's a good
point: you use it until it
breaks, and then you explore other
options. I think
>> On the long-context thing, I was also a
Gemini user for this, but the GPT-5.2
release blog had, like, crazy long-context
scores, where a lot of people were like,
did they just figure out some
algorithmic change? It went from, like,
30% to, like, 70% or something in this
minor model update. So it's also very
hard to keep track of all of these
things. But now I look more favorably
at GPT-5.2's long context. So it's just
kind of, like, how do I actually get to
testing this
never-ending battle. It's interesting
that none of us talked about the Chinese
models from a user-usage perspective.
What does that say? Does that mean the
Chinese models are not as good, or does
that mean we're just very biased and
US-focused? I do think that it's
currently the discrepancy between just
the model and the platform. So I think
the open models are more known for
the open weights, not their platform,
yet.
>> There are also a lot of companies that
are willing to sell you open-model
inference at a very low cost. I think,
like, OpenRouter makes it easy to do the
multi-model thing; you could run
DeepSeek on Perplexity. I think all of
us sitting here use OpenAI's
GPT-5 Pro consistently; we're all
willing to pay for the marginal
intelligence gain. And in
terms of the outputs, I think the
question is: will they stay better for
this year and for years going forward?
But it's like, so long as they're
better, I'm going to pay to use them. I
think there's also analysis that shows
that the way the Chinese models are
served, you could argue due to export
controls or not, is that they use fewer
GPUs per replica, which makes them
slower and gives them different errors.
And it's, like, speed and
intelligence: if these things are in
your favor as a user, I think, in the
US, a lot of users will go for this. And
I think that that is one thing that will
spur these Chinese companies to want to
compete in other ways, whether it's,
like, free or substantially lower costs,
or it'll breed creativity in terms of
offerings, which is good for the
ecosystem. But I just think the simple
thing is: US models are currently
better, and we use them. I try these
other open models, and I'm like, fun,
but I don't go back to them. Uh, we
didn't
really mention programming. That's
another use case that a lot of people
deeply care about. So, I use basically
half-and-half Cursor and Claude Code,
because I find them to be, like,
fundamentally different experiences, and
both useful. Uh, what do you guys use?
You program quite a bit. So, what do
you use? What's the current vibe?
>> So, I use the Codex plugin for VS Code.
You know, it's very convenient. It's
just, like, a plugin, and then it's a
chat interface that has access to your
repository. I know that Claude Code is,
I think, a bit different. It is a bit
more agentic. It touches more things. It
does a whole project for you. I'm not
quite there yet, where I'm comfortable
with that, because, uh, maybe I'm a
control freak, but I still would like to
see a bit of what's going on. And Codex
is, kind of, right now, for me, the
sweet spot, where it is helping me, but
it is not taking over completely. I should
mention one of the reasons I do use
Claude Code is to build the skill of
programming with English. I mean, the
experience is fundamentally different.
As opposed to micromanaging the
details of the process of the generation
of the code, and looking at the diff,
which you can in Cursor, if that's the
IDE you use, and then changing,
altering, looking at and reading the
code and understanding the code deeply
as you progress, versus just kind of
thinking in this design space and just
guiding it at this macro level, which
I think is another way of thinking
about the programming process. Also, we
should say that Claude Code just
seems to be somehow a better utilization
of Claude Opus 4.5.
>> It's a good side-by-side for people to
do. You can have Claude Code open,
you can have Cursor open, you can have
VS Code open, and you can select the
same models on all of them and ask
questions. It's very interesting:
Claude Code is way better in that
domain. It's remarkable. All right, we
should say that both of you are legit on
multiple fronts: researchers,
programmers, educators, tweeters,
and on the book front, too. So, Nathan
at some point soon hopefully has an RLHF
book coming out.
>> It's available for pre-order, and
there's a full digital preprint. I'm just
making it pretty and better organized
for the physical thing, which is a lot
of why I do it, because it's fun to
create things that you think are
excellent in physical form, when so
much of our life is digital. I should
say, going to Perplexity here: Sebastian
Raschka is a machine learning researcher
and author known for several influential
books. A couple of them I wanted to
mention, which I highly
recommend: Build a Large Language Model
(From Scratch) and the new one, Build a
Reasoning Model (From Scratch). So, I'm
really excited about that. Building
stuff from scratch is one of the most
powerful ways of learning.
>> Honestly, building an LLM from
scratch is a lot of fun. It's also a lot
to learn. And, like you said, it's
probably the best way to learn how
something really works, cuz you can look
at figures, but figures can have
mistakes. You can look at concepts and
explanations, but you might
misunderstand them. But if you see
there is code, and the code works, you
know it's correct. I mean, there's no
misunderstanding. It's precise,
otherwise it wouldn't work. And I think
that's kind of the beauty
behind coding: it
doesn't lie. It's math, basically.
Even with math, I think, you can
have mistakes in a book you would never
notice, because you're not running the
math when you're reading the book; you
can't verify it. And with code, what's
nice is you can verify it.
>> Yeah, I agree with you about the LLM From
Scratch book. It's nice to tune out
everything else, the internet and so on,
and just focus on the book. But, you
know, I read several, like, you know,
history books. It's just less lonely
somehow. It's really more fun. Like,
for example, on the programming front, I
think it's genuinely more fun to program
with an LLM.
>> And I think it's genuinely more fun to
read with an LLM,
>> But you're right, this distraction
should be minimized. So you use
the LLM to basically enrich the
experience, maybe add more context.
Maybe it's just that the rate of aha
moments for me, on a small scale, is
really high with an LLM. 100%. I would
also want to correct myself: I'm not
suggesting not to use LLMs. I suggest
doing it in multiple passes, like one
pass just offline, focus mode, and then
after that, I mean, I also take notes,
but I try to resist the urge to
immediately look things up. I do a
second pass. It's just, for me, more
structured this way. I mean, sometimes
things are answered in the chapter, but
sometimes it also just helps to let it
sink in and think about it. Other people
have different preferences. I would
highly recommend using LLMs when reading
books. For me, it's just not the
first thing to do. It's, like, the
second pass.
>> By way of recommendation, I'll say I do
the opposite. I like to use the LLM at
the beginning,
>> to lay out the full context of, like,
what is this world that I'm now stepping
into. But I try to avoid clicking out of
the LLM into the world of, like, Twitter
and blogs, because then you're
down this rabbit hole: you're reading
somebody's opinion, there's a flame war
about a particular topic, and all of a
sudden you're now in
the realm of the internet and
Reddit and so on. But if you're purely
letting the LLM give you the context of
why this matters, what are the
big-picture ideas... Sometimes books
themselves are good at doing that, but
not always. So
>> this is why I like the ChatGPT app,
because it gives the AI a home on your
computer, where you can focus on
it, rather than it just being another
tab in my mess of internet options. And
I think Claude Code in particular does a
good job of making that a joy, where it
seems very engaging as a product,
designed to be an interface from which
your AI will then go out into the world.
Something that is very kind of
intangible between it and Codex is that
it just feels kind of warm and engaging,
where Codex can often be as good, from
OpenAI, but it just kind of feels a
little bit rougher on the edges. Whereas
Claude Code makes it fun to build
things, particularly from scratch, where
you just don't have to
care, but you trust that it'll make
something. Obviously this is good
for websites and kind of refreshing
tooling and stuff like this, which I use
it for, or data analysis. So, for my
blog, we scrape Hugging Face; we keep
the download numbers for every dataset
and model over time now, so we have
them, and Claude was just like, yeah,
I've made use of that data, no problem.
And I was like, that would have taken me
days. And then I have enough
situational awareness to be like, okay,
these trends obviously make sense, and
you can check things. But that's just a
kind of wonderful interface, where you
can have an intermediary and not have to
do the kind of awful low-level work that
you would have to do to maintain
different web projects and do this
stuff.
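The kind of analysis Nathan describes can be sketched as follows; the repo names and numbers are invented for illustration, and this is not his actual pipeline, just the general shape of turning periodic download snapshots into trends:

```python
# Hypothetical sketch: given periodic snapshots of per-repo download
# counts (here invented data, in place of real scraped numbers),
# compute each repo's growth between the first and last snapshot.

snapshots = {
    "2025-11-01": {"model-a": 1200, "dataset-b": 300},
    "2025-12-01": {"model-a": 2500, "dataset-b": 450},
}

def growth(snaps: dict) -> dict:
    """Downloads gained per repo from earliest to latest snapshot."""
    dates = sorted(snaps)  # ISO dates sort chronologically as strings
    first, last = snaps[dates[0]], snaps[dates[-1]]
    # Repos that first appear in a later snapshot count from zero.
    return {repo: count - first.get(repo, 0) for repo, count in last.items()}

print(growth(snapshots))  # {'model-a': 1300, 'dataset-b': 150}
```

Keeping raw snapshots and deriving trends on demand, rather than storing deltas, makes it easy to recompute over any window later.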
>> All right, so we just talked about a
bunch of the closed-weight models. Let's
talk about the open ones. Uh, so tell me
about the landscape of open LLM models.
Which ones are interesting, which stand
out to you, and why? We already mentioned
DeepSeek.
>> Do you want to see how many we can name
off the top of our heads?
>> Yeah. Yeah. Without looking at notes.
>> DeepSeek, Kimi, MiniMax, Z.ai,
Ant's Ling. Are we just going Chinese?
Um, let's throw in Mistral AI, Gemma.
Um,
>> Yeah, GPT-OSS, the open-weight model by
OpenAI. Actually, Nvidia had a really
cool one, Nemotron 3. Um, there's a lot
of stuff, especially at the end of the
year. Qwen may be the one.
>> Oh, yeah. Qwen was the obvious
name that I was trying to get
to. You can get at least 10
Chinese and at least 10 Western. I
think, I mean, OpenAI released their
first open model since GPT-2. When I was
writing about OpenAI's open model
release, they were all like, "Don't
forget about GPT-2," which I thought was
really funny, cuz it's just such a
different time. But GPT-OSS is actually
a very strong model, and does some
things that the other models don't do
very well. And I think
that, selfishly, I'll promote a bunch of
Western companies; both the
US and Europe have these fully open
models. I work at the Allen Institute
for AI, which releases data and code and
all of this, and now we have actual
competition: people that are trying to
release everything so that other people
can train these models. So there's the
Institute for Foundation Models, or
LLM360, which had their K2 models of
various types. Apertus is a Swiss
research consortium effort. Hugging Face
has SmolLM, which is very popular. And
NVIDIA's Nemotron has started releasing
data as well. And then there's
Stanford's Marin community project,
which is kind of making it so there's a
pipeline for people to open a GitHub
issue, implement a new idea, and then
have it run in a stable language-modeling
stack. That list was way smaller in
2024; I think it was just AI2. So it's a
great thing for more people to get
involved and to understand language
models, and it doesn't really have
an analog among Chinese companies.
While I'm talking, I'll say that the
Chinese open language models tend to be
much bigger and that gives them this
higher peak performance as where a lot
of these things that we like a lot
whether it was Gemma um and Neatron have
tended to be smaller models from the US
which is which is starting to change
from US and Europe. U Mr. large three
came out which was a giant model very
similar to Deepseek architecture in
December and then a startup RCAI and
both Neatron have Neatron and Nvidia
have teased models of this way bigger
than 100 billion parameters like this
400 billion parameter range coming in
this like Q1 2026 timeline. So, I think
this kind of balance is set to change
this year in terms of what people are
using the Chinese versus US open models
for, which will be which I'm personally
gonna be very excited to watch.
>> First of all, huge props for being able
to name so many of these. Did you
actually name Llama?
>> Um, no.
>> I feel like this was not on purpose.
>> RIP Llama.
>> Mhm.
>> All right. Can you mention some
interesting models that stand out? You
mentioned Qwen 3 is obviously a
standout.
>> So I would say the year is almost
bookended by DeepSeek V3 and R1 on one
hand, and then, on the other hand, in
December, DeepSeek V3.2, because what I
like about those is they always have an
interesting architecture tweak that
others don't have. But otherwise, if you
want the familiar but really good
performance, Qwen 3, and, like Nathan
said, also GPT-OSS. And what's
interesting about GPT-OSS is it's kind
of the first public, or open-weight,
model that was really trained with tool
use in mind, which I do think is a
little bit of a paradigm shift, where
the ecosystem was not quite ready for
it. By tool use I mean that the LLM is
able to do a web search or call a Python
interpreter. And I do think it's a
standout because it's a huge unlock: one
of the most common complaints about LLMs
is, for example, hallucinations, right?
And in my opinion, one of the best ways
to reduce hallucinations is to not try
to always remember information or make
things up. For math, why not use a
calculator app or Python?
>> If I asked the LLM who won the, I
don't know, soccer World Cup in 1998,
instead of just trying to recall it from
memory, it could go do a search. Mostly
it's usually still a Google search. So
GPT-OSS would do a tool call to Google,
maybe find the FIFA website, and find,
okay, it was France. It would get you
that information reliably instead of
just trying to memorize it. So I think
it's a huge unlock, which right now is
not fully utilized yet by the
open-source, open-weight ecosystem. A
lot of people don't use tool-call modes
because, first, I think it's a trust
thing. You don't want to run this on
your computer, where a model with access
to tools could wipe your hard drive or
whatever. So you want to maybe
containerize that. But I do think, you
know, that is a really important step
for the upcoming years, to have this
ability. Yeah.
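To make the tool-use idea concrete, here is a minimal sketch of the dispatch loop an LLM runtime might run around the model. The tool names, the JSON call format, and both tool bodies are illustrative assumptions, not any particular model's actual schema.

```python
import json

# Hypothetical tools; a real web_search would hit a search API.
def calculator(expression: str) -> str:
    # Evaluate arithmetic instead of asking the model to "remember" it.
    return str(eval(expression, {"__builtins__": {}}, {}))

def web_search(query: str) -> str:
    return f"[search results for: {query}]"

TOOLS = {"calculator": calculator, "web_search": web_search}

def run_tool_call(model_output: str) -> str:
    """Dispatch a JSON tool call emitted by the model, e.g.
    {"tool": "web_search", "args": {"query": "..."}}."""
    call = json.loads(model_output)
    return TOOLS[call["tool"]](**call["args"])

print(run_tool_call('{"tool": "calculator", "args": {"expression": "2 + 2"}}'))
print(run_tool_call('{"tool": "web_search", "args": {"query": "1998 FIFA World Cup winner"}}'))
```

In a full loop, the tool result would be appended to the conversation and the model would continue generating with that grounded information, which is the hallucination fix described above.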
>> So, a few quick things. First of all,
thank you for defining what you mean by
tool use. I think that's a great thing
to do in general for the concepts we're
talking about. Even for things as
well-established as MoE, you have to say
that it means mixture of experts, and
you kind of have to build up an
intuition for people: what it means, how
it's actually utilized, what the
different flavors are. So what does it
mean that there's such an explosion of
open models? What's your intuition?
>> If you're releasing an open model, you
want people to use it, first and
foremost. And then after that come
things like transparency and trust. I
think when you look at China, the
biggest reason is that they want people
around the world to use these models,
and if you look outside the US, a lot of
people will not pay for software, but
they might have computing resources
where you can put a model and run it.
There can also be data that you don't
want to send to the cloud. So the number
one thing is getting people to use
models, to use AI, or to use your AI,
and they might not be able to do that
without having access to the model.
>> I guess we should state this
explicitly: we've been talking about
these Chinese models, and open-weight
models are oftentimes run locally. So
it's not like you're sending your data
to China, or to Silicon Valley, or to
whoever developed the model.
>> A lot of American startups make money
by hosting these models from China and
selling tokens, which means somebody
will call the model to do some piece of
work. I think the other reason is that a
US company like OpenAI is so
GPU-deprived, they're at the limits of
their GPUs. Whenever they make a
release, they're always talking about
how their GPUs are hurting, and I think
in one of these GPT-OSS release
sessions, Sam Altman said something
like, oh, we're releasing this because
then we can use your GPUs; we don't have
to use our GPUs. And OpenAI can still
get distribution out of this, which is
another very real thing, because it
doesn't cost them anything. And for the
user, I mean, there are users who just
use the model locally the way they would
use ChatGPT, but for companies, too, I
think it's a huge unlock to have these
models, because you can customize them,
you can train them, you can add
post-training, add more data, specialize
them into, let's say, law or medical
models, whatever you have. And, you
mentioned Llama: the appeal of the
open-weight models from China is also
that the licenses are even friendlier. I
think they are just unrestricted
open-source licenses, whereas if you use
something like Llama or Gemma, there are
some strings attached. I think it's like
an upper limit in terms of how many
users you have, and then if you exceed,
I don't know, so many million users, you
have to report your situation to, let's
say, Meta, or something like that. So
while it is a free model, there are
strings attached, and people do like
things where strings are not attached.
So I think that's also one of the
reasons, besides performance, why the
open-weight models from China are so
popular: you can just use them. There's
no catch in that sense. Yeah.
>> The ecosystem has gotten better on
that front, but mostly downstream of
these new providers offering such open
licenses. It was funny when you pulled
up Perplexity: it said Kimi K2 Thinking,
hosted in the US. I've never seen that,
but it's an exact example of what we're
talking about, where people are
sensitive to this. Kimi K2 Thinking and
Kimi K2 are models that are very
popular. People say they have very good
creative writing and are also good at
some software things. There are just
these little quirks that people pick up
on with different models that they like.
>> Uh, what are some interesting ideas
that some of these models have explored
that you can speak to, that are
particularly interesting to you?
>> Maybe we can go chronologically. I
mean, there was of course DeepSeek,
DeepSeek R1, that came out in January,
if we just focus on 2025. However, this
was based on DeepSeek V3, which came out
the year before, in December 2024. There
are multiple things on the architecture
side. What is fascinating, and that's
what I do in my from-scratch coding
projects, is you can still start with
GPT-2 and add things to that model to
make it into this other model. So it's
all still kind of the same lineage;
there is a very close relationship
between those. But off the top of my
head, with DeepSeek, what was unique
there is the mixture of experts. I mean,
they were not inventing mixture of
experts; we can maybe talk a bit more
about what mixture of experts means. But
just to list these things first before
we dive into detail: mixture of experts,
but then they also had multi-head latent
attention, which is a tweak to the
attention mechanism. This was, I would
say, in 2025 the main distinguishing
factor between these open-weight models:
different tweaks to make inference
cheaper or the KV cache smaller. We can
also define the KV cache in a few
moments, but the point is to make it
more economical to have long context, to
shrink the KV cache size. So what are
the tweaks we can do? Most of them
focused on the attention mechanism.
There is multi-head latent attention in
DeepSeek. There is grouped query
attention, which is still very popular.
It's not invented by any of those
models; it goes back a few years, but
that would be the other option. And
sliding window attention; I think Olmo 3
uses it, if I remember correctly. So
there are these different tweaks that
make the models different. Otherwise, I
put them all together in an article
once, where I just compared them. They
are surprisingly similar. It's just
different numbers in terms of how many
repetitions of the transformer block you
have in the center, and just little
knobs that people tune. But what's so
nice about it is that it works no matter
what. You can tweak things. You can move
the normalization layers around. You get
some performance gains. And Olmo is
always very good with ablation studies,
showing what it actually does to the
model if you move something around.
Ablation studies: does it make it better
or worse? But there are so many, let's
say, ways you can implement a
transformer and make it
still work. Big ideas that are still
prevalent are mixture of experts,
multi-head latent attention, sliding
window attention, and grouped query
attention. And then at the end of the
year, we saw a focus on making the
attention mechanism scale linearly with
the number of tokens at inference. There
was Qwen 3 Next, for example, which
added a gated DeltaNet. It's kind of
inspired by state-space models, where
you have a fixed state that you keep
updating, but it essentially makes this
attention cheaper, or replaces attention
with a cheaper operation.
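To see why labs keep tweaking attention to shrink the KV cache, a rough back-of-the-envelope calculation helps. The sketch below compares full multi-head attention with grouped query attention; every configuration number here is made up for illustration and is not the spec of any model mentioned in the conversation.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # 2x for keys and values; one cached vector per layer, per KV head,
    # per token position; 2 bytes per element assumes 16-bit precision.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative config: 32 layers, 128-dim heads, 32k context.
# MHA keeps 32 KV heads; GQA shares KV across query heads, keeping only 8.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_len=32_768)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=32_768)
print(f"MHA: {mha / 1e9:.1f} GB, GQA: {gqa / 1e9:.1f} GB")  # GQA is 4x smaller here
```

The linear-attention ideas mentioned above (gated DeltaNet, state-space-inspired layers) go further: they replace this per-token cache with a fixed-size state, so memory no longer grows with context length at all.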
>> And maybe it is useful to step back
and talk about the transformer
architecture in general.
>> Yeah. So maybe we should start with
the GPT-2 architecture, the transformer
that was derived from the "Attention Is
All You Need" paper.
>> Mhm. So the "Attention Is All You
Need" paper had a transformer
architecture with two parts, an encoder
and a decoder, and GPT went with just
the decoder part. It is essentially
still a neural network, and it has this
attention mechanism inside, and you
predict one token at a time. You pass
the input through an embedding layer,
and then there's the transformer block.
The transformer block has attention
modules and a fully connected layer, and
there are some normalization layers in
between, but it's essentially neural
network layers with this attention
mechanism. So coming from
GPT-2, when we move on to GPT-OSS, there
is, for example, the mixture-of-experts
layer. It's not invented by GPT-OSS;
it's a few years old, but it is
essentially a tweak to make the model
larger without consuming more compute in
each forward pass. So there is this
fully connected layer, and if listeners
are familiar with multi-layer
perceptrons, you can think of a mini
multi-layer perceptron, a fully
connected neural network layer, inside
the transformer. And it's very expensive
because it's fully connected: if you
have a thousand inputs and a thousand
outputs, that's a million connections,
and it's a very expensive part of the
transformer. And the idea is to expand
that into multiple feed-forward
networks. So instead of having one,
let's say you have 256. That alone would
make it way more expensive, because now
you have 256, but you don't use all of
them at the same time. You now have a
router that says, okay, based on this
input token, it would be useful to use
this fully connected network. And in
that context, it's called an expert. So
a mixture of experts means you have
multiple experts, and depending on what
your input is, let's say it's more
math-heavy, it would use different
experts compared to, let's say,
translating input text from English to
Spanish, where it would maybe consult
different experts. It's not quite as
clear-cut as saying, okay, this is only
an expert for math and this one for
Spanish; it's a bit more fuzzy. But the
idea is essentially that you pack more
knowledge into the network, but not all
the knowledge is used all the time. That
would be very wasteful. So, during token
generation, you're more selective.
There's a router that selects which
tokens should go to which expert. It
adds more complexity. It's harder to
train. There's a lot, you know, that can
go wrong, like router collapse and so
on. So I think that's why Olmo 3 still
uses a dense design. I mean, the Olmo
models, I think, are not
mixture-of-experts but dense models,
where dense means, so this is also
jargon, there's a distinction between
dense and sparse. Mixture of experts is
considered sparse because you have a lot
of experts but only a few of them are
active. That's called sparse, and dense
would be the opposite, where you only
have one fully connected module and it's
always, you know, utilized. So maybe
this is a good place to also talk about
the KV cache.
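The router-plus-experts idea described above can be sketched in a few lines. All sizes, the top-2 routing choice, and the random weights below are toy assumptions for illustration; real MoE layers process batches of tokens and add load-balancing losses to avoid the collapse problem mentioned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy config: 8 experts, route each token to its top 2.
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" is a small two-layer MLP (the fully connected block).
experts_w1 = rng.normal(size=(n_experts, d_model, 4 * d_model)) * 0.02
experts_w2 = rng.normal(size=(n_experts, 4 * d_model, d_model)) * 0.02
router_w = rng.normal(size=(d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router_w                       # one router score per expert
    top = np.argsort(logits)[-top_k:]           # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                    # softmax over the chosen k
    out = np.zeros_like(x)
    for w, e in zip(weights, top):
        h = np.maximum(x @ experts_w1[e], 0.0)  # expert MLP with ReLU
        out += w * (h @ experts_w2[e])
    return out

token = rng.normal(size=d_model)
print(moe_forward(token).shape)  # only 2 of the 8 expert MLPs actually ran
```

This is the sparse-versus-dense distinction in code: the layer holds 8 experts' worth of parameters, but each token only pays the compute cost of 2.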
But actually, before that, even zooming
out: fundamentally, how many new ideas
have been implemented from GPT-2 to
today,
>> like, how different really are these
architectures?
>> Picture, like, the mixture of
experts; and the attention mechanism in
GPT-OSS, that would be the grouped query
attention mechanism. So it's a slight
tweak from multi-head attention to
grouped query attention. So there we
have two. I think they replaced
LayerNorm with RMSNorm, but it's just a
different normalization layer. Not a big
change, just a tweak. Then the nonlinear
activation function; for people familiar
with deep neural networks, it's the same
as swapping sigmoid for ReLU. It's not
changing the network fundamentally. It's
just a little tweak. And that's about
it. I would say it's not really
fundamentally that different. It's still
the same architecture. So you can go
from one into the other by just adding
these changes, basically.
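The LayerNorm-to-RMSNorm swap mentioned here is small enough to show directly. This sketch omits the learnable scale (and LayerNorm's shift) parameters that real implementations include.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LayerNorm: subtract the mean, divide by the standard deviation.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def rms_norm(x, eps=1e-6):
    # RMSNorm: skip the mean subtraction, divide by the root-mean-square.
    return x / np.sqrt((x * x).mean() + eps)

x = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(x))
print(rms_norm(x))
```

Dropping the mean subtraction saves a little compute and, in practice, works just as well, which is exactly the kind of small knob being described.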
>> this fundamentally is still the same
architecture.
>> Yep. For example, you mentioned my
book earlier; that's a GPT-2 model in
the book, because it's simple and it's
very small, approximately 124 million
parameters. But in the bonus materials,
I do have Olmo 3 from scratch, Gemma 3
from scratch, and other from-scratch
models, and I always start with my GPT-2
model and just, you know, tweak, well,
add different components, and you get
from one to the other. It's kind of like
a lineage, in a sense. Yeah.
>> Can you build
up an intuition for people? Because when
you zoom out and look at it, there's so
much rapid advancement in the AI world,
and at the same time, fundamentally, the
architectures have not changed. So where
is all the turbulence, the turmoil of
the advancement, happening? Where are
the gains to be had?
>> So there are the different stages
where you develop or train the network.
You have pre-training. Back in the day,
it was just pre-training with GPT-2. Now
you have pre-training, mid-training, and
post-training. I think right now we are
in the post-training-focused stage. I
mean, pre-training still gives you
advantages if you scale it up with
higher-quality data. But then we have
capability unlocks that were not there
with GPT-2. For example, ChatGPT is
basically a GPT-3 model, and GPT-3 is
the same as GPT-2 in terms of
architecture. What was new was adding
supervised fine-tuning and reinforcement
learning from human feedback. So it's
more on the algorithmic side rather than
the architecture.
>> I would say that the systems also
change a lot. If you listen to Nvidia's
announcements, they talk about things
like: you can now do FP8, you can now do
FP4. What is happening is that these
labs are figuring out how to utilize
more compute to put into one model,
which lets them train faster, and that
lets them put more data in. And then you
can find better configurations faster by
doing this. So you can look at,
essentially, tokens per second per GPU,
a metric that you watch when you're
doing large-scale training, and you can
go from, like, 10k to 13k by turning on
FP8 training, which means you're using
less memory per parameter in the model,
and by storing less information, you do
less communication and you can train
faster. So all of these systems things
underpin way faster experimentation on
data and algorithms. It's this kind of
loop that keeps going, which is kind of
hard to describe when you look at the
architectures and they're exactly the
same, but the code base used to train
these models is going to be vastly
different.
>> And, I mean, the GPUs are different,
but you could probably train GPT-OSS 20B
way faster in wall-clock time than GPT-2
was trained at the time.
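The 10k-to-13k tokens-per-second jump translates directly into wall-clock savings. Here is the arithmetic; the token budget and cluster size below are entirely made-up illustration values, with only the throughput numbers taken from the conversation.

```python
# Back-of-the-envelope: how long a fixed training run takes at each
# throughput. tokens_total and n_gpus are hypothetical.
tokens_total = 1e12  # hypothetical training token budget
n_gpus = 1024        # hypothetical cluster size

for label, tok_per_sec_per_gpu in [("BF16", 10_000), ("FP8", 13_000)]:
    seconds = tokens_total / (tok_per_sec_per_gpu * n_gpus)
    print(f"{label}: {seconds / 86_400:.1f} days")  # FP8 finishes 30% sooner
```

The same 1.3x factor compounds across every experiment the lab runs, which is why these system-level gains feed the faster data-and-algorithms loop described above.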
Yeah, like you said, they had, for
example, in the mixture of experts, this
NVFP4 optimization, where you get more
throughput. I do think this is true for
the speed, but it doesn't give