Transcript
EV7WhVT270Q • State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, GPUs, AGI | Lex Fridman Podcast #490
Kind: captions
Language: en
The following is a conversation all
about the state-of-the-art in artificial
intelligence, including some of the
exciting technical breakthroughs and
developments in AI that happened over
the past year and some of the
interesting things we think might happen
this upcoming year. At times it does get
super technical, but we do try to make
sure that it remains accessible to folks
outside the field without ever dumbing
it down. It is a great honor and
pleasure to be able to do this kind of
episode with two of my favorite people
in the AI community, Sebastian Raschka and Nathan Lambert. They are both widely respected machine learning researchers and engineers who also happen to be great communicators, educators, writers, and Twitter/X posters.
Sebastian is the author of two books I highly recommend for beginners and experts alike: first, Build a Large Language Model (From Scratch), and second, Build a Reasoning Model (From Scratch). I truly believe that in the machine learning and computer science world, the best way to learn and understand something is to build it yourself from scratch.
Nathan is the post-training lead at the
Allen Institute for AI and author of the
definitive book on reinforcement
learning from human feedback.
Both of them have great X accounts and great Substacks, Sebastian has courses on YouTube, Nathan has a podcast, and everyone should absolutely follow all of those. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description, where you can also find links to contact me, ask questions, get feedback, and so on. And now, dear friends, here's Sebastian Raschka and Nathan Lambert. So, I think
one useful lens to look at all of this through is the so-called DeepSeek moment. This happened about a year ago, in January 2025, when the open-weight Chinese company DeepSeek released DeepSeek R1, which, I think it's fair to say, surprised everyone with near or at state-of-the-art performance, with allegedly much less compute, for much cheaper. From then to today, the AI competition has gotten insane, both on the research level and the product level. It's just been accelerating. Let's discuss all this today, and maybe let's start with some spicy questions if we can. Who's winning at the international level?
Would you say it's the set of companies
in China or the set of companies in the
United States? And Sebastian, Nathan,
it's good to see you guys. Uh so
Sebastian, who do you think is winning?
>> Winning is a very broad term. You mentioned the DeepSeek moment, and I do think DeepSeek is definitely winning the hearts of the people who work on open-weight models, because they share these as open models. Winning, I think, has multiple time scales to it: we have today, we have next year, we have ten years out. One thing I know for sure is that I don't think nowadays, in 2026, there is any company that has access to a technology no other company has access to. And that is mainly because researchers are frequently changing jobs, changing labs; they rotate. So I don't think there will be a clear winner in terms of technology access. However, I do think the differentiating factor will be budget and hardware constraints. The ideas won't be proprietary, but the resources needed to implement them will differ. So I don't currently see a winner-take-all scenario. I can't see that at the moment.
>> Uh, Nathan, what do you think?
>> You see the labs put different energy into what they're trying to do. And to demarcate the point in time when we're recording this: the hype over Anthropic's Claude Opus 4.5 model has been absolutely insane. I've used it and built stuff in the last few weeks, and it's almost gotten to the point where it feels like a bit of a meme in terms of the hype. And it's kind of funny, because this is very organic. If we go back a few months (we can get the release date in the notes), Gemini 3 from Google got released, and it seemed like the marketing and just the wow factor of that release was super high. But then at the end of November, Claude Opus 4.5 was released and the hype has been growing. Gemini 3 was before this, and it kind of feels like people don't really talk about it as much, even though when it came out everybody was like, "This is Gemini's moment to retake Google's structural advantages in AI." And Gemini 3 is a fantastic model, and I still use it. It's just that its differentiation is lower. And I agree with Sebastian, what you're saying, that the idea space is very fluid. But culturally, Anthropic is known for betting very hard on code, and the Claude Code thing is working out for them right now. So I think that even if the ideas flow pretty freely, so much of this is bottlenecked by human effort and the culture of organizations, where Anthropic seems to at least be presenting as the least chaotic, which is a bit of an advantage if they can keep doing that for a while. But on the other side of things, there's a lot of ominous technology from China, where there are way more labs than DeepSeek. DeepSeek kicked off a movement within China, kind of similar to how ChatGPT kicked off a movement in the US where everything had a chatbot. There are now tons of tech companies in China releasing very strong frontier open-weight models, to the point where I would say DeepSeek is kind of losing its crown as the preeminent open model maker in China. The likes of Z.ai with their GLM models, MiniMax's models, and Moonshot's Kimi, especially in the last few months, have shone more brightly. The new DeepSeek models are still very strong, but it could look, in retrospect, like a big narrative point: in 2025 DeepSeek came and kind of provided this platform for way more Chinese companies to release these fantastic models and have this new type of operation. So these models from these Chinese companies are open weights, and depending on the trajectory of the business models these American companies are pursuing, that could be at risk. But currently a lot of people are paying for AI software in the US, and historically, in China and other parts of the world, people don't pay a lot for software.
>> So some of these models, like DeepSeek, have the love of the people because they are open weight. How long do you think the Chinese companies will keep releasing open-weight models?
>> I would say for a few years. In the US there's not a clear business model for it. I have been writing about open models for a while, and these Chinese companies have realized it. So I get inbound from some of them, and they're smart and realize the same constraints: a lot of US tech companies and other IT companies won't pay for an API subscription to Chinese companies, for security concerns. This has been a long-standing habit in tech, and the people at these companies then see open-weight models as an ability to influence and take part in a huge, growing AI expenditure market in the US. They're very realistic about this, and it's working for them, and I think their government will see that it is building a lot of influence internationally in terms of uptake of the technology. So there are going to be a lot of incentives to keep it going, but building these models and doing the research is very expensive. So at some point I expect consolidation, but I don't expect that to be a story of 2026; there will be more open model builders throughout 2026 than there were in 2025, and a lot of the notable ones will be in China. You were going to say something?
>> Yes. You mentioned DeepSeek losing its crown. I do think to some extent yes, but we also have to consider that they are still, I would say, slightly ahead. It's not that DeepSeek got worse; it's that the other ones are using the ideas from DeepSeek. For example, you mentioned Kimi: same architecture, and they're training it. And then again we have this leapfrogging, where they might be a bit better at some point in time because they have the more recent model. And I think this comes back to the fact that there won't be a clear winner. It will just be like that: one releases something, the other one comes in. And the most recent model is probably always the best model.
>> Yeah. We'll also see, the Chinese companies have different incentives. DeepSeek is very secretive, whereas some of these startups, like the MiniMaxes and Z.ais of the world, those two have literally filed IPO paperwork, and they're trying to get Western mindshare and do a lot of outreach there. So I don't know if these incentives will change the model development, because DeepSeek famously is built by a hedge fund, High-Flyer Capital, and we don't know exactly what they like, what they use the models for, or if they care about this.
>> They're secretive in terms of communication, but they're not secretive in terms of the technical reports that describe how their models work. They're still open on that front. And we should also say, on the Opus 4.5 hype, there's the layer of something being the darling of the X/Twitter echo chamber versus the actual number of people using the model. I think it's probably fair to say that ChatGPT and Gemini are focused on the broad user base that just wants to solve problems in their daily lives, and that user base is gigantic. So the hype about the coding may not be representative of the actual use.
>> I would say also, a lot of the usage patterns are, like you said, name recognition and brand, but also almost muscle memory: ChatGPT has been around for a long time, people just got used to using it, and it's almost like a flywheel, where they recommend it to other users. One interesting point is also the customization of LLMs. For example, ChatGPT has a memory feature, so you may have a subscription and use it for personal stuff, but I don't know if you want to use that same thing at work, because that's a boundary between private and work. If you're working at a company, they might not allow that, or you may not want that. And I think that's also an interesting point: you might have multiple subscriptions. One is just clean; it has nothing of your personal images or hobby projects in there. It's just the work thing, and the other one is your personal thing. So those are two different use cases, and it doesn't mean you only have to have one. I think the future is also multiple ones.
>> What model do you think won 2025, and what model do you think is going to win '26?
>> I think in the context of consumer chatbots, it's a question of: are you willing to bet on Gemini over ChatGPT? Which, in my gut, feels like a bit of a risky bet, because OpenAI has been the incumbent and there are so many benefits to that in tech. But I think the momentum, if you look at 2025, was on Gemini's side, though they were starting from such a low point: think RIP Bard and those earlier attempts at getting started. Huge credit to them for powering through the organizational chaos to make that happen. But also, it's hard to bet against OpenAI, because they always come off as so chaotic, yet they're very good at landing things. And personally, I have very mixed reviews of GPT-5, but it had to have saved them so much money, with the headline feature being a router, so that most users are no longer driving GPU costs as much. So I think it's very hard to dissociate the things that I like in models from the things that are going to actually be a general-public differentiator.
>> What do you think about 2026? Who's going to win?
>> I'll say something even though it's risky. I will say that I think Gemini will continue to gain on ChatGPT. When both of these are operating at such extreme scale, Google has the ability to separate research and product a bit better, whereas you hear so much about OpenAI being chaotic operationally and chasing the high-impact thing, which is a very startup culture. And then on the software and enterprise side, I think Anthropic will continue to have success, as they've again and again been set up for that. Obviously Google Cloud has a lot of offerings, but I think this Gemini name brand is important for them to build, and Google Cloud will continue to do well. That's a more complex thing to explain in the ecosystem, though, because it's competing with the likes of Azure and AWS rather than on the model-provider side.
>> So, infrastructure: you think TPUs give an advantage?
>> Largely because the margin on NVIDIA chips is insane, and Google can develop everything from top to bottom to fit their stack and not have to pay this margin, and they've had a head start in building data centers. So for all of these things that have both long lead times and very high margins on high costs, Google has kind of a historical advantage. And if there's going to be a new paradigm, it's most likely to come from OpenAI, where their research division again and again has shown this ability to land a new research idea or a product. Deep Research, Sora, o1 thinking models: all of these definitional things have come from OpenAI, and that's got to be one of their top traits as an organization. So it's hard to bet against that. But I think a lot of this year will be about scale and optimizing what could be described as low-hanging fruit in models.
>> And clearly there's a trade-off between intelligence and speed. This was what GPT-5 was trying to solve behind the scenes. Do people, the broad public, actually want intelligence, or do they want speed? I think it's a nice variety, actually, or the option to have a toggle there. For my personal usage, most of the time when I look something up, I use ChatGPT to ask a quick question and get the information I want fast. For most daily tasks I use the quick model; nowadays I think the auto mode is pretty good, where you don't have to specifically say thinking or non-thinking. Then again, I also sometimes want the pro mode. Very often, when I have something written, I put it into ChatGPT and say: hey, do a very thorough check. Are all my references correct? Are all my thoughts correct? Did I make any formatting mistakes? Are the figure numbers wrong, or something like that? And I don't need that right away. It's something where, okay, I finish my stuff, maybe have dinner, let it run, come back, and go through it. And see, this is where I think it's important to have this option. I would go crazy if for each query I had to wait 30 minutes or 10 minutes.
>> That's me.
>> Yeah.
>> I'm sitting over here losing my mind that you use the router and the non-thinking model. I'm like, "Oh, how do you live with that?" That's my reaction. I've been heavily on ChatGPT for a while, and I never touch 5 non-thinking. I find its tone, and its propensity for errors: it just has a higher likelihood of errors. Some of this is from back when OpenAI released o3, which was the first model to do this deep search, find many sources, and integrate them for you. So I became habituated to that. So I will only use GPT-5.2 Thinking or Pro when I'm running any sort of information query for work, whether that's a paper or some code reference that I found, and I will regularly have like five Pro queries going simultaneously, each looking for one specific paper or feedback on an equation or something.
>> I have a funny example where I just needed an answer as fast as possible, before this podcast, when I was going on the trip. I have a local GPU running at home, and I wanted to run a long RL experiment. Usually I also unplug things, because if you're not at home, you don't want to have things plugged in. And I had accidentally unplugged the GPU. My wife was already in the car, and it's like, "Oh, dang." And then basically I wanted, as fast as possible, a bash script that runs my different experiments and the evaluation. And, I know, I learned how to use the bash terminal, but in that moment I just needed, in like 10 seconds, give me the command.
>> This is a hilarious situation, but yeah, so what did you use?
>> I used the non-thinking, fastest model. It gave me the bash command to chain the different scripts together, and then there's the tee thing, where you want to route the output to a log file, too. Off the top of my head, in a hurry, I couldn't; I could have thought about it myself otherwise.
>> By the way, I don't know if that's a representative case: wife waiting in the car, you have to run, you know, plug in the GPU, you have to generate a bash script. It sounds like a movie, like Mission Impossible.
>> I use Gemini for that. So I use Thinking for all the information stuff, and then Gemini for fast things, or stuff that I could sometimes Google. It's good at explaining things, I trust that it has this kind of background of knowledge, it's simple, and the Gemini app has gotten a lot better, so it's good for that sort of thing. Then for code and any sort of philosophical discussion I use Claude Opus 4.5, always with extended thinking. Extended thinking and inference-time scaling are just a way to make the models marginally smarter, and I will always edge to that side while the progress is very high, because you don't know when that'll unlock a new use case. And then I sometimes use Grok for real-time information, or for finding something on AI Twitter that I know I saw and need to dig up. Although, when Grok 4 came out, Grok 4 Heavy, which was like their pro variant, was actually very good, and I was pretty impressed with it, and then I just kind of lost track of it through muscle memory, with having the ChatGPT app open. So I use many different things.
>> Yeah, I actually do use Grok 4 Heavy for debugging, for hardcore debugging that the other ones can't solve. I find it's the best at that. And it's interesting, because you say ChatGPT is the best interface for you for that same reason, but this could just be momentum. Gemini is the better interface for me, I think because I fell in love with its needle-in-a-haystack performance: if I ever put in something that has a lot of context, but I'm looking for very specific kinds of information and want to make sure it tracks all of it, I find that Gemini, for me, has been the best. So it's funny with some of these models: if they win your heart over for one particular feature, on one particular day, for that particular query, that prompt, you're like, "This model is better." And so you'll just stick with it for a bit until it does something really dumb. There's like a threshold effect: some smart thing and you fall in love with it, and then it does some dumb thing and you're like, you know what, I'm going to switch and try Claude and try GPT and all that kind of stuff.
>> Exactly: you use it until it breaks, until you have a problem, and then you change the LLM. And I think it's the same with how we use anything: our favorite text editor, operating systems, or the browser. I mean, there are so many browser options, Safari, Firefox, Chrome, all relatively similar, but then there are edge cases, maybe extensions you want to use, and then you switch. But I don't think there is anyone who types the same website into different browsers and compares them. You only do that when the website doesn't render, if something breaks. So that's a good point: you use it until it breaks, and then you explore other options.
>> On the long-context thing, I was also a Gemini user for this, but the GPT-5.2 release blog had crazy long-context scores, where a lot of people were like, did they just figure out some algorithmic change? It went from like 30% to like 70% or something in this minor model update. So it's also very hard to keep track of all of these things. But now I look more favorably at GPT-5.2's long context. So it's just kind of, how do I actually get to testing this?
>> A never-ending battle. It's interesting that none of us talked about the Chinese models from a user perspective. What does that say? Does that mean the Chinese models are not as good, or does it mean we're just very biased and US-focused?
>> I do think that's currently the discrepancy between just the model and the platform. The open models are more known for their open weights, not their platform, yet.
>> There are also a lot of companies willing to sell you open-model inference at a very low cost. With OpenRouter it's easy to look at multi-model things; you could run DeepSeek on Perplexity. I think all of us sitting here use OpenAI GPT-5 Pro consistently; we're all willing to pay for the marginal intelligence gain. The models from the US are better in terms of the outputs, and the question is whether they will stay better this year and the years going forward, but as long as they're better, I'm going to pay to use them. There's also analysis showing that the way the Chinese models are served, which you could argue is due to export controls or not, is that they use fewer GPUs per replica, which makes them slower and gives them different errors. And speed and intelligence, when those are in your favor as a user, I think a lot of US users will go for that. I think that is one thing that will spur these Chinese companies to want to compete in other ways, whether it's being free or substantially lower cost, or it'll breed creativity in terms of offerings, which is good for the ecosystem. But I just think the simple thing is: US models are currently better, and we use them. I try these Chinese and other open models, and I'm like, fun, but I don't go back to them.
>> We didn't really mention programming. That's another use case that a lot of people deeply care about. I use basically half-and-half Cursor and Claude Code, because I find them to be fundamentally different experiences, and both useful. What do you guys use? You program quite a bit. What's the current vibe?
>> So, I use the Codex plugin for VS Code. It's very convenient: it's just a plugin, and then it's a chat interface that has access to your repository. I know that Claude Code is, I think, a bit different. It is a bit more agentic; it touches more things; it does a whole project for you. I'm not quite there yet where I'm comfortable with that, because maybe I'm a control freak, but I still would like to see a bit of what's going on. And Codex is right now, for me, the sweet spot, where it is helping me but not taking over completely.
>> I should mention, one of the reasons I do use Claude Code is to build the skill of programming with English. The experience is fundamentally different. As opposed to micromanaging the details of the code-generation process, looking at the diff, which you can do in Cursor if that's the IDE you use, and then changing, altering, and reading the code, understanding it deeply as you progress, you're instead thinking in this design space and guiding it at the macro level, which I think is another way of thinking about the programming process. Also, we should say that Claude Code just seems to be somehow a better utilization of Claude Opus 4.5.
>> It's a good side-by-side for people to do. You can have Claude Code open, you can have Cursor open, you can have VS Code open, and you can select the same models on all of them and ask questions. It's very interesting: Claude Code is way better in that domain. It's remarkable.
>> All right, we should say that both of you are legit on multiple fronts: researchers, programmers, educators, X posters, and on the book front, too. So, Nathan at some point soon hopefully has an RLHF book coming out.
>> It's available for pre-order, and there's a full digital preprint. I'm just making it pretty and better organized for the physical thing, which is a lot of why I do it, because it's fun to create things that you think are excellent in physical form when so much of our life is digital.
>> I should say, going to Perplexity here: Sebastian Raschka is a machine learning researcher and author known for several influential books. A couple of them that I wanted to mention, which I highly recommend: Build a Large Language Model (From Scratch), and the new one, Build a Reasoning Model (From Scratch). So, I'm really excited about that. Building stuff from scratch is one of the most powerful ways of learning.
>> Honestly, building an LLM from scratch is a lot of fun. It's also a lot to learn. And like you said, it's probably the best way to learn how something really works, because you can look at figures, but figures can have mistakes. You can look at concepts and explanations, but you might misunderstand them. But if there is code and the code works, you know it's correct. There's no misunderstanding; it's precise, otherwise it wouldn't work. And I think that's the beauty behind coding: it doesn't lie. It's math, basically. Though even with math, you can have mistakes in a book and never notice, because you're not running the math when you're reading the book; you can't verify it. And with code, what's nice is you can verify it.
>> Yeah, I agree with you about the LLM from Scratch book. It's nice to tune out everything else, the internet and so on, and just focus on the book. But, you know, I read several history books, and it's just less lonely somehow; it's really more fun. For example, on the programming front, I think it's genuinely more fun to program with an LLM.
>> And I think it's genuinely more fun to read with an LLM.
>> But you're right that this distraction should be minimized. So you use the LLM to basically enrich the experience, maybe add more context. The rate of aha moments for me, on a small scale, is really high with an LLM.
>> 100%. I also want to correct myself: I'm not suggesting not to use LLMs. I suggest doing it in multiple passes, like one pass just in offline focus mode, and then after that, I mean, I also take notes, but I try to resist the urge to immediately look things up. I do a second pass. It's just more structured this way for me. Sometimes things are answered in the chapter, but sometimes it also just helps to let it sink in and think about it. Other people have different preferences. I would highly recommend using LLMs when reading books. For me, it's just not the first thing to do. It's the second pass.
>> By way of recommendation, I'll say I do the opposite. I like to use the LLM at the beginning, to lay out the full context of what this world is that I'm now stepping into. But I try to avoid clicking out of the LLM into the world of Twitter and blogs, because then you're down a rabbit hole. You're reading somebody's opinion, there's a flame war about a particular topic, and all of a sudden you're in the realm of the internet and Reddit and so on. But if you're purely letting the LLM give you the context of why this matters and what the big-picture ideas are... Sometimes books themselves are good at doing that, but not always.
>> This is why I like the ChatGPT app: it gives the AI a home on your computer, where you can focus on it rather than it just being another tab in my mess of internet options. And I think Claude Code in particular does a good job of making that a joy, where it seems very engaging as a product, designed to be an interface through which your AI goes out into the world. Something very intangible between it and Codex is that it just feels kind of warm and engaging, whereas Codex, from OpenAI, can often be as good, but it just feels a little bit rougher on the edges. Claude Code makes it fun to build things, particularly from scratch, where you don't have to care, but you trust that it'll make something. Obviously this is good for websites and refreshing tooling and stuff like that, which I use it for, or data analysis. For my blog, we scrape Hugging Face; we keep the download numbers for every dataset and model over time, so we have them. And Claude was just like, yeah, I've made use of that data, no problem. And I was like, that would have taken me days. And then I have enough situational awareness to be like, okay, these trends obviously make sense, and you can check things. That's just a wonderful interface, where you can have an intermediary and not have to do the kind of awful low-level work you would otherwise have to do to maintain different web projects.
>> All right, so we just talked about a bunch of the closed-weight models. Let's talk about the open ones. So tell me about the landscape of open LLMs. Which ones are interesting, which stand out to you, and why? We already mentioned DeepSeek.
>> Do you want to see how many we can name off the top of our heads?
>> Yeah. Yeah. Without looking at notes.
>> DeepSeek, Kimi, MiniMax, Z.ai, Ant Group's Ling. Are we just going Chinese? Let's throw in Mistral AI, Gemma.
>> Yeah, GPT-OSS, the open-weight model by OpenAI. Actually, NVIDIA had a really cool one, Nemotron 3. There's a lot of stuff, especially at the end of the year. Qwen may be the big one.
>> Oh, yeah. Qwen was the obvious name I was trying to get to. You can get at least 10 Chinese and at least 10 Western ones. I mean, OpenAI released their first open model since GPT-2. When I was writing about OpenAI's open model release, they were all like, "Don't forget about GPT-2," which I thought was really funny, because it's just such a different time. But GPT-OSS is actually a very strong model, and it does some things that the other models don't do very well. And, selfishly, I'll promote a bunch of Western companies, both in the US and Europe, that have these fully open models. So I work at the Allen Institute for AI, where we've been building models that release data and code and all of this. And now we have actual competition from people trying to release everything so that other people can train these models. There's the Institute for Foundation Models, or LLM360, which had their K2 models of various types. Apertus is a Swiss research consortium. Hugging Face has SmolLM, which is very popular. NVIDIA's Nemotron has started releasing data as well. And then there's Stanford's Marin community project, which is making it so there's a pipeline for people to open a GitHub issue, implement a new idea, and then have it run in a stable language-modeling stack. So this space: that list was way smaller in 2024; I think it was just AI2. So it's a great thing for more people to get involved and to understand language models, and it doesn't really have an analog at a Chinese company.
While I'm talking, I'll say that the Chinese open language models tend to be much bigger, and that gives them this higher peak performance, whereas a lot of the ones we like a lot, whether it's Gemma or Nemotron, have tended to be smaller models from the US, which is starting to change in the US and Europe. Mistral Large 3 came out in December, a giant model very similar to the DeepSeek architecture, and then a startup, Arcee AI, and NVIDIA's Nemotron have teased models way bigger than 100 billion parameters, in this 400-billion-parameter range, coming on this Q1 2026 timeline. So I think this balance is set to change this year in terms of what people are using the Chinese versus US open models for, which I'm personally going to be very excited to watch.
>> First of all, huge props for being able
to name so many of these. Did you
actually name Llama?
>> Um, no.
>> I feel like this was not on purpose.
>> RIP Llama.
>> Mhm.
>> All right. Can you mention some interesting models that stand out? You mentioned Qwen 3, which is obviously a standout.
>> I would say the year is almost bookended by DeepSeek V3 and R1 on one end, and, in December, DeepSeek V3.2 on the other, because what I like about those is they always have an interesting architecture tweak that others don't have. Otherwise, if you want the familiar but really good performance: Qwen 3, and, like Nathan said, also GPT-OSS. What's interesting about GPT-OSS is that it's kind of the first open-weight model that was really trained with tool use in mind, which I do think is a bit of a paradigm shift, where the ecosystem was not quite ready for it. By tool use I mean that the LLM is able to do a web search or call a Python interpreter. And I do think it's a standout because it's a huge unlock: one of the most common complaints about LLMs is, for example, hallucinations, and in my opinion one of the best ways to address hallucinations is to not always try to remember information or make things up. For math, why not use a calculator app or Python?
>> If I asked the LLM who won, I don't know, the soccer World Cup in 1998, instead of just trying to recall it, it could go do a search. I think mostly it's usually still a Google-style search. So GPT-OSS would do a tool call to a search engine, maybe find the FIFA website, and find: okay, it was France. It would get you that information reliably instead of just trying to recall it from its weights. So I think it's a huge unlock, which right now is not fully utilized yet by the open-source, open-weight ecosystem. A lot of people don't use tool-call modes, and I think it's first a trust thing: you don't want to run this on your computer, where a model with access to tools could wipe your hard drive or whatever. So you want to maybe containerize that. But I do think that is a really important step for the upcoming years, to have this ability.
>> So, a few quick things. First of all, thank you for defining what you mean by tool use. I think that's a great thing to do in general for the concepts we're talking about. Even for things as well-established as MoE, you have to say that it means mixture of experts, and you kind of have to build up an intuition for people of what that means, how it's actually utilized, and what the different flavors are. So what does it mean that there's such an explosion of open models? What's your intuition?
>> If you're releasing an open model, you want people to use it, first and foremost. And then after that come things like transparency and trust. I think when you look at China, the biggest reason is that they want people around the world to use these models. If you look outside the US, a lot of people will not pay for software, but they might have computing resources where you can put a model and run it. There can also be data that you don't want to send to the cloud. So the number one thing is getting people to use models, to use AI, or to use your AI, and they might not be able to do that without having access to the model.
>> I guess we should state this explicitly: these Chinese models, and open-weight models generally, are often run locally. So it's not like you're sending your data to China, or to Silicon Valley, or to whoever developed the model.
>> A lot of American startups make money by hosting these models from China and selling tokens, which means somebody calls the model to do some piece of work. I think the other reason, for US companies, is that OpenAI is so GPU-deprived. They're at the limits of their GPUs; whenever they make a release, they're always talking about how their GPUs are hurting. And in one of these GPT-OSS release sessions, Sam Altman said something like: oh, we're releasing this because then you can use your GPUs and we don't have to use ours. And OpenAI can still get distribution out of it, which is another very real thing, because it doesn't cost them anything.
>> And for the user, I mean, there are users who just use the model locally the way they would use ChatGPT, but for companies I think it's also a huge unlock to have these models, because you can customize them, you can train them, you can add post-training, add more data, specialize them into, let's say, law models, medical models, whatever you have. And, you mentioned Llama: the appeal of the open-weight models from China is also that the licenses are even friendlier. They are often just unrestricted open-source licenses, whereas if you use something like Llama or Gemma, there are some strings attached. I think there's an upper limit on how many users you can have, and if you exceed, I don't know, so-and-so many million users, you have to report your financial situation to, let's say, Meta, or something like that. It is a free model, but there are strings attached, and people do like things where strings are not attached. So I think that's also one of the reasons, besides performance, why the open-weight models from China are so popular: you can just use them. There's no catch in that sense.
>> Yeah, the ecosystem has gotten better on that front, but mostly downstream of these new providers offering such open licenses. It was funny when you pulled up Perplexity: it said Kimi K2 Thinking, hosted in the US. I've never seen that before, but it's an exact example of what we're talking about, where people are sensitive to this. Kimi K2 and Kimi K2 Thinking are very popular models; people say they have very good creative writing and are also good at some software things. There are just these little quirks that people pick up on with different models that they like.
>> What are some interesting ideas that some of these models have explored that you can speak to, that are particularly interesting to you?
>> Maybe we can go chronologically. There was of course DeepSeek R1, which came out in January, if we just focus on 2025. This was based on DeepSeek V3, which came out the year before, in December 2024. There are multiple things on the architecture side. What is fascinating is, and that's what I do in my from-scratch coding projects, you can still start with GPT-2 and add things to that model to make it into this other model. So it's all still the same lineage; there is a very close relationship between them. But off the top of my head, what was unique in DeepSeek is the mixture of experts. I mean, they did not invent mixture of experts; we can maybe talk a bit more about what mixture of experts means. But just to list these things before we dive into detail: mixture of experts, and then they also had multi-head latent attention, which is a tweak to the attention mechanism. In 2025, I would say, the main distinguishing factor between these open-weight models was different tweaks to make inference cheaper, to shrink the KV cache size. We can also define the KV cache in a few moments, but the point is to make long context more economical by shrinking the KV cache. So what are the tweaks we can do? Most of them focus on the attention mechanism. There is multi-head latent attention in DeepSeek. There is grouped-query attention, which is still very popular; it's not invented by any of those models, it goes back a few years, but that would be the other option. Sliding-window attention, I think OLMo 3 uses it, if I remember correctly. So these are different tweaks that make the models different.
Otherwise, I put them all together in an article once, where I just compared them, and they are surprisingly similar. It's just different numbers: how many repetitions of the transformer block you have in the center, and little knobs that people tune. But what's so nice about it is that it works no matter what. You can tweak things, you can move the normalization layers around, and you get some performance gains. And OLMo is always very good with ablation studies, showing what it actually does to the model if you move something around: does it make it better or worse? But there are so many ways you can implement a transformer and still make it work. Big ideas that are still prevalent are mixture of experts, multi-head latent attention, sliding-window attention, and grouped-query attention. And then at the end of the year we saw a focus on making the attention mechanism scale linearly with inference token prediction. There was Qwen3-Next, for example, which added a gated DeltaNet. It's kind of inspired by state-space models, where you have a fixed-size state that you keep updating, and it essentially makes attention cheaper, or replaces attention with a cheaper operation.
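As a rough illustration of why these attention tweaks matter, here's a back-of-the-envelope KV-cache calculation. The configuration numbers are made up for illustration, not taken from any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys and values (the factor of 2) are stored per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 32-layer model with 128-dim heads at a 128k-token context, fp16:
mha = kv_cache_bytes(32, 32, 128, 128_000)  # 32 KV heads: plain multi-head attention
gqa = kv_cache_bytes(32, 8, 128, 128_000)   # 8 KV heads: grouped-query attention
print(f"MHA: {mha/1e9:.1f} GB, GQA: {gqa/1e9:.1f} GB per sequence")  # ~67 GB vs ~17 GB
```

Multi-head latent attention instead compresses the keys and values into a smaller latent, and sliding-window or linear-attention variants like gated DeltaNet cap how this grows with sequence length altogether.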
>> Maybe it's useful to step back and talk about the transformer architecture in general.
>> Yeah. So maybe we should start with the GPT-2 architecture, the transformer that was derived from the "Attention Is All You Need" paper.
>> Mhm.
>> So the "Attention Is All You Need" paper had a transformer architecture with two parts, an encoder and a decoder, and GPT focused in on just the decoder part. It is still essentially a neural network, it has this attention mechanism inside, and you predict one token at a time. You pass the input through an embedding layer; then there's the transformer block. The transformer block has attention modules and a fully connected layer, with some normalization layers in between, but it's essentially neural network layers with this attention mechanism.
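As a minimal sketch of that decoder-only stack, condensed to its skeleton (in the spirit of the from-scratch books, though this particular code is illustrative, not taken from them):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One GPT-style block: attention plus MLP, each with a residual and a norm."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True blocks attention, so each token sees only earlier tokens.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                    # residual around attention
        return x + self.mlp(self.norm2(x))  # residual around the MLP

class TinyGPT(nn.Module):
    """Embedding -> stack of blocks -> linear head predicting the next token."""
    def __init__(self, vocab=50257, d_model=256, n_heads=4, n_layers=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.blocks = nn.Sequential(*[TransformerBlock(d_model, n_heads)
                                      for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab)

    def forward(self, idx):  # idx: (batch, seq) of token ids
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok(idx) + self.pos(pos)
        return self.head(self.blocks(x))  # logits over the next token
```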
So coming from GPT-2, when we move on to GPT-OSS, there is, for example, the mixture-of-experts layer. It's not invented by GPT-OSS; it's a few years old. But it is essentially a tweak to make the model larger without consuming more compute in each forward pass. There is this fully connected layer, and if listeners are familiar with multi-layer perceptrons, you can think of a mini multi-layer perceptron, a fully connected neural network layer, inside the transformer. It's very expensive because it's fully connected: if you have a thousand inputs and a thousand outputs, that's like a million connections, and it's a very expensive part of this transformer. The idea is to expand that into multiple feed-forward networks. So instead of having one, let's say you have 256. That alone would make it way more expensive, because now you have 256 of them, but you don't use all of them at the same time. You now have a router that says: okay, based on this input token, it would be useful to use this particular fully connected network. And in that context, it's called an expert. So a mixture of experts means you have multiple experts, and depending on what your input is, let's say it's more math-heavy, it would use different experts compared to, let's say, translating input text from English to Spanish; it would maybe consult different experts. It's not quite as clear-cut as saying this one is only an expert for math and that one for Spanish; it's a bit more fuzzy. But the idea is essentially that you pack more knowledge into the network, while not all the knowledge is used all the time; that would be very wasteful. So during token generation you're more selective: there's a router that selects which tokens should go to which expert. It adds more complexity, and it's harder to train; there's a lot that can go wrong, like router collapse. I think that's why OLMo 3 still uses dense. I mean, you have, I think, both mixture-of-experts models and dense models, where dense, this is also jargon: there's a distinction between dense and sparse. Mixture of experts is considered sparse, because you have a lot of experts but only a few of them are active. Dense is the opposite, where you only have one fully connected module and it's always utilized. So maybe this is a good place to also talk about the KV cache.
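Here's a minimal sketch of that routing idea, with made-up sizes. A production MoE layer adds load-balancing losses and careful initialization to avoid the collapse mentioned above, and dispatches tokens per expert in batches rather than looping:

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Sparse MoE: each token is routed to top_k of num_experts small MLPs."""
    def __init__(self, d_model=256, d_hidden=1024, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # scores every expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):                               # x: (num_tokens, d_model)
        scores = self.router(x)                         # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts
        weights = weights.softmax(dim=-1)               # mixing weights per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                chosen = idx[:, k] == e                 # tokens routed to expert e
                if chosen.any():
                    out[chosen] += weights[chosen, k:k+1] * expert(x[chosen])
        return out  # only top_k of num_experts MLPs ran for each token
```

With 8 experts and a top_k of 2, each token pays for a quarter of the expert parameters it could in principle reach, which is the "larger model, same per-token compute" trade described above.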
>> But actually, before that, even zooming out: fundamentally, how many new ideas have been implemented from GPT-2 to today? How different, really, are these architectures?
>> Take the mixture of experts, and the attention mechanism: in GPT-OSS that would be the grouped-query attention mechanism, a slight tweak from multi-head attention to grouped-query attention. So there we have two. I think they replaced LayerNorm with RMSNorm, but that's just a different normalization layer; not a big change, just a tweak. Then the nonlinear activation function: for people familiar with deep neural networks, it's like changing sigmoid to ReLU. It's not changing the network fundamentally; it's just a little tweak. And that's about it. I would say it's not really fundamentally that different. It's still the same architecture. So you can convert one into the other by just adding these changes, basically.
>> So this fundamentally is still the same architecture.
>> Yep. For example, you mentioned my book earlier: that's a GPT-2 model in the book, because it's simple and it's very small, approximately 124 million parameters. But in the bonus materials I do have Qwen3 from scratch, Gemma 3 from scratch, and other from-scratch models, and I always started with my GPT-2 model and just tweaked it, well, added different components, and you get from one to the other. It's kind of like a lineage, in a sense.
>> Yeah. Can you build up an intuition for people? Because when you zoom out, there's so much rapid advancement in the AI world, and at the same time, fundamentally, the architectures have not changed. So where is all the turbulence, the turmoil of the advancement, happening? Where are the gains to be had?
>> So there are the different stages where you develop, or train, the network. You have pre-training; back in the day, with GPT-2, it was just pre-training. Now you have pre-training, mid-training, and post-training. I think right now we are in the post-training-focused stage. I mean, pre-training still gives you advantages if you scale it up with better, higher-quality data. But then we have capability unlocks that were not there with GPT-2. For example, ChatGPT is basically a GPT-3 model, and GPT-3 is the same as GPT-2 in terms of architecture. What was new was adding the supervised fine-tuning and the reinforcement learning from human feedback. So it's more on the algorithmic side rather than the architecture.
>> I would say that the systems also change a lot. If you listen to NVIDIA's announcements, they talk about things like: you can now do FP8, you can now do FP4. What's happening is these labs are figuring out how to utilize more compute to put into one model, which lets them train faster, and that lets them put more data in. And then you can find better configurations faster by doing this. So tokens per second per GPU is a metric you look at when you're doing large-scale training, and you can go from like 10k to 13k by turning on FP8 training, which means you're using less memory per parameter in the model, and by moving less information around, you do less communication and can train faster. So all of these systems things underpin way faster experimentation on data and algorithms. It's a loop that keeps going, which is kind of hard to describe when you look at the architectures and they're exactly the same, but the codebase used to train these models is going to be vastly different. And the GPUs are different too, but you could probably train GPT-OSS 20B way faster in wall-clock time than GPT-2 was trained at the time.
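To give a feel for why a 10k-to-13k tokens-per-second-per-GPU jump matters at scale, a back-of-the-envelope with made-up cluster and dataset sizes:

```python
def training_days(total_tokens, tokens_per_sec_per_gpu, num_gpus):
    # Wall-clock days, assuming perfect scaling across the cluster.
    return total_tokens / (tokens_per_sec_per_gpu * num_gpus) / 86_400

# Hypothetical 15-trillion-token pre-training run on 1,024 GPUs:
bf16 = training_days(15e12, 10_000, 1024)  # ~17 days
fp8  = training_days(15e12, 13_000, 1024)  # ~13 days
print(f"bf16: {bf16:.0f} days, fp8: {fp8:.0f} days")
```

Several days saved per run means several extra days of data and algorithm experiments, which is the loop described above.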
>> Yeah, like you said, they had, for example, in the mixture of experts this NVFP4 optimization, where you get more throughput. But I do think this is for speed. It doesn't give the model new capabilities in a sense; it's just: how much can we make the computation coarser without suffering model performance degradation? But I do think there are alternatives popping up to the transformer. There are text diffusion models, a completely different paradigm. Though text diffusion models might use transformer architectures, it's not an autoregressive transformer. And there are also Mamba models, state-space models, but they do have trade-offs, and the reality is that nothing has replaced the autoregressive transformer as the state-of-the-art model. So for state-of-the-art, you would still go with that. But there are now alternatives on the cheaper end, alternatives that make compromises. It's not just one architecture anymore; there are little ones coming up. But if we talk about the state of the art, it's pretty much still the autoregressive transformer architecture, derived from GPT-2, essentially.
>> I guess the big question here is, we've talked quite a bit about the architecture behind pre-training: are the scaling laws holding strong across pre-training, post-training, inference, context size, data, synthetic data?
>> I like to start with the technical definition of a scaling law, which informs all of this. A scaling law is a power-law relationship where the x-axis, the thing you're scaling, is a combination of compute and data, which are closely related, and the y-axis is the held-out prediction accuracy on the next token. We talked about models being autoregressive: if you keep aside a set of text that the model has not seen, how accurate does it get as you train? And the idea of scaling laws came when people figured out that this was a very predictable relationship. And I think that technical trend is continuing.
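In symbols, the commonly cited Chinchilla-style form of that power law, with constants approximately as fitted by Hoffmann et al. (2022):

```python
def predicted_loss(N, D, E=1.69, A=406.4, alpha=0.34, B=410.7, beta=0.28):
    """Held-out loss as a function of parameter count N and training tokens D.
    Loss falls as a power law in each term, toward an irreducible floor E."""
    return E + A / N**alpha + B / D**beta

# Scaling model and data together gives predictable, diminishing gains,
# which is why the curves look like straight lines on log axes:
for N, D in [(1e9, 20e9), (4e9, 80e9), (16e9, 320e9)]:
    print(f"N={N:.0e}, D={D:.0e} -> loss ~{predicted_loss(N, D):.2f}")
```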
And then the question is what users get out of it. There are now more types of scaling: OpenAI's o1 was famous for introducing inference-time scaling, and, less famously, for showing that you can scale reinforcement learning training and get this log x-axis with a linear increase in performance on the y-axis. So there are kind of three axes now: the traditional scaling laws, talked about for pre-training, which is how big your model is and how big your dataset is; scaling reinforcement learning, which is how long you can do this trial-and-error learning that we'll define more later; and then this inference-time compute, which is just letting the model generate more tokens on a specific problem.
So I'm kind of bullish: they're all really still working, but the low-hanging fruit has mostly been taken, especially in the last year, on reinforcement learning with verifiable rewards, this RLVR, and then inference-time scaling, which is why these models feel so different to use. Previously you would get the first token immediately; now they will go off for seconds, minutes, or even hours generating these hidden thoughts before giving you the first word of your answer. That's all this inference-time scaling, which was such a wonderful step function in how the models' abilities changed. It enabled this tool-use stuff and enabled the much better software engineering that we were talking about. And when we say enabled, this is almost entirely downstream of the fact that this reinforcement-learning-with-verifiable-rewards training just let the models pick up these skills very easily. So it let the models learn. If you look at the reasoning process when the models are generating a lot of tokens, what they'll often be doing is: try a tool, look at what comes back, try another API, see what comes back, and check whether it solves the problem. The models, when you're training them, very quickly learn to do this, and at the end of the day that gives this general foundation where the model can use CLI commands very nicely in your repo, handle git for you, move things around and organize things, or search to find more information, which, if we had been sitting in these chairs a year ago, is something we didn't really picture the models doing. So this is just something that has happened this year and has totally transformed how we think of using AI, which I think is very magical. It's such an interesting evolution, and it unlocked so much value, but it's not clear what the next avenue will be for unlocking things like this. We'll get to continual learning later; there's a lot of buzz around certain areas of AI, but no one knows when the next step function will really come.
>> So, you've actually said quite a lot of things there, and said profound things quickly. It would be nice to unpack them a little bit. You said you're bullish on basically every version of scaling. Can we just start at the beginning, with pre-training? Are we implying that the low-hanging fruit on pre-training scaling has been picked? Has pre-training hit a plateau, or are you still bullish even on pre-training?
>> Pre-training has gotten extremely expensive. To scale up pre-training also implies that you're going to serve a very large model to the users. I think it's been loosely established that the likes of GPT-4 and similar models were on this order of a trillion parameters at the biggest size. There are a lot of rumors that they've actually gotten smaller as training has gotten more efficient. You want to make the model smaller because then your cost of serving goes down proportionally. For these models, the cost of training is really low relative to the cost of serving them to hundreds of millions of users. DeepSeek had this famous number of about $5 million for pre-training at cloud market rates. In the OLMo 3 paper, I think section 2.4, we detailed how long we had the GPU clusters sitting around for training, which includes engineering issues and multiple seeds, and it was about $2 million to rent the cluster and deal with all the problems and headaches of training a model. So a lot of people could get $1 to $10 million to train a model, but the recurring cost of serving millions of users is really billions of dollars of compute. You can look at a thousand-GPU rental that you can pay 100 grand a day for, and these companies could have millions of GPUs; you can look at how much these things cost just to sit around.
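Those numbers are easy to sanity-check with made-up but ballpark rental rates:

```python
gpu_hour = 4.0  # assumed $/GPU-hour, roughly H100 cloud-rental ballpark

train_per_day = 1_000 * gpu_hour * 24           # ~$96k/day: "100 grand a day"
train_two_months = train_per_day * 60           # ~$5.8M: the $1-10M training range
serve_per_year = 100_000 * gpu_hour * 24 * 365  # ~$3.5B/yr for a 100k-GPU serving fleet
print(f"train: ${train_per_day:,.0f}/day, serve: ${serve_per_year/1e9:.1f}B/yr")
```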
So that's kind of a big thing, and then the question is: if scaling is actually giving you a better model, is it going to be financially worth it? I think it'll slowly push outward as AI solves more compelling tasks, like Claude Opus 4.5 making Claude Code just work for things. I launched this project called the ATOM Project, American truly open models, in July, and that was a true vibe-coded website; I have a job, so it makes plots and stuff. Then I came back to refresh it in the last few weeks, and Claude Opus 4.5, versus whatever model at the time, just crushed all the issues that I had from building it in June and July. It might be a bigger model; there are a lot of things that go into this. But there's still progress coming.
speaking to is the nuance of the y-axis
of the scaling laws that the way it's
experienced versus on a benchmark the
actual intelligence is might might be
different but still your intuition about
pre-training if you scale the the size
of compute will the models get better
not whether it's financially viable but
just from the law aspect of it do you
think the models will get smarter
>> Yeah. And this sometimes comes off as
almost delusional when leadership at AI
companies say it, but their view is: it's
held for 13 orders of magnitude of
compute or something, so why would it
ever end? I think fundamentally it is
pretty unlikely to stop. It's just that
eventually we're not even going to be
able to test the bigger scales because of
all the problems that come with more
compute. There's a lot of talk about how
2026 is the year when very large
Blackwell compute clusters, gigawatt-
scale facilities at hyperscalers, are
coming online, and these were all
contracts for power and data centers that
were signed and sought out in 2022 and
2023, so before or right after ChatGPT.
It took this two-to-three-year lead time
to build these bigger clusters to train
the models. And there's obviously immense
interest in building even more data
centers than that. So that is kind of the
crux of what people are saying: these new
clusters are coming, the labs are going
to have more compute for training, and
they're going to utilize it. But it's not
a given. I've seen so much progress that
I expect it, and I expect a little bit
bigger models. I would say it's more like
we will see a $2,000 subscription this
year. We've seen $200 subscriptions; that
could 10x again. These are the kinds of
things that could come, and they're all
downstream of this bit-bigger model that
offers just a little more cutting edge.
>> So, it's reported that xAI is going to
hit that 1 gigawatt scale in early 2026
and a full 2 gigawatts by year end. How
do you think they'll utilize that in the
context of scaling laws? Is a lot of that
inference? Is a lot of that training?
>> It ends up being all of the above. I
think that all of your decisions when
you're training a model come back to
pre-training. So if you're going to scale
RL on a model, you still need to decide
on an architecture that enables this.
We're talking about architectures using
different types of attention; we're also
talking about mixture-of-experts models.
The sparse nature of these models makes
it much more efficient to do generation,
which becomes a big part of
post-training. You need to have your
architecture ready so that you can
actually scale up this compute.
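The sparsity point is easy to see in code. Here is a toy sketch of mixture-of-experts routing; the sizes are invented, not any particular model's config. Only the top-k experts actually run per token, so generation FLOPs scale with k rather than the total expert count.

```python
import numpy as np

# Toy mixture-of-experts routing for a single token.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2
x = rng.normal(size=d_model)                          # one token's hidden state
router = rng.normal(size=(d_model, n_experts))        # learned routing weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

logits = x @ router
chosen = np.argsort(logits)[-top_k:]                  # indices of the top-k experts
weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()  # renormalized gate

# Only 2 of the 8 expert matmuls execute for this token.
y = sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))
```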
I still think most of the compute is
going into pre-training, because you can
still make the model better. You still
want to go and revisit this. You still
want the best base model you can get, and
in a few years that'll saturate and the
RL compute will just grow longer.
>> Are there people who disagree with
you, who say basically pre-training is
dead, it's all about scaling inference,
scaling post-training, scaling context,
continual learning, scaling data,
synthetic data?
>> People vibe that way and describe it
that way, but I think that's not the
practice that is happening.
>> That's just the general vibe, people
saying these things.
>> The excitement is elsewhere. The
low-hanging fruit in RL is elsewhere. For
example, we released our model in
November; every company has deadlines,
and ours was around November 20th. For
that release, the RL run was five days,
which compared to 2024 is a very long
time to just be doing post-training on a
model of about 30 billion parameters.
It's not a big model. Then in December we
had another release where we just let the
RL run go for another three and a half
weeks, and the model got notably better,
so we released it. That's a big amount of
time to allocate to something that is
going to be your peak for the year. There
are these types of decisions that happen
when labs are training a model, where
they can't just leave it running forever.
You have to keep pulling in the
improvements you have from your
researchers. So you redo pre-training,
you do this post-training for a month,
but then you need to give it to your
users, you need to do safety testing. I
think there's a lot in place that
reinforces this cycle of just keep
updating the models; there are things to
improve, like if you get a new compute
cluster that lets you do something more
stably or faster.
You hear a lot about Blackwell having
rollout issues. At AI2, most of the
models we're pre-training are on 1,000 to
2,000 GPUs, but when you're pre-training
on 10,000 or 100,000 GPUs, you hit very
different failures. GPUs are known to
break in weird ways, and doing a
100,000-GPU run means you're pretty much
guaranteed to always have at least one
GPU that is down, and you need your
training code to handle that redundancy,
which is just a very different problem.
Whereas what we're doing, like me playing
with post-training on a DGX Spark, or
your book, or people learning ML: what
the labs are battling to train these
biggest models is massively distributed
scale, and that's a systems problem. In
order to enable the scaling laws,
especially at pre-training, you need all
these GPUs at once. When we shift to
reinforcement learning, it actually lends
itself to heterogeneous compute, because
you have many copies of the model. And
to do a primer for language model
reinforcement learning: what you're doing
is you have two sets of GPUs. One you can
call the actor, and one you call the
learner. The learner is where your actual
reinforcement learning updates are going
to happen. These are traditionally policy
gradient algorithms: proximal policy
optimization, PPO, and group relative
policy optimization, GRPO, are the two
popular classes. On the other side you're
going to have actors, which are
generating completions, and these
completions are the things you're going
to grade. Reinforcement learning is all
about optimizing reward. In practice, you
can have a lot of different actors in
different parts of the world doing
different types of problems, and then you
send the results back to this highly
networked compute cluster to do the
actual learning, where you take the
gradients, and you need a tightly meshed
network where you can do different types
of parallelism and spread out your model
for efficient training.
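A minimal sketch of that actor/learner split, under stated assumptions: `generate`, `grade`, and `policy_gradient_step` are placeholders for whatever inference engine, reward function, and optimizer a lab actually uses, not a specific framework's API.

```python
# Sketch of one iteration of the actor/learner loop for language-model RL.

def rl_iteration(actors, learner, problems, n_samples=8):
    batch = []
    # Actors: many loosely coupled GPU pools that just generate completions.
    for problem in problems:
        for actor in actors:
            completions = actor.generate(problem, n=n_samples)
            for completion in completions:
                reward = grade(problem, completion)  # e.g. check a verifiable answer
                batch.append((problem, completion, reward))
    # Learner: one tightly meshed cluster that takes the gradient step
    # (PPO, GRPO, or another policy-gradient algorithm).
    learner.policy_gradient_step(batch)
    # Updated weights are broadcast back to the actors before the next round.
    for actor in actors:
        actor.load_weights(learner.weights)
```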
So every different type of training and
serving has these scaling considerations.
We talked about pre-training, we talked
about RL, and then inference-time scaling
is: how do you serve a model that's
thinking for an hour to 100 million
users? I don't really know about that,
but I know it's a hard problem, and in
order to give people this intelligence
there are all these systems problems, and
you need more compute, and more stable
compute, to do it.
>> But you're bullish on all of these
kinds of scaling is what I'm hearing: on
the inference, on the reasoning, even on
the pre-training.
>> Yeah. So that's a big can of worms.
There are basically two knobs, training
and inference scaling, where you can get
gains. In a world where we had, let's
say, infinite compute resources, you'd
want to do all of them. So you have
training and you have inference scaling,
and training is like a hierarchy:
pre-training, mid-training, post-training,
changing the model size, more training
data. Training a bigger model gives you
more knowledge in the model, so the
model, let's say, becomes a better base
model, or as we used to call it, a
foundation model, and that unlocks
things. But you don't, let's say, get the
model to solve your most complex tasks
during pre-training or right after
pre-training. You still have these other
unlock phases: mid-training, long context
for example, and post-training with RLVR,
which unlocks capabilities the model
already has in terms of knowledge from
pre-training. And sure, if you do more
pre-training you get a better base model
that you can unlock later, but like
Nathan said, it just becomes too
expensive. We don't have infinite
compute, so you have to decide: do I want
to spend that compute on making the model
larger? It's a trade-off. In an ideal
world you'd want to do all of them, and
in that sense scaling is still pretty
much alive; you would still get a better
model. But like we saw with GPT-4.5, it's
just not worth it, because you can, let's
say, unlock more performance with other
techniques at that moment. Especially if
you look at inference scaling, that was
one of the biggest gains this year, with
o1, where it took a smaller model further
than pre-training a larger model like
GPT-4.5 did. So I wouldn't say
pre-training scaling is dead; it's just
that there are other, more attractive
ways to scale right now, at the moment.
But at some point, you know, you will
still want to make some progress on the
pre-training. The thing to consider is
also: where do you want to spend your
money? If you spend it more on the
pre-training, it's a fixed cost. You
train the model and then it has this
capability forever; you can always use
it, and so forth. With inference scaling,
you don't spend money during training,
you spend money later, per query. And
then it's also about the math: how long
is my model going to be on the market? If
I replace it in half a year, maybe it's
not worth spending 5 million, 10 million,
a hundred million on training it longer.
Maybe I'll just do more inference scaling
and get the performance from there; maybe
it costs me 2 million in terms of user
queries. It becomes a question of how
many users you have, and then doing the
math. And I think that's also where it's
interesting: ChatGPT is in a position
where, I think, they have a lot of users,
so they need to go a bit cheaper, which
is why they have that GPT-5 model that is
a bit smaller. Other companies with other
customers have other trade-offs. For
example, there was also the Math Olympiad
and some of these math problems, where
ChatGPT, or maybe a proprietary model,
I'm pretty sure it was a model that had
been fine-tuned a little bit more, but
most of it was inference scaling to
achieve this peak performance on certain
tasks, where you don't need that all the
time. But yeah, long story short, I do
think all of these, pre-training,
mid-training, post-training, inference
scaling, are all still things you want to
do. It's just that at the moment, in this
year, it's about finding the right ratio
that gives you the best bang for the
buck, basically.
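That "doing the math" is a simple break-even calculation. A sketch with invented numbers, purely for illustration of the trade-off Sebastian describes:

```python
# Fixed pre-training cost vs. per-query inference-scaling cost.
# Every number here is made up for illustration.

extra_pretrain_cost = 10_000_000          # training the bigger model, one-time
extra_cost_per_query = 0.002              # added inference-scaling cost per query
months_on_market = 6
queries_per_month = 500_000_000

inference_route = extra_cost_per_query * queries_per_month * months_on_market
print(f"inference-scaling route: ${inference_route:,.0f}")   # $6,000,000 here

# Break-even at ~5 billion total queries: below that, inference scaling is
# cheaper; past it, paying the fixed pre-training cost wins.
break_even_queries = extra_pretrain_cost / extra_cost_per_query
```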
>> I think this might be a good place to
define pre-training, mid-training, and
post-training.
>> So pre-training is the classic
training, one next-token prediction at a
time. You have a big corpus of data, and
Nathan probably also has very interesting
insights there because of OLMo 3; a big
portion of the paper focuses on the right
data mix. So pre-training is essentially
just training with a cross-entropy loss
on next-token prediction on a vast corpus
of internet data, books, papers, and so
forth.
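In code, that objective is a few lines. A minimal PyTorch sketch with toy sizes (a real run does this over trillions of tokens, with a transformer where the linear layer sits here):

```python
import torch
import torch.nn.functional as F

# Next-token prediction with cross-entropy loss, the pre-training objective.
vocab_size, d_model = 1000, 64
tokens = torch.randint(0, vocab_size, (1, 128))   # one sequence from the corpus
embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)    # stand-in for a transformer

hidden = embed(tokens)                            # [1, 128, d_model]
logits = lm_head(hidden)                          # [1, 128, vocab_size]

# Shift by one: position t predicts token t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
loss.backward()   # one gradient step of "soaking up the knowledge"
```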
It has changed a little bit over the
years, in the sense that people used to
throw in everything they could. Now it's
not just raw data; it's also synthetic
data, where people, let's say, rephrase
certain things. So synthetic data doesn't
necessarily mean purely AI-made-up data.
It's also taking something from an
article, a Wikipedia article, and then
rephrasing it as a Q&A, or summarizing
it, rewording it, and making better data
that way. Because I think of it also like
with humans: if someone, let's say, reads
a book compared to a messy, I don't know,
no offense, Reddit post or something like
that, I do think you learn, no offense,
but...
>> There's going to be a post about this.
>> Well, some Reddit data is very coveted
and excellent for training. You just have
to filter it.
>> Yeah,
>> I think that's the idea. It's like if
someone took that and rephrased it in a,
let's say, more concise and structured
way, I think it's higher-quality data.
Maybe you get the same LLM out of it at
the end, but it gets there faster; it
trains faster, because if the grammar and
the punctuation are correct, it learns
the correct way from the start, versus
getting information in a messy form and
then learning later how to correct it. So
I think that is how pre-training evolved,
and why scaling still works: it's not
just about the amount of data, it's also
the tricks to make that data better for
you. And then mid-training, I mean, it
used to be called pre-training. I think
it's called mid-training because it was
awkward to have pre-training and
post-training but nothing in the middle,
right? It sounds a bit weird: you have
pre-training and post-training, but
what's the actual training? So
mid-training is usually similar to
pre-training, but a bit more, I would
say, specialized than pre-training. It's
the same algorithm, but you focus on, for
example, long-context documents. The
reason you don't do that during just pure
pre-training is that you don't have that
many long-context documents, so you have
a specific phase. And one problem of LLMs
is also that an LLM is still a neural
network; it has the problem of
catastrophic forgetting. You teach it
something, it forgets other things. It's
not 100% forgetting, but you know, it's
like no free lunch. It's also the same
with humans: if you ask me some math I
learned 10 years ago, I don't know, I
would have to look at it again.
>> Nathan was actually saying that he's
consuming so much content that there's a
catastrophic forgetting issue.
>> Yeah, I'm trying to learn so much
about AI. I was learning about
pre-training parallelism, and I was like,
I lost something and I don't know what it
was.
>> I don't want to anthropomorphize LLMs,
but I think it's similar in that sense to
how humans learn. I mean, quantity is not
always better, because it's about being
selective, and mid-training is being
selective in terms of quality content at
the end, so the last thing the LLM has
seen is the quality stuff. And then
post-training is all the fine-tuning:
supervised fine-tuning, DPO,
reinforcement learning with verifiable
rewards, with human feedback, and so
forth. So, the refinement stages. And
it's also interesting, it's the cost
thing, right? Pre-training, you spend a
lot of money on; right now, RL a bit
less. With RL you don't really, I would
say, teach the model knowledge; it's more
like unlocking the knowledge, more like
skill learning, how to solve problems
with the knowledge it has from
pre-training. There were actually three
papers this past year, 2025, on RL for
pre-training, but I don't think anyone
does that in production.
>> Toy examples for now, huh?
>> Toy examples, right. But to
generalize: post-training is more like
the skill unlock, where pre-training is
soaking up the knowledge, essentially.
Yeah.
>> A few things that could be helpful for
people. A lot of people think of
synthetic data as being bad for training
the models. You mentioned the DeepSeek
OCR, optical character recognition,
paper. A lot of labs did this; AI2 had
multiple. And the reason each of these
labs has these is that there are vast
amounts of PDFs and other digital
documents on the web in formats where the
text isn't easily extracted. So you use
these OCR models, DeepSeek's, or ours,
which we called olmOCR, to extract what
can be trillions of tokens of candidate
data for pre-training. And pre-training
data sets are measured in trillions of
tokens: smaller models from researchers
can be something like 5 to 10 trillion,
Qwen has documented going up to around 50
trillion, and there are rumors that the
closed labs can go to something like 100
trillion tokens. In getting this
potential data to put in, they have a
very big funnel, and the data you
actually train the model on is a small
percentage of it. This character-
recognition data would be described as
synthetic data for pre-training in a lab.
wonderful answers and you could train on
those best answers and that's synthetic
data. It's very different than like
early chat GPT lots of hallucinations
data when people became grounded in
synthetic data. One interesting question
is if I recall correctly 3 was trained
with less data than specifically some
other openw weight models maybe even two
but you still got better performance and
that might be one of the examples how
the data helped.
>> It's mostly down to data quality. I
think if we had more compute, we would
train for longer; we ultimately see that
as something we would want to do. And
especially with big models, you need more
compute, because we talk about having
more parameters and we talk about
knowledge, and essentially there's a
ratio where big models can absorb more
from data, and you get more benefit out
of this. It's like any logarithmic graph
in your mind: a small model will level
off sooner as you add tokens, and bigger
models need more. But mostly, we aren't
training that big of a model right now at
AI2, and getting the highest-quality data
we can is the natural starting point.
>> Is there something to be said about
the topic of data quality? Is there some
low-hanging fruit there still, where the
quality could be improved?
>> It's like turning the crank.
Historically, in the open, there's been a
canonical best pre-training data set that
has moved around depending on who has the
most recent or best effort. AI2's Dolma
was very early, with the first OLMo;
Hugging Face had FineWeb; and there's the
DCLM project, which stands for DataComp
for language models (there have been
DataComp efforts for other machine
learning projects), and they had a very
strong data set. A lot of it is that the
internet is becoming fairly closed off.
We have Common Crawl, which I think is
hundreds of trillions of tokens, and you
filter it, and it ends up being a lot of
scientific work where you're training
classifiers and making decisions about
how to prune this data set down to the
highest-quality stuff, and the stuff that
suits your tasks. Previously, language
models were tested a lot more on
knowledge and conversational things, but
now they're expected to do math and code.
So to train a reasoning model, you need
to remix your whole data set, and there
are actually wonderful scientific methods
here: you take your gigantic data set and
sample a lot of really tiny subsets from
different sources. Say you have GitHub,
Stack Exchange, Reddit, Wikipedia. You
sample small amounts from each, train
small models on each of these mixes,
measure their performance on your
evaluations, and then you can do basic
linear regression, and it tells you:
here's your optimal data set.
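A minimal sketch of that mixing procedure. The inner training loop is replaced by a fake stand-in so the sketch runs; in practice that step is the expensive one, training a small model per mix and scoring it on your evals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

sources = ["github", "stackexchange", "reddit", "wikipedia"]
rng = np.random.default_rng(0)

def train_and_eval_small_model(mix):
    # Placeholder for the expensive step: train a tiny model on this mix and
    # score it on your evals. Here: a fake linear score so the sketch runs.
    return mix @ np.array([0.2, 0.3, 0.1, 0.4]) + rng.normal(scale=0.01)

mixes = rng.dirichlet(np.ones(len(sources)), size=32)   # 32 candidate mixtures
scores = np.array([train_and_eval_small_model(m) for m in mixes])

# Regress eval score on mix weights; coefficients say how much each source
# helps your current evals. Change the evals and this answer changes a lot.
reg = LinearRegression().fit(mixes, scores)
best_mix = mixes[np.argmax(reg.predict(mixes))]
print(dict(zip(sources, best_mix.round(2))))
```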
But if your evaluations change, your data
set changes a lot. A lot of OLMo 3 was
new sources for reasoning, to be better
at math and code, and then you do this
mixing procedure and it gives you the
answer. I think a lot of that has
happened at labs this year: there are new
hot things, whether it's coding
environments or web navigation, and you
just need to bring in new data. You need
to change your whole pre-training so your
post-training can work better, and things
like this. So there's this constant
re-evolution and re-determining of what
they care about for their models.
>> Are there fun anecdotes about what
sources of data are particularly high
quality that we wouldn't expect? You
mentioned Reddit can sometimes be a
source.
>> Reddit was very useful. I think that
PDFs are definitely one.
>> Especially arXiv.
>> Yeah. AI2 has run Semantic Scholar for
a long time, which you could say is a
competitor to Google Scholar with a lot
more features, and to do this, AI2 has
found and scraped a lot of PDFs of openly
accessible papers that might not be
behind the paywall of a certain
publisher. So, truly open scientific
PDFs. If you sit on all of these and
process them, you can get value out of
it, and I think a lot of that style of
work was done by the frontier labs much
earlier. You need a pretty skilled
researcher who understands how things
change models; they bring the data in and
they clean it, and it's a lot of labor. I
think at a lot of frontier labs, when
they scale researchers, a lot more goes
into data. If you join a frontier lab and
you want to have impact, the best way to
do it is to find new data that's better.
The fancy, glamorous algorithmic things,
figuring out how to make o1, that's the
sexiest thought for a scientist, like:
oh, I figured out how to scale RL. And
there's a group that did that, but I
think most of the contributions are like:
I'm going to make the data better, or I'm
going to make the infrastructure better
so that everybody on my team can run
experiments 5% faster.
>> At the same time, I think it's also
one of the closest-guarded secrets what
your training data is, for legal reasons.
And so there's also, I think, a lot of
work that goes into hiding what your
training data was, essentially training
the model to not give away the sources,
because, yeah, of legal reasons.
>> The other thing, to be complete, is
that some people are trying to train on
only licensed data. Common Crawl is a
scrape of, like, the whole internet. I
host multiple websites, and I'm happy to
have them train language models on those,
but I'm not explicitly licensing what
governs that. And therefore Common Crawl
is largely unlicensed, which means that
consent really hasn't been provided for
how to use the data. There's another idea
where you train language models only on
data that has been licensed explicitly,
so that kind of governing contract is
provided. And I'm not sure if Apertus is
the copyright thing or the license thing;
I know that the reason they did it was an
EU compliance thing, where they wanted to
make sure their model fit one of those
checks.
>> And also on that note, for example,
there's also a distinction within the
licensing. Some people, like you said,
just purchase the license: let's say they
buy a book online, an Amazon Kindle book
or, let's say, a Manning book or
something, and then use that in the
training data. And that's a gray zone,
because you paid for the content and you
might want to train on it, but there are
also restrictions where even that
shouldn't be allowed, and that is where
it gets a bit fuzzy. I think that is
still a hot topic right now. Also, big
companies like OpenAI have approached
private companies for their proprietary
data, and private companies are becoming
more and more, let's say, protective of
their data, because they know: okay, this
is going to be my moat in a few years.
And I do think that's an interesting
question: if LLMs become more
commoditized, and I think a lot of people
are learning about LLMs, there will be a
lot more people able to train LLMs. Of
course there are infrastructure
challenges, but if you think of big
industries, pharmaceutical, law, finance,
I do think at some point they will hire
people from the frontier labs to build
in-house models on their proprietary
data, which will then again be another
unlock with pre-training that is
currently not there, because even if you
wanted to, you can't get that data; you
can't get access to clinical trials most
of the time, and these types of things.
So I do think scaling in that sense might
still be pretty much alive if you also
look at domain-specific applications,
because right now, this year, we're just
looking at general-purpose LLMs, ChatGPT,
Anthropic, and so forth; they are just
general purpose. They're not even, I
think, scratching the surface of what an
LLM can do if it is really specifically
trained and designed for a specific task.
>> I think on the data thing, this is one
of those things that happened in 2025 and
we totally forget it: Anthropic lost in
court and owed $1.5 billion to authors.
Anthropic, I think, bought thousands of
books and scanned them, and was cleared
legally for that because they bought the
books, and that's kind of going through
the system. On the other side, they also
torrented some books, and I think this
torrenting was the path where the courts
said they were culpable, to pay these
billions of dollars to authors. It's just
such a mind-boggling lawsuit that kind of
just came and went; that is so much money
from the VC ecosystem.
>> These are court cases that will define
the future of human civilization, because
clearly data drives a lot of this. And
there's this very complicated human
tension. I mean, you can empathize;
you're both authors. There's some degree
to which, I mean, you put your heart and
soul and your sweat and tears into the
writing that you do, and it feels a
little bit like theft for somebody to
train on your data without giving you
credit.
>> And, like Nathan said, there are also
two layers to it. Someone might buy the
book and then train on it, which could be
argued fair or not fair, but then there
are literally, straight up, companies who
use pirated books, where it's not even
compensating the author. That is, I
think, where people got a bit angry,
specifically.
>> Yeah, but there has to be some kind of
compensation scheme. This is like moving
towards something like what Spotify
streaming did originally for music. What
does that compensation look like? You
have to define those kinds of models. You
have to think through all of that.
One other thing I think people are
generally curious about, and I'd love to
get your thoughts: as LLMs are used more
and more, if you look at even arXiv, but
also GitHub, more and more of the data is
generated by LLMs. What do you do in that
kind of world? How big of a problem is
that?
>> The largest problem is infrastructure
and systems, but from an AI point of
view, it's kind of inevitable.
>> So it's basically LLM-generated data
that's curated by humans, essentially,
right?
>> Yes. And I think a lot of open source
contributors are legitimately burning
out. If you have a popular open source
repo, somebody's like, "Oh, I want to do
open source AI, it's good for my career,"
and they just vibe-code something and
throw it at you. You might get more of
this than I do.
>> Yeah, I actually have a case study
here. I have a repository called mlxtend
that I developed as a student, I don't
know, 10 or 15 years ago, and it is still
a reasonably popular library for certain
algorithms, I think especially the
frequent pattern mining stuff. Recently,
two or three people submitted a lot of
PRs in a very short amount of time, and I
do think LLMs have been involved in
submitting these PRs. As the maintainer,
there are two things. First, I'm a bit
overwhelmed; I don't have time to read
through it all, especially since it's an
older library that is not a priority for
me. At the same time, I kind of also
appreciate it, because I think something
people forget is that it's not just using
the LLM; there's still a human layer that
verifies something. And that is, in a
sense, also how data is labeled, right?
One of the most expensive things is
getting labeled data for RL and human
feedback phases, and this is kind of like
that: it goes through a human, and then
you actually get higher-quality data out
of it. So I don't mind it, in a sense. It
can feel overwhelming, but I do think
there is also value in it.
>> It feels like there's a fundamental
difference between raw LLM-generated data
and LLM-generated data with a human in
the loop who does some kind of
verification, even if that verification
covers only a small percentage of the
lines of code.
>> Mhm. I think this goes for anything.
People sometimes think: oh, I can just
use an LLM to learn about XYZ. Which is
true, you can. But there might be a
person who is an expert, who might have
used an LLM to write specific code, and
there's this human work that went into
making it nice and throwing out the
not-so-nice parts, predigesting it for
you, and that saves you time. That's the
value-add: you have someone filtering
things, or even just using the LLMs
correctly. This is still labor you get
for free if you, for example, read an
article, let's say a Substack article. I
could maybe ask an LLM to give me
opinions on the same topic, but I
wouldn't even know what to ask. I think
there is still value in reading that
article compared to me going to the LLM,
because you are the expert: you select
what knowledge is actually spot-on and
should be included, and you give me this
executive summary. And that is a huge
value-add, because now I don't have to
waste three to five hours going through
it myself, maybe picking up some
incorrect information along the way. So I
think that's also where the future still
is for writers: even though there are
LLMs, the expert can save you time.
>> It's kind of fascinating, actually,
and I'm sure you guys do this, but for me
to look at the difference between a
summary and the original content, even if
it's a page-long summary of page-long
content, it's interesting to see how the
LLM summary takes the edge off. What is
the signal it removes from the thing?
>> The voice is what I talk about a lot.
>> Voice. Well, I would love to hear what
you mean by voice; that's really
powerful. But sometimes there are
literally insights, and in removing an
insight, you're actually fundamentally
changing the meaning of the thing. So I'm
continuously disappointed by how bad LLMs
are at really getting to the core
insights, which is what a great summary
does. Even when I use these extensive,
extremely elaborate prompts where I'm
really trying to dig for the insights,
it's still not quite there. Which, I
mean, is a whole deep philosophical
question about what human knowledge and
wisdom are and what it means to be
insightful. But when you talk about the
voice, what do you mean?
>> So when I write, I think a lot of what
I'm trying to do is take what you think
as a researcher, which is very raw. A
researcher is trying to encapsulate an
idea at the frontier of their
understanding; they're trying to put what
is a feeling into words. I try to do this
in my writing, which makes it come across
as raw but also high-information, in a
way where some people will get it and
some won't, and that's kind of the nature
of research. And I think this is
something that language models don't do
well, particularly because they're all
trained with this reinforcement learning
from human feedback, which is designed to
take feedback from a lot of people and,
in a way, average how the model behaves
from that. It's going to be hard for a
model to be very incisive when there's
that sort of filter in it. And I think
this is kind of a wonderful fundamental
problem for researchers in RLHF: it
provides so much utility in making the
models better, but the problem
formulation has this knot in it that you
can't get past. So that's what I think
of: these language models don't have this
prior and this deep expression that
they're trying to get at. I don't think
it's impossible to do; there are stories
of models that really shocked people. I
would love to have tried Bing Sydney.
Does that have more voice? Because it
would so often go off the rails on people
in what is, historically, obviously a
scary way. Telling a reporter to leave
his wife is a crazy model to potentially
put into general adoption. But that's the
trade-off: is this RLHF process in some
ways adding limitations? That's a
terrifying place to be as one of these
frontier labs and companies, because
millions of people are using them.
>> There was a lot of backlash last year
with GPT-4o getting removed. I personally
never used the model, but I've talked to
people at OpenAI, and they're at the
point where they get emails from users
who might be detecting subtle differences
in the deployments in the middle of the
night, and they email them, like, "My
friend is different." Users find these
employees' emails and send them things,
because they are so attached to what is a
set of model weights and a configuration
deployed to the users. We see this with
TikTok. You open it, and I don't use
TikTok, but supposedly in like five
minutes the algorithm gets you; it's
locked in. And those aren't language
models doing the recommendations, but I
think there are ways you can do this with
a language model, where within five
minutes of chatting with it, the model
just gets you. And that is something that
people aren't really ready for. Like,
kids: don't give that to kids, at least
until we know what's happening.
>> But there's also going to be this
mechanism, what's going to happen with
these LLMs as they're used more and more.
Unfortunately, the nature of the human
condition is such that people commit
suicide, and what journalists will do is
report extensively on the people who
commit suicide, and they will very likely
link it to the LLMs, because they have
that data about the conversations. If
you're really struggling in your life, if
you're depressed, if you're thinking
about suicide, you're probably going to
talk to LLMs about it. And so journalists
will say, "Well, the suicide was
committed because of the LLM." And that's
going to lead to the companies, because
of legal issues and so on, more and more
taking the edge off of the LLM, so it's
going to be as generic as possible. It's
so difficult to operate in this space,
because of course you don't want an LLM
to cause harm to humans at that level.
But the nature of the human experience is
also to have a rich conversation, a
fulfilling conversation, one that
challenges you, from which you grow. You
need that edge. And that's something
extremely difficult for AI researchers on
the RLHF front to actually solve, because
you're actually dealing with the human
condition.
>> A lot of researchers at these
companies are so well motivated, and the
likes of Anthropic and OpenAI culturally
so want to do good for the world through
this. And it's such a hard area. I'm
like, I don't want to work on this,
because on the one hand, a lot of people
see AI as a health ally, as somebody they
can talk to about their health
confidentially, but then it bleeds all
the way into talking about mental health,
and things where it's heartbreaking that
this could be the thing that pushes
somebody over the edge, while other
people might be saved. There are things
that, as a researcher training models, I
don't want to do. I don't want to train
image generation models and release them
openly, because I don't want to enable
somebody to have a tool on their laptop
that can harm other people; I don't have
the infrastructure at my company to do
that safely. There are a lot of areas
like this where it just needs people who
will approach it with the complexity and
the conviction of: it's just such a hard
problem.
>> But also, we as a society, as users of
these technologies, need to make sure
we're having the complicated conversation
about it, versus just fear-mongering: big
tech is causing harm to humans, or
stealing your data, all that kind of
stuff. It's more complicated than that.
And you're right, there's a very large
number of people inside these companies,
many of whom you know, many of whom I
know, who deeply care about helping
people. They are considering the full
human experience of people from across
the world, not just Silicon Valley:
people across the United States, people
across the world, what that means, what
their needs are. It's really difficult to
design this one system that is able to
help all these different kinds of people,
across different age groups, cultures,
mental states, mental conditions, all
that kind of stuff.
>> I've wished that the timing of AI was
different with respect to the
relationship of big tech to the average
person. Big tech's reputation was so low,
and with how AI is so expensive, it's
inevitably going to be a big tech thing,
because it takes so many resources, and
people say the US is, quote unquote,
betting the economy on AI with this
buildout. To have these be intertwined at
the same time just makes for such a hard
communication environment. It would be
good for me to go talk to more people in
the world who hate big tech and see AI as
a continuation of that.
>> And one of the things you actually
recommend, one of the antidotes that you
talk about, is to find agency in this
whole system, as opposed to sitting back
in a powerless way and consuming the AI
slop as it rapidly takes over the
internet. You find agency by using AI to
build stuff: build apps, build things.
One, that actually helps you build the
intuition, but two, it's empowering,
because you're going to understand how it
works and what the weaknesses are, and it
gives your voice power to say: this is
fucked up, this is bad, this is a bad use
of the technology, and this is a good use
of the technology. You're more plugged
into the system, so you can understand it
better and steer it better as a consumer.
>> I think it's a good point you brought
up, agency, instead of ignoring it and
saying, "Okay, I'm not going to use it."
I think it's probably long-term healthier
to say, "Okay, it's out there, I can't
put it back," you know, like the internet
and computers back when they came out.
How do I make the best use of it, and how
does it help me level myself up? The one
thing I worry about here, though, is that
if you just fully use it for something
you love to do, the thing you love to do
is no longer there, and that could
potentially, I feel, lead to burnout. For
example, if I use an LLM to do all my
coding for me, now there's no coding; I'm
just managing something that is coding
for me. Let's say two years later, if I
just do that eight hours a day, having
something code for me, do I still feel
fulfilled? Is this hurting me in terms of
being excited about my job, excited about
what I'm doing? Am I still proud to build
something?
>> So, on that topic of enjoyment, it's
quite interesting. We should just throw
this in there: there is this recent
survey of about 791 professional
developers, professional meaning 10-plus
years of experience.
>> That's a long time.
>> Yeah.
>> As a junior developer.
>> Yeah, in this day and age. So the
results here are surprising on many
fronts. They break it down by junior and
senior developers, but it shows that both
junior and senior developers use
AI-generated code in code they ship. So
this is not just for fun or intermediate
learning things; this is code they ship.
And it's significant: it's 25% and up,
with most of them using around 50% or
more. What's interesting is that for the
category where over 50% of the code you
ship is AI-generated, senior developers
are much more likely to be in it. But you
don't want AI to take away the thing you
love.
>> Yeah.
>> I think this speaks to my experience,
these particular results I'm about to
say: together, about 80% of people find
it either somewhat more enjoyable or
significantly more enjoyable to use AI as
part of their work.
>> I think it depends on the task. For my
personal usage, for example, I have a
website where I sometimes tweak things,
and I personally don't enjoy this. So in
that sense, if the AI can help me
implement something on my website, I'm
all for it; it's great. But then, at the
same time, when I solve a complex
problem: well, if there's a bug and I
hunt this bug and I find the bug, it's
the best feeling in the world. You get so
much joy; you feel great. But now, if you
don't even think about thinking about the
bug, you just go directly to the LLM,
well, you never have that kind of
feeling, right? But then there could be
the middle ground, where you try
yourself, you can't find it, you use the
LLM, and then you don't get frustrated,
because it helps you and you move on to
something that you enjoy. So looking at
these statistics, I think what's not
factored in is that they average over all
the different scenarios: we don't know if
it's for the core task or for something
mundane that people would not have
enjoyed otherwise. In a sense, AI is
really great for doing mundane things
that take a lot of work. For example, my
wife, the other day: she has a podcast
for book discussions, a book club, and
she was transferring show notes from
Spotify to YouTube, and the links somehow
broke. In some episodes, because it's
customary to list many books, there were
like 100 links or something, and it would
have been really painful to go in there
and fix each link manually. So I
suggested: hey, let's try it. We copied
the text into ChatGPT and it fixed them,
instead of two hours of going from link
to link. It made that type of work much
more seamless; there was no frustration.
I think everyone has a use case where AI
is useful for something like that,
something that would be really boring,
really mundane.
>> For me personally, since we're talking
about coding, and you mentioned
debugging: a lot of the source of the
enjoyment for me, more on the Cursor side
than the Claude Code side, is that I have
a friend, I have a co-, what's that
called, a pair programmer. It's less
lonely.
>> You made debugging sound like this
great joy.
>> No, I would say debugging is like a
drink of water after you've been going
through a desert for days. So you skip
the whole desert part where you're
suffering. Sometimes it's nice to have a
friend who can't really find the bug but
can give you some intuition about the
code, and you're together with that
friend going through the desert, and then
together you find that drink of water. So
at least for me, maybe it speaks to the
loneliness of the programming experience;
that is a source of joy.
>> Yeah, it's maybe also
related to delayed gratification. I'm a
person who, even as a kid, liked the idea
of Christmas presents, having them
coming, better than actually getting the
presents. I would look forward to the day
I get the presents, but then it's over
and I'm disappointed. And maybe it's
something like that with, let's say,
food: I think food tastes better when
you're really hungry. And with debugging,
you're right, it is not always great;
it's often frustrating, but if you can
solve it, then it's great. There's also a
sweet Goldilocks zone: if it's too hard,
then it's wasting your time. But I think
there is another challenge here: how will
people learn? In the chart we looked at,
we saw that more senior developers are
shipping more AI-generated code than the
junior ones, and I think that's very
interesting, because intuitively you
would think it's the junior developers,
because they don't know how to do the
thing yet, since they are more junior, so
they'd use AI to do that thing. It could
either mean the AI is not good enough yet
to solve that task, or it could mean
experts are more effective at using it:
they know where and how to use it, they
review the code, and they trust the code
more. So I think one issue for society in
the future will be: how do you become an
expert if you never try to do the thing
yourself? For me, how I learn is by
trying things myself. With math
textbooks, if you look at the solutions,
yeah, you learn something, but I think
you actually learn better if you try
first, and then you appreciate the
solution differently, because you know
how to put it into your mental framework.
And if LLMs are here all the time, would
you actually go through the struggle?
Would you be willing to struggle? Because
struggling is not nice, right? And if you
use the LLM to do everything, at some
point you will never really take the next
step, and then maybe you won't get that
unlock that you would get as an expert
using an LLM. So I think there's a
Goldilocks sweet spot, and maybe the
trick here is to make dedicated offline
time, where you study two hours a day and
the rest of the day use LLMs. I think
it's important for people to still invest
in themselves, in my opinion, and not
just LLM everything.
>> Yeah, and we, together as a
civilization and each of us individually,
have to find that Goldilocks zone, in the
programming context and as developers.
Now, we've had this fascinating
conversation that started with
pre-training and mid-training. Let's get
to post-training. A lot of fun stuff in
post-training. So what are some of the
interesting ideas in post-training?
>> The biggest one from 2025 is this
reinforcement learning with verifiable
rewards. You can scale up the training
there, which means doing a lot of this
kind of iterative generate-grade loop,
and that lets the models learn
interesting behaviors on the tool-use and
software side, which could be searching,
running commands on their own, and seeing
the outputs. That training also enables
this inference-time scaling very nicely.
It just turned out that the paradigms
were very nicely linked, in that this
kind of RL training enables
inference-time scaling, but
inference-time scaling could have been
found in different ways. So it was kind
of this perfect storm: the models changed
a lot, the way they're trained is a major
factor in doing so, and this has changed
how people approach post-training
dramatically.
>> Can you describe RLVR, popularized by
DeepSeek R1? Can you describe how it
works?
>> Yeah, fun fact: I was on the team that
came up with the term RLVR, which is from
our Tulu 3 work before DeepSeek. We don't
take a lot of credit for being the people
who popularized scaling RL, but, as an
aside, what academics get is the ability
to name things and influence the
discourse, because the closed labs can
only say so much. One of the things you
can do as an academic is, you might not
have the compute to train the model, but
you can frame things in a way that a
community can come together around, like
this RLVR term, which is very fun. And
then DeepSeek is the group that did the
training breakthrough, which is that they
scaled the reinforcement learning: you
have the model generate answers, then
grade the completion on whether it was
right, and that accuracy is your reward
for reinforcement learning. Reinforcement
learning is classically an agent that
acts in an environment, and the
environment gives it a state and a reward
back, and you try to maximize this
reward. In the case of language models,
the reward is normally accuracy on a set
of verifiable tasks, whether it's math
problems or coding tasks, and it starts
to get blurry with things like factual
domains, which are also in some ways
verifiable, or constraints on your
instruction, like "respond only with
words that start with A." All of these
things are verifiable in some way. The
core idea is that you find a lot more of
these verifiable problems and you let the
model try them many times while taking
these RL steps, these RL gradient
updates. The infrastructure evolved from
reinforcement learning from human
feedback, where in that era the score
they were trying to optimize was a
learned reward model of aggregate human
preferences. So you change the problem
domains, and that lets the optimization
go to much bigger scales, which
kickstarted a major change in what the
models can do and how people use them.
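The "verifiable" part can be as simple as an exact-match check on the final answer. A minimal sketch; the \boxed{} convention is a common one for math benchmarks and is an assumption here, not a universal standard:

```python
import re

# Verifiable reward for RLVR: grade a completion purely by whether its
# final answer matches the reference, ignoring how it reasoned.

def verifiable_reward(completion: str, reference_answer: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == reference_answer.strip() else 0.0

# The model can reason however it wants for thousands of tokens;
# only the checkable final answer contributes to the reward.
print(verifiable_reward(r"... so the total is \boxed{42}", "42"))  # 1.0
```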
>> What kinds of domains is RLVR amenable
to?
>> Math and code are the famous ones, and
then there's a lot of work on what are
called rubrics, which is related to a
term people might have heard,
LLM-as-a-judge. For each problem in my
training data set, I'll have another
language model and ask it what a good
answer to this problem would look like,
and then you can try the problem a bunch
of times, over and over again, and assign
a score based on this rubric. So that's
not necessarily verifiable like a math or
code domain, but this rubrics idea, and
other scientific problems that might be a
little more vague, is where a lot of the
attention is: they're trying to push this
set of methods into these more open-ended
domains, where the models can learn a lot
more.
>> I think that's called reinforcement
learning from AI feedback, right?
>> That's the older term for it; it was
coined in Anthropic's constitutional AI
paper. A lot of these things come in
cycles.
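A sketch of rubric-based grading with an LLM judge, for domains with no exact verifiable answer. The `judge_llm` callable is a hypothetical stand-in returning text; nothing here is a specific lab's API.

```python
# Rubric scoring with an LLM judge: ask a second model to grade an answer
# against a rubric, then normalize the score into a reward for RL.

RUBRIC_PROMPT = """You are grading an answer to: {question}
A good answer should: {rubric}
Score the answer from 0 to 10. Reply with only the number.
Answer: {answer}"""

def rubric_reward(judge_llm, question, rubric, answer):
    reply = judge_llm(RUBRIC_PROMPT.format(
        question=question, rubric=rubric, answer=answer))
    try:
        return float(reply.strip()) / 10.0    # normalize to [0, 1] for RL
    except ValueError:
        return 0.0                            # unparseable judge output
```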
>> Also, just one step back on RLVR. I
think the interesting, beautiful thing
here is that you ask the LLM, let's say,
a math question, and you know the correct
answer, and you let the LLM, like you
said, figure it out. But how it does it,
you don't really constrain much. There
are some constraints you can add, like:
use the same language, don't switch
between Spanish and English. But let's
say you're pretty much hands-off: you
only give the question and the answer,
and then the LLM's task is just to arrive
at the right answer. The beautiful thing
is that, in practice, the LLM will do a
step-by-step derivation, like a student
or a mathematician deriving the solution.
It will use those steps, and that
actually helps the model improve its own
accuracy, and then, like you said, comes
the inference scaling. Inference scaling
loosely means spending more compute when
using the LLM during inference, and here
the inference scaling is that the model
uses more tokens. I think in the R1 paper
they showed that the longer they train
the model, the longer the responses get;
they grow over time, the model uses more
tokens, so it becomes more expensive,
even for simple tasks, but these
explanations help the model with
accuracy. It's also interesting that a
lot of papers show that what the model
explains does not necessarily have to be
correct, or may even be unrelated to the
answer, but for some reason it still
helps the model, just the fact that it is
explaining. And again, I don't want to
anthropomorphize these LLMs, but it's
kind of like how we humans operate,
right? If there's a complex math problem,
let's say in a math class, you usually
have note paper and you do it step by
step, you cross things out. And the model
also self-corrects. That was, I think,
the aha moment in the R1 paper; they
called it the aha moment because the
model itself recognized that it made a
mistake and said, "Ah, I did something
wrong, so let me try again." I think it's
just so cool that this falls out of
simply giving it the correct answer and
having it figure out how to get there,
that it does, in a sense, what a human
would do. Although LLMs don't think like
humans, it's an interesting coincidence.
And the other nice side effect is that
it's often great for us humans to see
these steps: it builds trust, but we also
learn, and we can double-check things.
>> There's a lot in here.
I think there's been a lot of debate this
year on whether these aha moments are
kind of fake, because in pre-training the
model has essentially seen the whole
internet. So it has definitely seen
people explaining their work, even
verbally, like a transcript of a math
lecture: you try this, oh, I messed this
up. What reinforcement learning, this
RLVR, is very good at doing is amplifying
these behaviors, because they're very
useful in enabling the model to think
longer and to check its work. And I agree
that it is very beautiful that through
this training the model learns to amplify
this in a way that is so useful for
making the final answers better.
>> I can give you a hands-on example. I
was training the Qwen 3 base model with
RLVR on MATH-500. The base model had an
accuracy of about 15%. Just 50 steps, in
a few minutes, with RLVR, the model went
from 15% to 50% accuracy. And you can't
tell me it's learning anything
fundamental about math there.
>> The Qwen example is weird, because
there have been two papers this year, one
of which I was on, that talk about data
contamination in Qwen, and specifically
that they trained on a lot of this
special mid-training phase that we spent
a minute on. It's weird: they train on
problems that are almost identical to
that benchmark.
>> Exactly. And so you can see that
basically the RL is not teaching the
model any new knowledge about math. You
can't do that in 50 steps. The knowledge
is already there from the pre-training;
you're just unlocking it.
I still disagree with the kind of
premise because there's a lot of weird
complexities that you can't prove
because one of the things that points to
weirdness is that if you take the Quen 3
so-called base model and you can you
could Google on the screen you could
Google like math data set hugging face
and you could take a problem and what
you do if you put it into quen 3 base
the all these math problems have words
so it be like Alice has five apples and
takes one and gives three to whoever and
there are these word problems with these
Quenbased models why people are
suspicious of them is if you change the
numbers but keep the words
>> Quen will produce like a very high
without tools will produce a very high
accuracy like decimal representation of
the answer which means there's some like
at some time it was shown problems that
were almost identical to the test set
and it was using tools to get a very
high precision answer but a language
model without tools will never actually
have this. So, it's kind of been this
So it's been this big debate in the research community: how much can you believe these reinforcement learning papers that are training on Qwen and measuring specifically on this math benchmark, where there have been multiple papers talking about contamination? And I think this is what caused the reputation of RLVR being about formatting, because you can get these gains so quickly, and therefore it must already be in the model. But there's a lot of complexity here; it's not really controlled experimentation, so we don't really know. But if it weren't true, I would say distillation wouldn't work, right? I mean, distillation can work to some extent, but the thing is, I think contamination is the biggest problem in LLM research, because we don't know what's in the data. Unless you have a new dataset, it's really impossible to tell. And the same goes for what you mentioned, the MATH dataset, which is a question, then an answer, and an explanation. But then even something simpler, like MMLU, which is a multiple-choice benchmark: if you just change the format slightly, say you use "A." with a period instead of "A)" with a parenthesis or something like that, the model accuracy will vastly differ.
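To illustrate how superficial that sensitivity is, here are two renderings of the same hypothetical multiple-choice item; only the option delimiter differs, yet measured accuracy can swing noticeably between such surface forms.

```python
# Two renderings of one multiple-choice question. The content is identical;
# only the option delimiter changes ("A)" vs "A."), which is exactly the kind
# of formatting shift that can move a model's measured benchmark accuracy.
question = "Which planet is closest to the sun?"
options = ["Mercury", "Venus", "Earth", "Mars"]

paren_style = question + "\n" + "\n".join(
    f"{letter}) {opt}" for letter, opt in zip("ABCD", options))
dot_style = question + "\n" + "\n".join(
    f"{letter}. {opt}" for letter, opt in zip("ABCD", options))

print(paren_style)
print()
print(dot_style)
```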
>> I think that could be a model issue rather than a general issue. It's not even malicious by the developers of the LLM, like, hey, we want to cheat at that benchmark. It's just that it has seen something at some point, and I think the only fair way to evaluate an LLM is to have a new benchmark created after the cutoff date when the LLM was deployed. Can we lay out what the recipe of all the things that go into post-training would be? You mentioned RLVR was a really exciting, effective thing, so maybe we should elaborate. RLHF still has a really important role to play. What kind of other ideas are there in post-training?
>> I think you can take this in order. You could view it as what made o1, the first reasoning model, possible, or what the latest models do; you're going to have similar interventions in both. You start with mid-training, and the thing that is rumored to have enabled o1 and similar models is really careful data curation, where you're providing a broad set of what are called reasoning traces, which is just the model generating words in a forward process that reflects breaking down a problem into intermediate steps and trying to solve them. So at mid-training you need to have data similar to this, so that when you move into post-training, primarily with these verifiable rewards, it can learn.
And then what is happening today is you're figuring out which problems to give the model, how long you can train it for, and how much inference you can enable the model to use when solving these verifiable problems. As models get better, certain problems are no longer useful: the model will solve them 100% of the time, and therefore there's very little signal in them. If we look at the GRPO equation, it is famous for this, because essentially the reward given to the agent is based on how good a given action, which here is a completion, is relative to the other answers to that same problem. So if all the completions to a problem get the same reward, there's no signal in these types of algorithms.
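For reference, a minimal sketch of the group-relative advantage at the core of GRPO, under the usual formulation where each completion's reward is normalized against the other completions sampled for the same prompt:

```python
# Group-relative advantage, GRPO-style: sample G completions per prompt,
# score each with the (verifiable) reward, and normalize within the group.
# If every completion gets the same reward, the spread is zero and the
# advantages vanish -- exactly the "no signal" case described above.
def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # mixed outcomes: useful signal
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # all the same: advantages ~ 0
```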
So what they're doing is finding harder problems, which is why you hear about things like scientific domains, where getting anything right is so hard, or lab-style environments that just generate so many tokens, or much harder software problems.
So the frontier models are all pushing into these harder domains, where they can train on more problems and the model will learn more skills at once. The RLHF link to this: RLHF has been, and still is, kind of the finishing touch on the models, where it makes models more useful by improving the organization or style or tone. Different things resonate with different audiences. Some people like a really quirky model, and RLHF can be good at enabling that personality. Some people hate this markdown bulleted-list thing that the models do, but it's actually really good for quickly parsing information, and this human feedback stage is really great for putting that into the model at the end of the day. It's what made ChatGPT so magical for people, and that use has actually remained fairly stable. This formatting can also help the models get better at math problems, for example. So the border between style, formatting, and the method you use to answer a problem: they're all very closely linked when you're training these models. That's why RLHF can still, say, make a model better at math, but the verifiable domains are a much more direct way of doing this, because it fits the problem formulation better, which is why it all ends up fitting together. But to summarize: mid-training gives the model the skills it needs to then learn. RL with verifiable rewards lets the model try a lot of times, so you put a lot of compute into trial-and-error learning across hard problems. And then RLHF is finishing the model, making it easy to use, and just rounding the model out.
>> Can you comment on the amount of compute
required for RLVR?
>> It's only gone up and up. I think Grok 4 was famous for saying they used a similar amount of compute for pre-training and post-training. Back to the scaling discussion, they involve very different hardware profiles. Pre-training is very compute-bound, which is this FLOPs discussion, just how many matrix multiplications you can get through in a given time. Because in RL you're generating these answers, you're trying the model in real-world environments, it ends up being much more memory-bound: you're generating long sequences, and the attention mechanism has this behavior where you get a quadratic increase in memory as you get to longer sequences. So the compute becomes very different. In pre-training we would talk about a model, if we go back to the Biden administration executive order, as 10 to the 25th FLOPs to train. If you're counting FLOPs in post-training, it's a lot weirder, because the reality is just: how many hours are you allocating how many GPUs for? And in terms of time, the RL compute is getting much closer, because you just can't put it all into one system. Pre-training is so computationally dense, where all the GPUs are talking to each other and it's extremely efficient, whereas RL has all these moving parts, and it can just take a long time to generate a sequence of 100,000 tokens. If you think about GPT-5.2 Pro taking an hour: what if your training run has a sample that takes an hour, and you have to handle that efficiently? So in GPU hours, or just wall-clock hours, the RL runs are probably approaching pre-training in the number of days, but they probably aren't using as many GPUs at the same time.
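For the pre-training side of that comparison, the common rule of thumb is that training FLOPs is roughly 6 x parameters x tokens, which makes the 10^25 threshold easy to sanity-check. The numbers below are illustrative, not any particular lab's run.

```python
# Rule-of-thumb pre-training cost: FLOPs ~= 6 * N_params * N_tokens.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

# A hypothetical 70B-parameter model trained on 15T tokens:
flops = train_flops(70e9, 15e12)
print(f"{flops:.2e}")   # ~6.30e+24 FLOPs
print(flops > 1e25)     # False: just under the 1e25 reporting threshold
```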
There are rules of thumb in labs, like you don't want your pre-training runs to last more than about a month, because they fail catastrophically. And if you were planning for a huge cluster to be held for two months and it fails on day 50, the opportunity cost is just so big. So people don't want to put all their eggs in one basket. GPT-4 was the ultimate YOLO run that nobody ever wanted to do before, where it took something like three months to train and everybody was shocked that it worked; I think people are a little more cautious and incremental now. So RLVR is more, let's say, unlimited in how much you can train and still get benefit, whereas with RLHF, because it's preference tuning, you reach a point where it doesn't really make sense to spend more RL budget on it.
So, just to step back with preference tuning: multiple people can give multiple explanations for the same thing, and they can both be correct. But at some point you learn a certain style, and it doesn't make sense to keep iterating on it. My favorite example is relatives asking me what laptop they should buy. I give them an explanation, or I ask them, what is your use case? They, for example, prioritize battery life and storage. Other people, like us, would prioritize RAM and compute. Both answers are correct, but different people require different answers, and with preference tuning you're trying to average somehow: you ask data labelers to give you the preferred answer, and you train on that. But at some point you have learned that average preferred answer, and I think there's no reason to keep training longer on it, because it's just a style. With RLVR, on the other hand, you literally let the model solve more and more complex, difficult problems, so I think it makes more sense to allocate more budget long-term to RLVR. Also, right now we are in RLVR 1.0 land, where it's still that simple thing where we have a question and an answer, but we don't do anything with the stuff in between. There are multiple research papers, also by Google, for example, on process reward models, which also give scores for the explanation: how correct is the explanation? And I think that will be the next thing, let's say RLVR 2.0 for this year: focusing on what's between question and answer, how to leverage that information, the explanation, to improve the explanation and help it reach better accuracy. That's one angle. And there was a DeepSeek-Math V2 paper with interesting inference scaling, where they developed models that grade themselves with a separate model, and I think that will be one aspect. The other, like Nathan mentioned, will be RLVR branching into other domains.
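To make the process-reward idea concrete, here's a minimal sketch; `score_step` is a hypothetical stand-in for a trained process reward model, and the newline-based step splitting is just one common convention.

```python
# Sketch of process-level vs outcome-level scoring. An outcome reward scores
# only the final answer; a process reward model (PRM) scores every
# intermediate step of the reasoning trace.

def score_step(step: str) -> float:
    """Hypothetical PRM call: returns P(this step is correct).
    A real PRM is a trained model; here it is just stubbed out."""
    raise NotImplementedError("plug in a trained process reward model")

def process_reward(reasoning: str) -> list[float]:
    # Split the trace into steps (often done on newlines or step markers),
    # then score each step individually rather than only the final answer.
    steps = [s.strip() for s in reasoning.split("\n") if s.strip()]
    return [score_step(s) for s in steps]

# Usage: process_reward("Step 1: ...\nStep 2: ...\nSo the answer is 42.")
```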
>> The place where people are excited is value functions, which is pretty similar. Process reward models assign how good something is to each intermediate step in a reasoning process, where value functions apply a value to every token the language model generates. Both of these have been largely unproven in the language modeling and reasoning-model era. People are more optimistic about value functions now, for whatever reason. Process reward models were tried a lot more in the pre-o1, pre-reasoning-model era, and a lot of people had a lot of headaches with them. So I think a lot of it is human nature: value models have a very deep history in reinforcement learning. Training value models was one of the first things core to deep reinforcement learning existing. So right now in the literature people are excited about trying value models, but there's very little proof behind it, and there are negative examples from trying to scale up process reward models. These things don't always hold in the future.
I think we came to this discussion by talking about scaling, and a simple way to summarize what you're saying about not wanting to do too much RLHF is how the signal scales. People have worked on RLHF for language models for years, especially intensely after ChatGPT, and the first release of a reasoning model trained with RLVR, OpenAI's o1, had a scaling plot where if you increase the training compute logarithmically, you get a linear increase in evaluations. This has been reproduced multiple times; I think DeepSeek had a plot like this. But there's no scaling law for RLHF where, if you log-increase the compute, you get more performance. In fact, the seminal scaling paper for RLHF is "Scaling Laws for Reward Model Overoptimization." So that's a big line to draw with RLVR: the methods we have now and in the future will follow the scaling paradigm, where the best runs can be let to run for an extra 10x and you get a few-x performance gain, but you can't do this with RLHF. And that is just going to be field-defining in how people approach them. I'm a shill for people in academia doing RLHF, and that's a good way to describe it: to do the best RLHF you might not need the extra 10 or 100x of compute, but to do the best RLVR you do. There's what I'd call a seminal paper that came out of a Meta internship, called something like "The Art of Scaling Reinforcement Learning with Language Models"; the framework they describe is ScaleRL, and their incremental experiment was something like 10,000 B200 hours, which is thousands or tens of thousands of dollars per experiment, and they do a lot of them. That cost is just not accessible to the average academic, which is a hard equilibrium, trying to figure out how each community can learn from the other.
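The scaling claim in question is that evaluation scores grow roughly linearly in the log of RL training compute. A tiny illustration of fitting such a curve, with entirely made-up numbers:

```python
# Illustrative only: the o1-style scaling plot is "evaluation score grows
# linearly in log(training compute)". Fit score = a + b * log10(compute).
import math

compute = [1e20, 1e21, 1e22, 1e23]   # made-up compute budgets
scores  = [30.0, 41.0, 52.0, 63.0]   # made-up eval scores

xs = [math.log10(c) for c in compute]
n = len(xs)
b = (n * sum(x * y for x, y in zip(xs, scores)) - sum(xs) * sum(scores)) / \
    (n * sum(x * x for x in xs) - sum(xs) ** 2)
a = (sum(scores) - b * sum(xs)) / n
print(f"score ~ {a:.1f} + {b:.1f} * log10(compute)")  # each 10x compute buys ~b points
```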
>> I was wondering if we could take at this
point a bit of a tangent and talk about
education and learning. If you're
somebody listening to this who's a smart
person interested in programming,
interested in AI, so I presume building
something from scratch is a good
beginning. So can you just take me
through like what you would recommend
people do? So I would personally start, like you said, by implementing a simple model from scratch that you can run on your computer. The goal, if you build a model from scratch, is not to have something you use every day for your personal projects; it's not going to be your personal assistant replacing an existing open-weight model or ChatGPT. It's to see what exactly goes into the LLM, what exactly comes out of the LLM, and how the pre-training works, preferably on your own computer. And then you learn about the pre-training, the supervised fine-tuning, the attention mechanism. You get a solid understanding of how things work. But at some point you will reach a limit, because small models can only do so much. And the problem with learning about LLMs at scale is that it's exponentially more complex to make a larger model, because it's not just that the model becomes larger. You now have to think about sharding your parameters across multiple GPUs. Even for the KV cache there are multiple ways you can implement it. One, just to understand how it works, is to grow the cache step by step, say by concatenating lists. But that wouldn't be optimal on GPUs; you wouldn't do that. You would pre-allocate a tensor and fill it in. But that adds another 20-30 lines of code, and for each thing you add so much code. I think the trick with a book is basically to understand how the LLM works. It's not going to be your production-level LLM, but once you have that, you can understand the production-level LLM.
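A minimal PyTorch sketch of the two KV-cache styles just described; shapes and names are illustrative, showing one head's key cache only.

```python
# Two KV-cache styles for one attention head (illustrative shapes).
import torch

d_head, max_len = 64, 1024

# Style 1: grow by concatenation -- easy to read, but re-allocates every step.
cache_k = torch.empty(0, d_head)
def append_naive(k_new):                 # k_new: (1, d_head) for the newest token
    global cache_k
    cache_k = torch.cat([cache_k, k_new], dim=0)

# Style 2: pre-allocate a buffer and fill it in -- the GPU-friendly version.
cache_k_buf = torch.zeros(max_len, d_head)
pos = 0
def append_prealloc(k_new):
    global pos
    cache_k_buf[pos] = k_new.squeeze(0)  # write in place, no re-allocation
    pos += 1

for _ in range(3):
    k = torch.randn(1, d_head)
    append_naive(k)
    append_prealloc(k)
print(cache_k.shape, pos)  # torch.Size([3, 64]) 3
```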
>> So you're trying to always build an LLM
that's going to fit on one GPU.
>> Yes, most of them do. I have some bonus materials on some models; I think one or two of them may require multiple GPUs, but the goal is to have it on one GPU. And the beautiful thing is you can self-verify; it's almost like RLVR. When you code these from scratch, you can take an existing model from the Hugging Face Transformers library. The Transformers library is great, but if you want to learn about LLMs, I think it's not the best place to start, because the code is so complex: it has to fit so many use cases, some people use it in production, it has to be sophisticated, and it's really intertwined and really hard. It's not linear to read. It started as a fine-tuning library and then grew to be the standard representation of every model architecture and the way they're loaded. So Hugging Face is the default place to get a model, and Transformers is the software that enables it, so people can easily load a model and do something basic with it.
>> And all frontier labs that have open-weight models have a Hugging Face Transformers version of them, from DeepSeek to GPT-OSS. So it's the canonical set of weights that you can load there. But again, even the Transformers library is not used in production. People then use SGLang or vLLM, and that adds another layer of complexity.
>> We should say that the transformers
library has like 400 models.
>> So it's one library that tries to implement a lot of LLMs, and so you have a huge code base. It's, I don't know, maybe hundreds of thousands of lines of code, and understanding the part you want to understand is finding the needle in the haystack. But what's beautiful about it is that you have a working implementation, so you can work backwards from it. What I would recommend doing, and what I also do: if I want to understand, for example, how OLMo 3 is implemented, I look at the weights on the model hub and the config file, and then you can see, oh, they used so many layers, they used, say, grouped-query attention or multi-head attention. You see all the components in a human-readable, maybe 100-line config file. Then you start with, say, your GPT-2 model and add these things, and the cool thing is you can then load the pre-trained weights and see if they work in your model. You want to match the same output that you get with the Transformers model, and then you can use that basically as a verifiable reward to make your architecture correct. Sometimes it takes me a day; with OLMo 3 the challenge was the RoPE position embeddings. They had a YaRN extension, and there was some custom scaling there, and I couldn't quite match it. And in that struggle you come to understand things, but the cool thing is, at the end you know you have it correct, because you can unit-test it. You can check against the reference implementation, and I think that's maybe one of the best ways to learn: to basically reverse engineer something. Yeah.
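A minimal sketch of that workflow with the Hugging Face APIs mentioned above; the repo id is illustrative (check the Hub for the real one), and `MyFromScratchModel` stands in for your own implementation.

```python
# Reverse-engineering workflow: read the config from the Hub, build your own
# implementation, load the pre-trained weights, and unit-test your outputs
# against the reference Transformers implementation.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

repo = "allenai/OLMo-3-7B"  # illustrative repo id
config = AutoConfig.from_pretrained(repo)     # human-readable: layers, heads, attention type
reference = AutoModelForCausalLM.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

# my_model = MyFromScratchModel(config)                       # your implementation
# my_model.load_state_dict(remap(reference.state_dict()))    # remap names as needed

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    ref_logits = reference(**inputs).logits
#   my_logits = my_model(inputs["input_ids"])
# assert torch.allclose(my_logits, ref_logits, atol=1e-4)    # the "verifiable reward"
```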
>> I think that is something everybody interested in getting into AI today should do. And I think that's why I liked your book: I came to language models from the RL and robotics field. I had never taken the time to just learn all the fundamentals, and this transformer architecture that I describe as being so fundamental, like deep learning was a thing I had to learn in the past, and people need to do this. I think where a lot of people get overwhelmed is: how do I apply this to have impact, or find a career path? Because AI and language models make this fundamental stuff so accessible, and people with motivation will learn it, and then it's like, how do I get the cycles to contribute to research? I'm actually fairly optimistic here, because the field moves so fast that a lot of times the best people don't fully solve a problem, because there's a bigger problem to solve that's very low-hanging fruit, so they move on. A lot of what I was trying to do in this RLHF book is take post-training techniques and just describe how people think about them influencing the model, and what people are doing. And it's remarkable how many things people just stop studying, or never study. So I think people trying to get narrow after doing the fundamentals is good, and then reading the relevant papers and being engaged in the ecosystem.
The proximity that random people online have to the leading researchers is real. No one knows who all these anonymous accounts on X are; ML is very popular there for whatever reason, and they could just be random people who study the stuff deeply, especially with AI tools, just going: I don't understand this, keep digging into it, which I think is a very useful approach. There are a lot of research areas where there are maybe three papers you need to read, and then one of the authors will probably email you back. But you have to put a lot of effort into those emails to understand the field. I think for a newcomer it would easily be weeks of work to feel like they can truly grasp what is a very narrow area. But I think going narrow after you have the fundamentals can be very useful to people, because
>> it's like I became very interested in
like character training which is like
how you make the model funny or
sarcastic or serious and like what do
you do to the data to do this? And it's
like a student at Oxford reached out to
me. He's like, "Hey, I'm interested in
this." And I advised him and I was like,
"That paper now exists." And it's like,
I don't know, there's like two or three
people in the world that were very
interested in this. He's a PhD student,
which gives you an advantage. But like
for me, that was a topic I was waiting
for someone to be like, "Hey, I have
time to spend cycles on this." And I'm
sure there's a lot more very narrow
things where you're just like, "Oh, it
doesn't make sense that there was no
answer to this." And I think that it's
just like there's so much information
coming that people are like, I can't
grab onto any of these. But if you just
actually stick in an area, I think
there's a lot of interesting things to
learn. Yeah, I think you can't try to do it all, because it would be very overwhelming, and you would burn out trying to keep up with everything. For me, for example, I haven't kept up with computer vision in a long time; I just focus on LLMs. But coming back to your book: I think this is also a really great book and really good bang for the buck, because if you want to learn about RLHF, I wouldn't go out there and read RLHF papers, because you would be spending
>> I'll contradict that. I just edited the book, and there's a chapter where I had to say: these papers say one thing and those papers say another thing, and we'll see what turns out to be true.
>> What are some of the ideas we might have missed in the bigger picture of post-training, just going through some of the table of contents? So first of all, you do the problem setup and training overview: what are preferences, preference data; then the optimization tools: reward modeling, regularization, instruction tuning, rejection sampling, reinforcement learning, policy gradients, direct alignment algorithms; then constitutional AI and AI feedback, reasoning and inference-time scaling, tool use and function calling, synthetic data and distillation, evaluation; and then the open questions section: over-optimization, style and information, and product, UX, character, and post-training. So, what are some ideas worth mentioning that connect both the educational component and the research component? You mentioned character training, which is pretty interesting.
>> Character training is interesting because there's so little out there about it, but we talked about how people engage with these models, how we feel good using them because they're positive, but that can go too far; they can be too positive. It's essentially: how do you change your data and/or decision-making to make the model exactly what you want? OpenAI has this thing called a model spec, which is essentially their internal guideline for what they want the model to do, and they publish it to developers. So you can essentially tell what is a failure of OpenAI's training, where they had the intention and haven't met it yet, versus what is something they actually wanted to do and that you don't like. That transparency is very nice. But all the methods for curating these documents, and how easy it is to follow them, are not very well known. I think the way the book is designed, the reinforcement learning chapter is obviously what people want, because everybody hears about it with RLVR, and it's the same algorithms and the same math, just used in very different domains. I think the core of the preferences chapter,
which covers how messy preferences are, is essentially a rehash of a paper I wrote years ago. This is the chapter that will tell you why RLHF is never, ever fully solvable, because the way that even the RL is set up assumes that preferences can be quantified and that multiple preferences can be reduced to single values. I think it relates, in the economics literature, to the von Neumann-Morgenstern utility theorem. That is the chapter where all of that philosophical, economic, and psychological context tells you what gets compressed into doing RLHF. So you have all of this, and then later in the book you use the RL math to make the number go up. And that's why I think it would be very rewarding for people to do research on: quantifying preferences is something where humans have designed the problem in order to make preferences studyable. But there are fundamental debates. An example: in a language model response there are different things you care about, whether it's accuracy or style, and when you're collecting the data they all get compressed into "I like this more than that." That is happening, and there's a lot of research in other areas of the world on how you should actually do this. I think social choice theory is the subfield of economics around how you should aggregate preferences, and I went to a workshop that published a white paper on how you can think about using social choice theory for RLHF. So I mostly would want people who get excited about the math to come in and have things they can stumble into to learn this kind of broader context. There's also a fun thing: I keep a list of all the tech reports I like for reasoning models. So in chapter 14, which is kind of a short summary of RLVR, there's a gigantic table where I just list every single reasoning model that I like. I think in education, a lot of it at this point needs to be "what I like," because the language models are so good at the math. Take the famous paper on direct preference optimization, which is a much simpler way of solving the problem than RL: the derivations in the appendix skip steps of math, and for this book I redid the derivations and thought, what the heck is this log trick they use to change the math? But doing it with language models, they just tell you, "this is the log trick," and I don't know if I like that the math is so commoditized. I think some of the struggle in reading this appendix and following the math is good for learning.
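For reference, the DPO objective whose appendix derivation he's describing, in the standard notation: $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ controls the strength of the implicit KL constraint.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```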
>> Yeah. We're actually returning to this often on the topic of education. You've both brought up the word "struggle" quite a bit. So there is value there; if you're not struggling as part of this process, you're not fully following the proper process for learning. I suppose some of the providers are starting to work on models for education, which, I haven't used them, but I would guess they're designed to not give all the information at once
>> and make people work for it. So I think you could train models to do this, and it would be a wonderful contribution, where, like all of this stuff in the book, you have to re-evaluate every decision for it, which is such a great example. I think there's a chance you work on this at Ai2 too, which I was like, oh, I think this could be
>> It makes sense. I do something like that. I did that the other day for video games, for example. I sometimes play video games in my spare time; I like video games with puzzles, like Zelda and Metroid. And there's this new game where I got stuck, and I thought, okay, I don't want to struggle for two days, so I used an LLM. But then you say, hey, please don't add any spoilers; I'm at this point, what do I have to do next? And you can do the same thing for math, I guess, where you say: okay, I'm here, I'm getting stuck, don't give me the full solution, but what is something I could try? You carefully probe it. The problem here is that it requires discipline. There are a lot of people who enjoy math, but there are also a lot of people who need to do it for their homework, and then the shortcut is right there. We can develop an educational LLM, but the other LLM is still there, and there's still the temptation to use the other LLM. I think a lot of people, especially in college, understand this for the stuff they're passionate about. They're self-aware about it, and they understand it shouldn't be easy.
>> Like I think we just have to develop a
good taste.
>> Mhm.
>> We talk about research taste; there's a kind of learning taste too, about the stuff you should be struggling on
>> and the stuff you shouldn't be struggling on, which is tricky to know, because sometimes you don't have good long-term vision about what will actually be useful to you in your career. But you have to develop that taste. Yeah.
>> I was talking to maybe my fiancée or friends about this, and there was this brief ten-year window where all of the homework and all the exams could be digital. Before that, everybody had to do the exams in a Blue Book, because there was no other way. And now, after AI, everybody's going to need to be back in Blue Books and oral exams, because everybody can cheat so easily. It's this brief generation that had a different education system, where everything could be digital but you still couldn't cheat. And now it's just going to go back. It's very funny.
>> You mentioned character training. Just zooming out to a more general question from that topic: how much compute was required for it, and in general, are there places where not too much compute is required, where you can actually contribute as an individual researcher?
>> On the character training thing: I think this research is built on fine-tuning roughly 7-billion-parameter models with LoRA, where essentially, instead of updating the full model, you only train a small set of additional low-rank weights. I don't know exactly how many GPU hours that would take, but it's doable,
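A minimal sketch of that kind of LoRA fine-tuning setup using the Hugging Face PEFT library; the model id and hyperparameters are illustrative, not the ones used in the character-training work.

```python
# LoRA with Hugging Face PEFT: freeze the base model and train only small
# low-rank adapter matrices attached to chosen layers.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")  # any ~7B base model
config = LoraConfig(
    r=16,                                   # rank of the low-rank update matrices
    lora_alpha=32,                          # scaling factor for the update
    target_modules=["q_proj", "v_proj"],    # attach adapters to attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()          # typically well under 1% of total weights
```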
>> though not doable for every academic. The situation for some academics is so dire that the only work you can do is inference: you have closed models or open models, you get completions from them, and you can look at them and understand the models. That's very well suited to evaluation, where you want to be the best at creating representative problems that the models fail on, or that show certain abilities, and I think you can break through with this. I think the top-end goal for a researcher working on evaluation, if you want career momentum, is for the frontier labs to pick up your evaluation. You don't need every project to do this, but if you come from a small university with no compute and you figure out something Claude struggles with, and then the next Claude model has it in the blog post, there's your career rocket ship. That's hard, but if you want to scope the maximum possible impact with minimum compute, it's something like that: get very narrow, and it takes learning where the models are going. So you need to build a tool that tests not where Claude 4.5 will fail; if I'm going to start a research project, I need to think about where the models in eight months are going to be struggling.
>> Well what about developing totally novel
ideas?
>> This is a trade-off. If you're doing a PhD, you could also decide it's too risky to work on language models and go way longer term: what is the thing that's going to define language model development in 10 years? I end up being a person that's pretty practical. I went into my PhD thinking, I got into Berkeley; worst case, I get a master's and then go work in tech. And I'm very practical about the life afforded to people who work at these AI companies. OpenAI's average compensation is over a million dollars in stock a year per employee. For any normal person in the US, getting into an AI lab is transformative for your life. So I'm pretty practical: there's still a lot of upward mobility working in language models if you're focused, and the outcomes, just look at these jobs. But from a research perspective, the transformative impact, the academic awards, being the next Yann LeCun, comes from not caring about language model development very much. It's a big financial sacrifice in that case.
>> So I get to work with some awesome students, and they ask, should I go work at an AI lab? You're getting a PhD at a top school, or you're going to leave to go to a lab. I don't know. If you go work at a top lab, I don't blame you. Don't go work at some random startup that might go to zero, but if you're going to OpenAI, it could be worth leaving a PhD for. Let's think through this more rigorously. Where would you recommend people go to make a research contribution? The options are: academia, so get a PhD and spend five years publishing, where compute resources are constrained; research labs that are more focused on open-weight models; or the closed frontier labs,
>> OpenAI, Anthropic, xAI, and so on.
>> The two gradients are: the more closed, the more money you tend to get, but also the less credit. In terms of building a portfolio of things you've done, it's very clear what you have done as an academic, versus trading that fairly reasonable progression for being a cog in the machine, which could also be very fun. So they are very different career paths, but the opportunity cost of being a researcher is very high, because PhD students are paid essentially nothing. So it ends up rewarding people who have a fairly stable safety net and realize they can operate on the long term: they want to do very interesting work and get a very interesting job. It is a privileged position to be able to say, I'm going to see out my PhD and figure it out after, because I want to do this. At the same time, the academic ecosystem is getting bombarded by funding cuts and so on. So there are so many trade-offs, and I understand plenty of people who say: I can't deal with this funding search, my grant got cut for no reason by the government, I don't know what's going to happen. So I think there's a lot of uncertainty, and trade-offs that in my opinion favor just taking the well-paying job with meaningful impact. And it's not like you're getting paid to sit around at OpenAI; you're building the cutting edge of things that are
>> changing millions of people's relationship to tech,
>> but publication-wise they're increasingly secretive. So you're publishing less and less, and you are having a positive impact at scale, but you're a cog in the machine.
>> Honestly, I think it hasn't changed that much. So, I have been in academia; I'm not in academia anymore. At the same time, I wouldn't want to have missed my time in academia. But what I wanted to say first: I think it hasn't changed that much. I was using AI and machine learning methods for applications in computational biology with collaborators, and a lot of people went from academia directly to Google, and it was the same thing back then. The professors were sad that their students went into industry, because they couldn't carry on their legacy, in that sense. So I think it hasn't changed that much; the only thing that has changed is the scale. Cool stuff was always developed in industry that was closed, that you couldn't talk about. And I think the difference now is your preference: do you like to talk about your work and publish, or are you fine in a more closed setup? That's one difference, and the compensation, of course, but it has always been like that, I think. So it really depends on where you feel comfortable, and also, nothing is forever. The only new thing right now is there's a third option, which is starting a startup. A lot of people are doing startups; it's a very risky move, but it can be a high-risk, high-reward type of situation, where joining an industry lab is pretty safe. Also, in terms of upward mobility: honestly, I think once you have been at an industry lab, it will be easier to find future jobs. But then again, it's about how much you enjoy the team and working on proprietary things versus how much you like publishing work. I mean, publishing is stressful. Acceptance rates at conferences can be arbitrary, which can be very frustrating, but it's also high reward. If you have a paper published, you feel good because your name is on there; it's a real accomplishment.
>> I feel like my friends who are professors seem on average happier than my friends who work at a frontier lab, to be totally honest, because that's just grounding, and the frontier labs definitely do this 996 thing, which is essentially shorthand for working all the time.
>> Can you describe 996? It's a culture that, I believe you could say, was invented in China and adopted in Silicon Valley. What's 996? Just 9:00 a.m. to 9:00 p.m.
>> 6 days a week.
>> 6 days a week. What is that, 72 hours?
Okay. So is this basically becoming the standard at AI companies in Silicon Valley, more and more of this grind mindset? Yeah, I mean, maybe not exactly like that, but I think there is a trend towards it. And it's interesting; I think it almost flipped, because when I was in academia, I felt like that. As a professor, you had to write grants, you had to teach, and you had to do your research. It's like three jobs in one, and it is more than a full-time job if you want to be successful. And now, like Nathan just said, I feel the professors, in comparison, have less pressure or workload than at a frontier lab, because
>> They work a lot. They're just so fulfilled by working with students and having a constant runway of mentorship, and a mission that is very people-oriented. In an era when things are moving very fast and are very chaotic, I think that is very rewarding to people. Yeah. And at a startup, there's this pressure: you have to make it, and it is really important that people put in the time, but it is really hard, because you have to deliver constantly. I've been at a startup. I had a good time, but I don't know if I could do it forever. It's an interesting pace. And it's exactly like we talked about at the beginning: these models are leapfrogging each other, and the companies are constantly trying to take the next step over their competitors. It's just ruthless right now.
>> I think this leapfrogging nature, having multiple players, is actually an underrated driver of language modeling progress, where competition is so deeply ingrained in people, and these companies have intentionally created very strong cultures. Anthropic is known to be culturally deeply committed and organized; we hear so little from them, and everybody at Anthropic seems very aligned. Being in a culture that is super tight, with this competitive dynamic, talk about a thing that's going to make you work hard and create things that are better. But I think this comes at the cost of human capital:
>> you can only do this for so long, and people are definitely burning out. I wrote a post on burnout; I've drifted in and out of this myself, especially trying to be a manager while doing full model training. It's a crazy job. The book Apple in China by Patrick McGee talks about how hard the Apple engineers worked to set up the supply chains in China. They had marriage-saving programs, and he said in a podcast that people died from this level of working hard. So it's a perfect environment for creating progress at human expense, and a lot of that human expense is the 996 we started this with: people really do grind.
I also read this book. I think they had a code word for when someone had to go home to spend time with their family to save the marriage, which is crazy. Colleagues would say, okay, this is a red-alert situation; we have to let that person go home this weekend. But at the same time, I don't think they were forced to work. They were so passionate about the product, I guess, that you get into that mindset. I had that sometimes as an academic, but also as an independent person. I sometimes overwork, and it's unhealthy. I had back issues, I had neck issues, because I did not take the breaks that I maybe should have taken. But it's not because anyone forced me to. It's because I wanted to work, because
>> that's what OpenAI is like; they want to do this work.
>> Yeah. But there's also a feeling, a fervor, that's building, especially in Silicon Valley, aligned with the scaling laws idea, where there's this hype that the world will be transformed on a scale of weeks, and you want to be at the center of it. I have this great fortune of having conversations with a wide variety of human beings, and from there I get to see all these bubbles and echo chambers across the world, and it's fascinating to see how we humans form them. I think it's fair to say that Silicon Valley is a kind of echo chamber, a kind of silo and bubble. I think bubbles are actually really useful and effective; it's not necessarily a negative thing, because it can be ultra-productive. It can be the Steve Jobs reality distortion field: you convince each other the breakthroughs are imminent, and by convincing each other of that, you make the breakthroughs imminent. Mhm.
>> Byrne Hobart wrote a book classifying bubbles. Essentially, one kind is the financial bubble, which is speculation, which is bad, and the other, I don't know the term, is effectively for buildouts, because it pushes people to build these things. I do think AI is in the second kind, but I worry about it transitioning into a financial bubble.
>> Yeah, but also, in the space of ideas, that bubble is a reality distortion field, which means you are deviating from reality. And if you go too far from reality, while also working 996, you might miss some fundamental aspects of the human experience, including in Silicon Valley. This is a common problem in Silicon Valley: it's a very specific geographic area, and you might not understand the Midwest perspective, the full experience of all the other humans in the United States and across the world. You speak a certain way to each other, you convince each other of a certain thing, and that gets you into real trouble. Whether AI is a big success and becomes a powerful technology or not, on either trajectory you can get yourself into trouble, so you have to consider all of that. Here you are, a young person, trying to decide what you want to do with your life.
with your life the thing that is I don't
even really understand this but the SF
AAI memes have gotten to the point where
permanent underclass was one of them
which was the idea that the last 6
months of 2025 was the only time to
build a durable value in an AI startup
or model otherwise all the value will be
captured by existing companies and you
will therefore be poor which like that's
an example of the SF thing that goes so
far I still think for young people that
going to be able to tap into it if you
are really passionate about wanting to
have impact in AI like being physically
in SF is the most likely place where
you're going to do this but it has it
has trade-offs
>> I think SF is an incredible place, but there is a bit of a bubble, and if you go into that bubble, which is extremely valuable, also get out: read history books, read literature, visit other places in the world. Twitter is not, and Substack is not, the entire world. I would say, one of the people I worked with is moving to SF, and I need to get him a copy of Season of the Witch, which is a history of SF from roughly 1960 to 1985. It goes through the hippie revolution, the gay community kind of taking over the city and that culture emerging, and then the HIV/AIDS crisis and other things. That is so recent, so much turmoil and hurt, but also love, in SF, and no one knows about this. It's a great book, Season of the Witch; I recommend it. A bunch of my SF friends, the ones who do get out, recommended it to me, and I lived there and didn't appreciate this context, and it's just so recent. Yeah. Okay.
Let's see, we've talked about a lot of things, certainly about what was exciting last year. But for this year, one of the things you guys mentioned as exciting is the scaling of text diffusion models, and just a different exploration of text diffusion. Can you talk about what that is and what possibility it holds,
>> sort of different kinds of approaches than the current LLMs?
>> Yeah. So we've talked a lot about the transformer architecture, specifically the autoregressive transformer like GPT, but that doesn't mean no one is working on anything else. People are always on the lookout for the next big thing, because it would be almost stupid not to be. Sure, right now the transformer architecture is the thing, it works best, and there's nothing else comparable out there, but it's always a good idea to not put all your eggs in one basket. People are developing alternatives to the autoregressive transformer. One of them, for example, is text diffusion models. Listeners may know diffusion models from image generation; Stable Diffusion popularized them. There was a paper on generating images: back then people used GANs, the generative adversarial networks, and then there was this diffusion process where you iteratively denoise an image, and that resulted in really good quality images over time. Stable Diffusion came out of a company, and other companies built their own diffusion models. And then people asked: can we try this for text too? It doesn't make intuitive sense at first, because text is not something continuous like a pixel that we can differentiate; it's discrete. So how do we implement that denoising process? It's kind of similar to the BERT models by Google. If you go back to the original transformer, there was the encoder and the decoder. The decoder is what we are using right now in GPT and so forth. The encoder is more of a parallel technique, where you have multiple tokens that you fill in in parallel. GPT models are autoregressive, one token at a time: you complete the sentence one token at a time. In BERT models, you have a sentence that has gaps, you mask them out, and one iteration fills in these gaps. Text diffusion is kind of like that: you start with, say, some random text, and then you fill in the missing parts, or you refine them iteratively, and you have multiple iterations.
The cool thing here is that it can do multiple tokens at the same time, so there's the promise of it being more efficient. Now, the trade-off is, of course: how good is the quality? It might be faster, and you now have this new dimension of the denoising process: the more steps you do, the better the text becomes. So you can scale in different ways, and people are trying to see whether it's a valid alternative to the autoregressive model, in terms of giving you the same quality for less compute. Right now, there are papers suggesting that if you want the same quality, you have to crank up the denoising steps, and then you end up spending the same compute you would spend on an autoregressive model. The other downside is that it's parallel, which sounds appealing, but some tasks are not parallel, like reasoning tasks, or maybe tool use, where you have to ask a code interpreter for an intermediate result, and that is tricky with a diffusion model. There are some hybrids, but the main idea is: can we parallelize it? So it's an interesting avenue. Right now there are mostly research models out there, like LLaDA and some others; I saw some by startups, some deployed models. There is no big diffusion model at scale yet, nothing at the Gemini or ChatGPT scale, but there was an announcement by Google, or at least a site, where they said they are launching Gemini Diffusion, and they put it in the context of their Nano 2 model, saying basically that for the same quality on most benchmarks, they can generate things much faster. So, you asked what's next: I don't think text diffusion models are going to replace autoregressive LLMs, but they may be something for quick, cheap, at-scale tasks. Maybe the free tier in the future will be something like that.
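A toy sketch of the iterative-unmasking flavor of text diffusion described above (the LLaDA-style masked approach); the "model" here is a canned stand-in, since a real system would use a trained transformer predicting all masked tokens in parallel.

```python
# Toy iterative unmasking: start from all-mask "noise" and fill in several
# tokens per step. More denoising steps generally means better text, which
# is the quality/compute trade-off discussed above.
import random

TARGET = "the cat sat on the mat".split()  # stand-in for the model's preferred text
MASK = "<mask>"

def denoise_step(tokens, num_to_fill):
    """Fill a few masked positions per step. A real model would pick the
    highest-confidence positions from its token probabilities; we fake it."""
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    for i in random.sample(masked, min(num_to_fill, len(masked))):
        tokens[i] = TARGET[i]
    return tokens

tokens = [MASK] * len(TARGET)            # fully masked starting point
step = 0
while MASK in tokens:
    tokens = denoise_step(tokens, num_to_fill=2)  # multiple tokens per step, in parallel
    step += 1
    print(f"step {step}: {' '.join(tokens)}")
```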
>> I think there are a couple of examples where I've heard it's actually started to be used. To paint a picture of why this can be so much better: when GPT-5 is taking 30 minutes to respond, it's generating one token at a time, and this diffusion idea is essentially to generate all of the tokens in the completion in one batch, which is why it can be way faster. And I think it could be well suited to certain cases. The startups I'm hearing about are code startups, where you have a code base and somebody who is effectively vibe coding says, make this change, and a code diff is essentially a huge reply from the model, but it doesn't need that much external context, and you can get it really fast with these diffusion models. So one example I've heard is using text diffusion to generate really long diffs, because doing it with an autoregressive model would take minutes, and that kind of latency in a user-facing product causes a lot of churn: every second, you lose users. So I think it's going to be this thing that grows and finds some applications, but I actually thought that different types of models were going to be used for different things sooner than they have been. I think the tool-use point is what's stopping them from being general purpose, because with Claude Code, and chat with search, the autoregressive chain is interrupted by some external tool, and I don't know how to do that with the diffusion setup.
>> So what's the future of tool use, this year and in the coming years? Do you think there will be a lot of developments there, in how it's integrated into the entire stack? I do think right now it's mostly on the proprietary LLM side, but I think we will see more of it in open-source tooling, and it is a huge unlock, because then you can really outsource certain tasks from mere memorization to actual computation: instead of having the LLM memorize what 23 + 5 is, just use a calculator.
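A minimal sketch of that calculator idea: the model emits a structured tool call, and the runtime executes it and feeds the result back. The JSON shape follows the common function-calling convention but is illustrative; exact formats vary by provider.

```python
# Tool-use loop in miniature: parse the model's tool call, run the tool,
# return the result for the model to use in its final answer.
import json

TOOLS = {
    # Toy calculator; a real system would use a properly sandboxed evaluator.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

# Pretend the model, asked "what is 23 + 5?", responded with this tool call:
model_output = json.dumps({"tool": "calculator", "arguments": {"expr": "23 + 5"}})

call = json.loads(model_output)
result = TOOLS[call["tool"]](call["arguments"]["expr"])
print(result)  # "28" -- fed back to the model as the tool result
```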
>> So you think that can help solve
hallucination?
>> Not solve it, but reduce it. The LLM still needs to know when to ask for a tool call. And the second thing is, it doesn't mean the internet is always correct. You can do a web search, but say I ask who won the World Cup in 1998: it still needs to find the right website and get the right information. So it can still go to the incorrect website and give me incorrect information. I don't think tool use will fully solve that, but it improves it. And another cool paper, earlier this year, I think it was January, well, December 31st, so not technically 2026, but close: the recursive language model.
>> That's a cool idea that takes this even a bit further. So, just to explain: Nathan, you also mentioned earlier that it's harder to do cool research in academia because of the compute budget. If I recall correctly, they did everything with GPT-5; they didn't even use local models. But the idea is: say you have a long-context task. Instead of having the LLM solve all of it in one shot, or even in one chain, you break it down into subtasks. You have the LLM decide what a good subtask is, and then recursively call an LLM to solve it. Say you have a huge Q&A task: each subtask goes to the web and gathers information, maybe also with tools, and at the end you pull it together and stitch it back together. I think there's going to be a lot of unlock from things like that, where you don't necessarily improve the LLM itself; you improve how the LLM is used and what the LLM can use. One downside right now with tool use is that you have to give the LLM permission to use tools, and that will take some trust, especially if you want to unlock things like having an LLM answer emails for you, or not even answer, just sort or select them for you. I don't know if I would give an LLM access to my emails today. I mean, that is a huge risk.
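To make the recursive-language-model idea above concrete, here's a toy sketch; `ask_llm` is a hypothetical stand-in for whatever client you use, and the prompts are illustrative.

```python
# Recursive decomposition in miniature: instead of one giant long-context
# call, split the task into subtasks, answer each with a fresh (possibly
# tool-using) call, then synthesize the pieces.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def recursive_answer(task: str, depth: int = 0, max_depth: int = 2) -> str:
    if depth == max_depth:
        return ask_llm(task)  # base case: answer directly
    plan = ask_llm(f"Break this task into 2-4 independent subtasks, one per line:\n{task}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]
    sub_answers = [recursive_answer(s, depth + 1, max_depth) for s in subtasks]
    combined = "\n".join(f"- {s}: {a}" for s, a in zip(subtasks, sub_answers))
    return ask_llm(f"Stitch these partial results into one answer to '{task}':\n{combined}")
```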
>> One last point on the tool-use thing. You hinted at this, and we've both come at it in our own ways: open versus closed models use tools in very different ways. With open models, people go to Hugging Face, download the model, and then the person decides, what tool do I want? I don't know, Exa is my preferred search provider, but somebody else might prefer a different search startup. When you release a model, it needs to be useful with multiple tools for multiple use cases, which is really hard, because you're making a general reasoning-engine model, which is actually what GPT-OSS is good at. With the closed models, you're deeply integrating the specific tool into your experience. And I think open models will struggle to replicate some of the things I like to do with closed models, like referencing a mix of public and private information. Something I keep trying every 3 to 6 months is Codex on the web, which is just prompting a model to make an update to some GitHub repository that I have, and that sort of secure cloud environment is just so nice: send it off to do this thing and then come back to me. These things will probably help define some of the local, open, and closed niches. But because there was such a rush to get tool use working, the open models were initially on the back foot, which is kind of inevitable; there are so many resources in these frontier labs. It will be fun when the open models solve this, because it's going to necessitate a somewhat more flexible and potentially interesting model, something that might work with this recursive idea, acting as both an orchestrator and a tool-use model. So hopefully the necessity drives some interesting innovation there. So, continual learning. This is a long-standing topic, an important problem, and I think it increases in importance as the cost of training the models goes up. Can you explain what continual learning is and how important it might be, this year and in the coming years, to making progress?
progress. This relates a lot to this
kind of SF zeitgeist of what is AGI, which is artificial general intelligence, and what is ASI, artificial super intelligence, and what are the language models that we have today
capable of doing. I think the language
models can solve a lot of tasks but a
key milestone among the AI community is
essentially when AI could replace any
remote worker taking in information and
solving digital tasks and doing them.
And the limitation that's highlighted by people is that a language model will not learn from feedback the same way that an employee does. So if you hire an
editor, the editor will mess up, but you
will tell them. And if you hired a good
editor, they don't do it again. But
language models don't have this ability
to modify themselves and learn very
quickly. So the idea is if we're going
to actually get to something that is a
true like general adaptable intelligence
that can go into any remote work
scenario, it needs to be able to learn
quickly from feedback and do on-the-job learning.
>> I'm personally more bullish on language models being able to work from very good context that you provide them. You maybe said this offline: you can write extensive documents for models where you say, I have all this information. Here's all the blog posts I've ever written. I like this type of writing. My voice is based on this. But a lot of people don't provide this to models, and the models weren't designed to take this amount of context previously; the agentic models are just starting. So it's this kind of
trade-off of do we need to update the
weights of this model with this
continual learning thing to make them
learn fast or the counterargument is we
just need to provide them with more
context and information and they will
have the appearance of learning fast by
just having a lot of context and being
very smart. So we should mention the terminology here. Continual learning refers to changing the weights continuously, so that the model adapts and adjusts based on the new incoming information, and does so continually, rapidly, and frequently. And then the thing you mentioned on the other side of it is generally referred to as in-context learning. As you learn stuff, there's a
huge context window. You can just keep
loading it with extra information every
time you prompt the system. I think both can legitimately be seen as learning.
>> It's just a different place where you're
doing the learning.
>> To be honest with you, continual learning — the updating of weights — we already have that in different flavors.
I mean, I think the distinction here is: do you do that on a personalized custom model for each person, or on a global model scale? And I think we have the latter already with going from GPT-5 to 5.1 and 5.2. It's maybe not immediate, but it
is a quick curated update: there was feedback on the things it couldn't do, feedback by the community, they updated the weights for the next model, and so forth. So it is kind of like a
flavor of that. An even finer-grained example is RLVR: you run it, it updates. The problem is you can't just do that for each person, because it would be too expensive to update the weights for each person. I think that's the problem — even at OpenAI scale, building data centers, it would be too expensive. I think that is only feasible
once you have something on the device
where the cost is on the consumer like
what Apple tried to do with the Apple
foundation models putting them on the
phone and then they learn from the
experience. A bit of a related topic, but this kind of maybe anthropomorphized term: memory.
>> What are different ideas for the mechanism of how to add memory to these systems, as you're increasingly seeing — personalized memory especially?
>> So right now it's mostly context: basically stuffing things into the context and then just recalling that. But again, it's expensive, because — I mean, you can cache it, but you still spend tokens on that. And the second thing is you can only do so much. I think it's more like a
preference or a style. I mean a lot of
people do that when they solve math problems. You can basically add previous knowledge and stuff, but you also give it certain preference prompts: do what I preferred last time, something like that. But it doesn't unlock new capabilities. For that, one thing
people still use is LoRA, LoRA adapters. These are basically, instead of updating the whole weight matrix, two smaller weight matrices that you have in parallel, or overlay, like the delta. But yeah, you
can do that to some extent but then
again, it is economics. There were also papers, for example, LoRA Learns Less and Forgets Less. It's no free lunch: if you want to learn more, you need to use more weights, but it gets more expensive; and if you learn more, you also forget more. You have to find that Goldilocks zone, basically.
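A minimal PyTorch sketch of the LoRA idea just described — a frozen weight matrix plus a low-rank delta built from two small trainable matrices. The rank and scaling values here are common illustrative defaults, not any specific paper's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen weight W plus a trainable low-rank delta B @ A."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # Pretrained weight stays frozen; no gradients flow into it.
        self.W = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # Two small matrices: A starts small, B starts at zero so the
        # overlaid delta is initially a no-op.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only A and B are trained; since rank << min(d_in, d_out),
        # the adapter is a tiny fraction of the full matrix's parameters.
        return x @ (self.W + self.scale * (self.B @ self.A)).T
```

The trade-off named in the paper title shows up directly: a larger rank can learn more but costs more and drifts further from the frozen weights.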
>> We haven't really mentioned it much, but implied in this discussion is context length. Is there a lot of innovation possible there?
>> I think the colloquially accepted thing is that it's a compute and data problem, plus sometimes small architecture things like attention variants. We talked about hybrid attention models, which is essentially having what looks like a state space model within your transformer; those are better suited because you have to spend less compute to model the furthest-along token. But those aren't free, because they have to be accompanied by a lot of compute or
the right data. So how many sequences of
100,000 tokens do you have in the world
and where do you get these? And I think
it just ends up being pretty expensive
to scale them. So we've gotten pretty quickly to a million tokens of input context length, and I would expect it to keep increasing and get to two million or five million this year, but I don't expect it to go to 100 million. That would be
like a true breakthrough. And I think
those breakthroughs are possible. Like
the continual learning thing, I think of it as a research problem where there could be a breakthrough that just makes transformers work way better at this and is cheap. These things could happen with so much scientific attention on it; but just turning the crank, it'll be consistent increases over time.
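As a toy illustration of the hybrid attention idea mentioned above — a few full-attention layers interleaved with cheap state-space layers — here is a sketch. The one-in-eight ratio is invented for illustration and is not any model's actual recipe.

```python
# Hypothetical layer pattern for a hybrid model: mostly cheap
# state-space (Mamba-style) layers, with occasional full-attention
# layers that keep all tokens globally accessible.

def hybrid_pattern(n_layers: int, attention_every: int = 8) -> list[str]:
    return [
        "full_attention" if (i + 1) % attention_every == 0 else "state_space"
        for i in range(n_layers)
    ]

print(hybrid_pattern(16))
# 16 layers with full attention only at layers 8 and 16;
# the rest use a fixed-size compressed state.
```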
I think also, looking at the extremes, there's again no free lunch. At one extreme, to make it cheap, you have, let's say, an RNN that has a single state where you save everything from the previous tokens. It's a specific fixed-size thing, so you never really grow the memory, because you are stuffing everything into one state. But then the longer the context gets, the more information you forget, because you can't compress everything into one state. Then on the
other hand you have the transformers
which try to remember every token, which is great sometimes, if you want to look up specific information, but very expensive, because you have the KV cache that grows, the dot product that grows. And then, yeah, like you said, the Mamba layers kind of have the same problem as an RNN: you try to compress everything into one state. You're a bit more selective there, but then I think it's this Goldilocks zone again. With Nemotron 3 they found a good ratio of how many attention layers you need for the global information, where everything is accessible, compared to having these compressed states. And I think that's how we will scale more: by finding better ratios in the Goldilocks zone, between making it cheap enough to run and making it powerful enough to be useful. And one more plug here: the
recursive language model paper that is
one of the papers that tries to address the long-context thing. What they found is essentially that instead of stuffing everything into this long context, if you break it up into multiple smaller tasks, you save memory by having multiple smaller calls, and you can actually get better accuracy than having the LLM try everything all at once. It's a new paradigm; we will see, there might be other flavors of that. So I think with that we will still make improvements on long context. But then also, like Nathan said, the problem is that for pre-training itself we don't have as many long-context text documents as other documents. So it's harder to study how LLMs behave at that level. Like there are some rules of
thumb, where essentially you pre-train a language model — like we pre-trained to 8k context length — and then extend it to 32k with training. The rule of thumb is that doubling the training context length takes about 2x compute, and then you can normally 2-4x the context length again.
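As a back-of-the-envelope illustration of that rule of thumb (the multipliers are loose heuristics from the conversation, not a precise law):

```python
# Toy arithmetic: each extension stage costs ~2x the previous
# training compute and buys ~2-4x more context length.
context, relative_cost = 8_000, 1.0
for stage in range(3):
    relative_cost *= 2   # doubling the context-extension compute
    context *= 4         # optimistic end of the 2-4x range
    print(f"stage {stage + 1}: ~{context:,} tokens at {relative_cost:.0f}x compute")
# stage 1: ~32,000 tokens at 2x compute
# stage 2: ~128,000 tokens at 4x compute
# stage 3: ~512,000 tokens at 8x compute
```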
So I think a lot of it ends up being compute-bound at pre-training. Like we talked about, everyone talks about this big increase in compute for the top labs this year, and that should reflect in some longer context windows. But I think on the post-training side there are some more interesting things,
which is, as we have agents, the agents are going to manage this context on their own. Right now, people that use Claude Code a lot dread the compaction, which is when Claude takes its entire 100,000 tokens of work and compacts it into a bulleted list. But what the next models will do — this is not novel, I'm sure people are already working on it — is essentially that the model can control when it compacts and how. So you can train your RL algorithm where compaction is an action,
>> where it shortens the history. And then the problem formulation will be: I want to keep the maximum evaluation scores that I have gotten, while the model compacts its history to the minimum length, because then you have the minimum number of tokens you need for this kind of compounding autoregressive prediction. So it's actually a pretty nice problem setup, where these agentic models learn to use their context in a different way than just plowing forward.
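A hedged sketch of that problem formulation — reward preserving the task score while penalizing the number of history tokens kept. The function name and the penalty constant are hypothetical.

```python
# Hypothetical reward for "compaction as an action": the agent should
# preserve the evaluation score it achieved while keeping as few
# history tokens as possible after compaction.

def compaction_reward(eval_score: float, baseline_score: float,
                      tokens_kept: int, length_penalty: float = 1e-5) -> float:
    # Positive if the compacted history preserves task performance...
    score_term = eval_score - baseline_score
    # ...minus a small cost per token retained in the history.
    return score_term - length_penalty * tokens_kept
```

The penalty weight sets where the trade-off lands: too high and the model throws away information it needs, too low and it never bothers compacting.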
>> One interesting recent example would be DeepSeek V3.2, where they had the sparse attention mechanism: they have a very efficient, small, lightweight indexer, and instead of attending to all the tokens, it selects which tokens it actually needs. It almost comes back to the original idea of attention, where you are selective. But attention is always on: you have maybe zero weight on some tokens, but you use them all. This goes further: okay, let's just mask that out, or not even compute it. And sliding window attention is also kind of like that idea: you have that rolling window where you keep it fixed, because you don't need everything all the
time. Occasionally, some layers you
might, but it's wasteful. But right now,
I think, yeah, if you use everything,
you're on the safe side. It gives you
the best bang for the buck because you
never miss information. And I think this year will also be the year of figuring out, like you said, how to be smarter about that. Right now people want to have the next state-of-the-art, and the state-of-the-art happens to be the brute-force expensive thing. Then once you have that, like you said, you keep that accuracy but see how you can do it cheaper with tricks.
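A small PyTorch sketch of the sliding-window idea just mentioned — each query may attend only to the last `window` tokens instead of the full sequence, so memory stays fixed-size:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: True where query position i may attend to key j."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (column)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (row)
    causal = j <= i                         # never attend to the future
    recent = (i - j) < window               # only the last `window` tokens
    return causal & recent

print(sliding_window_mask(6, window=3).int())
# Each row has at most 3 ones: the token itself plus the two before it.
```

DeepSeek's learned indexer is the adaptive cousin of this fixed mask: instead of a hard-coded window, a small model picks which tokens to keep.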
>> Yeah, all the scaling stuff. Like the reason we get the Claude 4.5 Sonnet model first is because you can train it faster. You're not hitting these compute walls as soon, and they can just try a lot more things and get the model out faster, even though the bigger model is actually better. I think we should say
that there's a lot of exciting stuff
going on in the AI space. My mind has recently been really focused on robotics, but today we almost entirely didn't talk about robotics. There's a lot of stuff on image generation, video generation. I think it's fair to say that the most exciting research work — in terms of the amount, the intensity, the fervor — is in the LLM space, which is why I think it's justified for us to really focus on the LLMs that we're discussing. But it'd be nice to bring in certain things that might be useful. For example, world models.
There's growing excitement on that. Do
you think there will be any use in this
coming year for world models in the LLM
space?
>> Yes, I do think so. Also with LLMs, what's interesting here is that if we unlock more LLM capabilities, it also automatically makes progress faster in all the other fields, because a lot of researchers and engineers use LLMs, like we said, for coding. So even if they work on robotics, if you optimize these LLMs that help you with coding, it pays off. But then, yes, world models are interesting. It's basically where you have the model run a simulation of the world — in a sense, a little toy version of the real thing — which can again unlock capabilities the LLM is not aware of: it can simulate things.
And I think LLMs just happen to work well by pre-training and then doing next-token prediction, but we could make this even a bit more sophisticated, in a sense. What I'm saying is — I think it was a Meta paper, Code World Models —
where they basically apply the concept of world models to LLMs: instead of just having next-token prediction and verifiable rewards checking the answer's correctness, they also make sure the intermediate variables are correct. The model is basically learning a code environment, in a sense. I think this makes a lot of sense; it's just expensive to do. But it is making things more sophisticated — modeling the whole process, not just the result — so it can add more value.
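A hedged sketch of that "check the intermediate variables, not just the final answer" idea, with hypothetical names: compare a model's predicted execution trace against a reference trace, step by step.

```python
# Hypothetical reward: fraction of intermediate program states
# (variable snapshots per step) that match a reference execution,
# rather than only checking the final answer.

def trace_match_reward(predicted: list[dict], reference: list[dict]) -> float:
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)

# Example: partial credit for two of three correct intermediate states.
ref = [{"x": 1}, {"x": 2}, {"x": 4}]
pred = [{"x": 1}, {"x": 2}, {"x": 3}]
print(trace_match_reward(pred, ref))  # 0.666...
```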
I remember from when I was a grad student: there's a competition called CASP, I think, where they do protein structure prediction — they predict the structure of a protein that was not solved yet at that point. In a sense
this is actually great, and I think we need something like that for LLMs also, where you do the benchmark but no one knows the solution: you hand in the results, and after the fact someone reveals the answer. AlphaFold, when it came out, crushed this benchmark. There were multiple iterations, but I remember the first one — I'm not an expert in that subfield — the first one explicitly modeled the physical interactions, the physics of the molecule, like the impossible angles. And then in the next version I think they got rid of this and just brute-force scaled it up. And I think with LLMs we are currently in this brute-force scaling phase, because it just happens to work. But I do think at some point it might make sense to bring back that explicit modeling, and with world models, that might actually be quite cool. And of course also for robotics, which is a completely separate area from LLMs.
>> Yeah. In robotics this shows up very explicitly. There's the problem of locomotion and the problem of manipulation. Locomotion is much more solved, especially in the learning domain. But there's a lot of value, just like with the initial protein folding systems, in bringing in the traditional model-based methods. So it's unlikely that you can just learn the manipulation, or the whole-body loco-manipulation problem, end to end. That's the dream. But when you look at the magic of the human hand and the complexity of the real world, you realize it's really hard to learn this all the way through, the way AlphaFold 2 did. I'm excited about the
robot learning space because I think it's collectively getting supercharged by all the excitement and investment in language models generally. The infrastructure for training transformers, which is a general modeling thing, is becoming world-class industrial tooling, so wherever that was a limitation for robotics, it's just way better now, and there's way more compute. And then on top of that, they take these language models and use them as kind of central units where you can do interesting exploratory work around something that already kind of works. And then I see it emerging kind of like we talked about with Hugging Face Transformers and Hugging Face. When I was at Hugging Face, I was trying to get this to happen, but it was too early: open robotic models on Hugging Face, with people able to contribute data and fine-tune them. I think we're much closer now. The investment in robotics — and I think self-driving cars is related — enables this. Once you get to the point where you have this sort of ecosystem, where somebody can download a robotics model and maybe fine-tune it to their robot, or share datasets across the world — there's some work in this area, like RT-X a few years ago, where people are starting to do that — once they have this ecosystem, it'll look very different. And this whole post-ChatGPT boom is putting more resources into that, which I think is a very good area for doing research.
>> This is also resulting in much better, more accurate, more realistic simulators being built, closing the sim-to-real gap in the robotics space. But you mentioned a lot of excitement in the robotics space and a lot of investment. The downside of that, which happens in hype cycles — I personally believe, and most robotics people believe, that robotics is not going to be solved at the time scale that's being implicitly or explicitly promised. So what happens when all these robotics companies spring up and then don't have a product that works? Then there's going to be this kind of crash of excitement, which is nerve-wracking. Hopefully something else will come in and keep swooping in so that the continued development of some of these
ideas keeps going. That's also related
to the continual learning issue, essentially, because the real world is so complex. With LLMs, you don't really need to have something learn for each user, because there are a lot of things everyone has to do. Everyone maybe wants to, I don't know, fix the grammar in their email, or code, or something like that. It's more constrained, so you can kind of prepare the model for that. But preparing the robot for the real world — that's harder. I mean, you have the robotic foundation models, and you can learn certain things, like grasping, but everyone's house is different. It's so different, and that is where the robot would have to learn on the job, essentially. I guess that's the bottleneck right now: customizing it on the fly.
>> I don't think I can possibly overstate the importance of the thing that doesn't get talked about almost at all by robotics folks, or anyone: safety. All the
learning all the failure modes and
failure cases everything we've been
talking about LLM sometimes it fails in
it interesting ways all of that is fun
in games in the LLM space in the robotic
space in people's homes across millions
of minutes
billions of interactions you really are
almost allowed to fail never when you
have embodied systems that are put out
there in the real world. You You just
have to solve so many
problems you never thought you'd have to
solve when you're just thinking about
the general robot learning problem.
>> I'm so bearish on in-home learned robots for consumer purchase. I'm very bullish on self-driving cars, and I'm very bullish on robotic automation, e.g. Amazon distribution, where Amazon has built whole new distribution centers designed for robots first rather than humans. There's a lot of excitement in AI circles about AI enabling automation and mass-scale manufacturing. And I do think that the path to robots doing that is more reasonable, where it's a thing that is designed and optimized to do a repetitive task that a human could conceivably do but doesn't want to. But it's also going to take a lot longer than people probably predict. I think the leap from AI singularity to "we can now scale up mass manufacturing in the US because we have a massive AI advantage" is one that is troubled by a lot of political and other challenging problems.
Let's talk about timelines, specifically timelines to AGI or ASI. Is it fair, as a starting point, to say that nobody really agrees on the definitions of AGI and ASI?
>> I kind of think there's a lot of disagreement, but I've been getting pushback where a lot of people say roughly the same thing, which is: a thing that could reproduce most digital economic work. So the remote worker is a fairly reasonable example, and I think OpenAI's definition is somewhat related to that, which is an AI that can do a certain number of economically valuable tasks. I don't really love it as a definition, but I think it could be a grounding point, because language models today, while immensely powerful, are not this drop-in remote worker. And there are things you could think of that could be done by an AI that are way harder than remote work, like finding an unexpected scientific discovery that you couldn't even posit — which would be an example of something that somebody would call an artificial super intelligence problem. Or taking in all medical records and finding linkages across certain illnesses that people didn't know, or figuring out that some common drug can treat some niche cancer. They would say that that is a super intelligence thing. So these are kind of natural tiers. My problem with it is that it becomes deeply entwined with the quest for the meaning of AI and the religious aspects of it. So there are different paths you can take it.
>> And I don't even know if remote work is a good definition, because what exactly is that? It's like perfect tool use. I don't know if you like the AI 2027 report — they focus more on code and research taste. The target there is the superhuman coder. They have several milestone systems: superhuman coder, superhuman AI researcher, then superintelligent AI researcher, and then the full ASI, artificial super intelligence. But after you develop the superhuman coder, everything else falls quickly from there. The task is to fully automate coding, so any kind of coding you need to do in order to perform research is fully automated. From there, humans would be doing AI research together with that system, and they would quickly be able to develop a system that can actually do the research for you. That's the idea. Initially their prediction was 2027-28, and now they've pushed it back by three to four years, to a 2031 mean prediction. My prediction is probably even beyond 2031. But at least you can think in a concrete way about how difficult it is to fully automate programming.
>> Yeah, I disagree with some of their presumptions and dynamics on how it would play out. But I think they did good work in the scenario, defining milestones that are concrete, and told a useful story, which is why the reach of this AI 2027 document well transcended Silicon Valley: they told a good story and did a lot of rigorous work to do it. The camp I fall into is that AI is so-called jagged: it will be excellent at some things and really bad at others. When they're close to this automated software engineer, what it will be good at is traditional ML systems and front end, which the model is excellent at. But distributed ML, the models are actually quite bad at, because there's so little training data on doing large-scale distributed learning. This is something we already see, and I think it will just get amplified, and then it's messier in these trade-offs, and then there's how you think AI research works, and so on.
So you think basically the superhuman coder is almost unachievable, meaning that because of the jagged nature of the thing, you're just always going to have gaps in capabilities?
>> I think it's assigning completeness to something where the models are kind of superhuman at some types of code, and I think that will continue. People are creative, so they'll utilize these incredible abilities to fill in the weaknesses of the models and move really fast. There will always be — I've perceived this for a long time — this dance where the humans are enabling the thing that the model can't do, and the best AI researchers are the ones that can enable this superpower. I think this aligns with what we already see: Claude Code for building a website, you can stand up a beautiful website in a few hours, or do data analysis. It's going to keep getting better at these things, and it'll pick up some new code skills along the way. And linking to what's happening in big tech: this AI 2027 report leans into the singularity idea, where I think research is messy and social and largely in the data, in ways that AI models can't process. But what we do have today
is really powerful and these tech
companies are all collectively buying into this with tens of billions of dollars of investment. So we are going to get a much better version of ChatGPT, a much better version of Claude Code, than we already have. It's just hard to predict where that is going. But the bright clarity of that future is why some of the most powerful people in the world are putting so much money into this. And it's just kind of small differences: we don't actually know what a better version of ChatGPT is, but also, can it automate AI research? I would say probably not, at least in this time frame. Big tech is going to spend $100 billion much faster than we get an automated AI researcher that enables an AI research singularity.
>> So your prediction would be what — if this is even a useful milestone — more than 10 years out?
>> I would say less than that on the software side, but longer than that on things like research.
>> Let's just, for fun, try to imagine a world where all
software writing is fully automated.
Like can you imagine that world?
>> By the end of this year, the amount of software that'll be automated will be so high. But it'll be things like: you're trying to train a model with RL and you need multiple bunches of GPUs communicating with each other. That'll still be hard, but I think it'll be much easier.
>> One of the ways to think about the full automation of programming is to think of the lines of useful code written, and the ratio of that to the number of humans in the loop. Presumably, for a long time there will be humans in the loop of software writing; there will just be fewer and fewer of them relative to the amount of code written, right? And with the superhuman coder, I think the presumption is that the number of humans in the loop goes to zero. What does that world look like, when the number of humans in the loop is in the hundreds, not in the hundreds of thousands?
>> I think software engineering will be driven more toward system design and the goals of outcomes. I think this has been happening over the last few weeks, where people have gone from, a month ago, "oh, AI agents are kind of slop," which is a famous Karpathy quote, to what is a little bit of a meme of the industrialization of software, when anyone can just create software at their fingertips. I do think we are closer to that side of things, and it takes direction and understanding how the systems work to extract the best from the language models. I think it's hard to accept the gravity of how much is going to change with software development, and how many more people can do things without ever looking at the code. I think
what's interesting is to think about whether these systems will be completely independent, in the sense that — well, I have no doubt that LLMs will at some point solve coding, in a sense, like calculators solved calculating. At some point humans developed a tool where you never need a human to calculate that number; you just type it in and an algorithm does it. I think that's probably the same for coding. But the question is: what will happen is, you just say build that website, it will make a really good website, and then you maybe refine it. But will it do things independently? Will you still have humans asking the AI to do something — will there be a person saying build that website — or will there be AI that just builds websites on its own?
>> I think talking about building websites is
>> too simple. There's the problem with websites and the problem with the web, you know, HTML and all that kind of stuff. It's very resilient to
>> slop. It will display slop just as readily as anything good.
>> I would rather think of safety-critical systems, like asking AI to end-to-end generate
>> something that manages logistics,
>> or manages a fleet of cars, all that kind of stuff. It end-to-end generates that for you.
>> I think a more intermediate example is something like Slack or Microsoft Word. If the organizations allow it, AI could very easily implement features end to end and do a fairly good job for things that you want to try. You want to add a new tab in Slack that you want to use — I think AI will be able to do that pretty well.
>> Actually, that's a really great example. How far away are we from that?
>> Like this year.
>> See, I don't know. I guess I don't know how bad production code bases are, but I think that
>> within on the order of low years, a lot of people are going to be pushed to be more of a designer and product manager, where you have multiple of these agents that can try things for you. They might take one to two days to implement a feature or attempt to fix a bug, and you have these dashboards — I think Slack is actually a good dashboard — where your agents will talk to you, and you then give feedback. But things like, I make a website and you want to make a logo that's passable: I think these cohesive design things and style are going to be very hard for models, and so is deciding on what to add next.
>> Okay, so I hang out with a lot of programmers, and some of them are a little bit on the skeptical side in general. That's just vibe-wise how they are. I just think there's a lot of complexity involved in adding features to complex systems. Like if you look at the browser, Chrome: if I wanted to add a feature — if I wanted to have tabs on the left side as opposed to up top.
>> Mhm.
>> A whole different interface, right? I think this is not a next-year thing.
>> One of the Claude releases this year, one of their tests was: we give it a piece of software and leave Claude to run, to recreate it entirely. And it could already almost rebuild Slack from scratch, just given the parameters of the software and left in a sandbox environment.
>> The from-scratch part I almost like better.
>> So it might be that the smaller, newer companies are advantaged: they're like, we don't have the bloat and complexity, and therefore this future exists.
>> And I think this gets to the point where you mentioned that some people you talk to are skeptical, and I think that's not because the LLM can't do XYZ. It's because people don't want it to do it this way.
>> Some of that could be a skill issue on the human side — unfortunately, we have to be honest with ourselves — and some of that could be an under-specification issue. With programming, like in relationships and friendships, it's a communication type of issue: you're assuming the LLM is somehow supposed to read your mind. I think this is where spec-driven design is really important: you just use natural language to specify what you want. If you talk to people at the labs, they use these in their training, in production code — Claude Code is built with Claude Code — and they all use these things extensively, and Dario talks about how much of Claude's code Claude writes on its own. These people are slightly ahead in terms of the capabilities they have, and they
>> probably spend on inference — they could spend 10 to 100-plus-x as much as we're spending. We're on a lowly $100 or $200 a month plan; they truly let it rip. And with the pace of progress that we have — a year ago we didn't have Claude Code and we didn't really have reasoning models — it's the difference between sitting here today and what we can do with these models. It seems like there's a lot of low-hanging fruit to improve them. The failure modes are pretty dumb. It's like: Claude, you tried to use the CLI command I don't have installed 14 times, and then I sent you the command to run. That thing, from a modeling perspective, is pretty fixable.
>> I agree with you. I've been becoming more and more bullish in general. Speaking to what you're articulating, I think it is a human skill issue. Anthropic, or other companies, are leading the way in understanding how to best use the models for programming; therefore, they're effectively using them. I think there are a lot of programmers on the outskirts, and there's not a really good guide on how to use them. People are trying to figure it out.
>> It might be very expensive. It might be that the entry point for that is $2,000 a month, which is only tech companies and rich people. That could be it.
>> But it might be worth it. If the final result is a working software system, it might be worth it. By the way, it's funny how we converged from the discussion of the timeline to AGI to something more pragmatic and useful. Is there anything concrete and interesting and useful and profound to be said about the timeline to AGI and ASI, or are these discussions a bit too detached from the day-to-day? There's
interesting bets. There are a lot of people trying to do reinforcement learning with verifiable rewards, but in real scientific domains, where there are startups that have hundreds of millions of dollars of funding and wet labs, where they're having language models propose hypotheses that are tested in the real world. And I would say that they're early, but with the pace of progress,
>> yeah,
>> maybe they're early by six months and they make it because they were there first, or maybe they're early by eight years. You don't really know. So that type of moonshot to branch this momentum into other sciences — that would be very transformative, if AlphaFold moments happen in all sorts of other scientific domains by a startup solving this. There are startups — I think maybe Harmonic is one — where they're going all in on language models plus Lean for math. I think you had another podcast guest who talked about this recently. We don't know exactly what's going to fall out of spending $100 million on that model, and most of them will fail, but a couple of them might be big breakthroughs that are very different from ChatGPT or Claude Code type software experiences — like a tool that's only good for a PhD mathematician but makes them 100x effective.
>> Okay, I agree. I think this will happen in a lot of domains, especially domains that have a lot of resources, like finance and legal and pharmaceutical companies. But then again, is it really AGI, if we are now specializing it again? And is it really that much different from back in the day, when we had specialized algorithms? I think it's the same thing, way more sophisticated, but I don't know, is there a threshold when we call it AGI? I think the real cool thing here is that we have the foundation models that we can specialize. That's the breakthrough at some point. Right now, I think we're not there yet, because first, it's too expensive, but also, OpenAI doesn't just give away ChatGPT for you to customize. I think once that's going to be true — and I can imagine this as a business model, that OpenAI says at some point, hey, Bank of America, for 100 million we will do your custom model, or something like that — I think that will be the huge economic value-add. The other thing, though, is companies: right now, what is the differentiating factor? If everyone uses the same LLM, if everyone uses ChatGPT, they will all do the same thing again; everyone is moving in lockstep. But usually companies want to have a competitive advantage, and I think there's no way around using some of their private data and experimenting and maybe specializing. It's going to be interesting.
>> Sitting in the pace of progress, it does just feel like things are coming. I don't think the AGI and ASI thresholds are particularly useful.
>> I guess the real question, and this takes us back to the remote worker thing, is when are we going to see a big, obvious leap in economic impact? Currently there hasn't been an obvious leap in the economic impact of LLMs, for example. Aside from AGI or ASI and all that kind of stuff, there's a real question of when we are going to see a GDP
>> Mhm. Mhm.
>> jump.
>> Yeah. What is GDP made up of? A lot of it is financial services. So
>> I don't know what this is. It's just hard for me to think about the
>> GDP bump. But I'd say that software development becomes valuable in a different way when you no longer have to look at the code anymore. When it's like, Claude will make you a small business: essentially, Claude can set up your website, your bank account, your email, and whatever else, and you just have to express what you're trying to put into the world. That's not just an enterprise market, but it is hard — I don't know how you get people to try doing that. I guess if ChatGPT can do it, people are trying ChatGPT.
>> I think it boils down to the scientific question of how hard tool use is to solve, because a lot of the stuff you're implying, the remote work stuff, is tool use. Like computer use: how do you have an LLM, this agentic system, that goes out there and does something in the world and only screws up 1% of the time? Computer use is a good example of something the labs care about where we haven't seen a lot of progress. We saw multiple demos in 2025, like Claude can use your computer, or OpenAI had CUA, and they all suck.
>> So they're also investing money in this, and I think that'll be a good example, where it just seems
>> pretty hard. Taking over the whole screen seems a lot harder than having an API that they can call in the back end. And some of that is that you then have to set up a different environment for the model to work in. They're not working on your MacBook; they are individually interfacing with Google and Amazon and Slack, and they handle all these things in a very different way than humans do. So some of those might be structural blockers.
>> Also, specification-wise, I think the problem is that for arbitrary tasks, you still have to specify what you want your LLM to do. And how do you do that? What is the environment? How do you specify? You can say what the end goal is, but if it can't solve the end goal — with LLMs, if you ask for text, you can always clarify, do substeps. But how do you put that information into a system that, let's say, books a travel trip for you?
You can say, well, you screwed up my credit card information. But even to get it to that point — how do you, as a user, guide the model before it can even attempt that? I think the interface is really hard.
>> Yeah, it has to learn a lot about you specifically,
>> and this goes back to continual learning: about the general mistakes that are made throughout, and then the mistakes that are made with you specifically.
>> All the AI interfaces are getting set up to ask humans for input.
>> And Claude Code, which we talked about a lot — it asks for feedback with questions. If it doesn't have enough specification on your plan or your desires, it starts to ask questions: would you rather this or that? We talked about memory, which saves across chats, whose first implementation is kind of odd, where it'll mention my dog's name or something in a chat. I'm like, you didn't need to be subtle about this, I don't care. But the things that are emerging: ChatGPT has the Pulse feature, which is a curated couple of paragraphs with links to something to look at or to talk about, and people talk about how the language models are going to ask you questions, which is probably going to work. The language model knows you had a doctor's appointment or something, and it's like, hey, how are you feeling after that? Which
>> again goes into the territory of: humans are very susceptible to this, and there's a lot of social change to come. But they're experimenting with having the models engage. Some people really like this Pulse feature, which processes your chats and automatically searches for information and puts it in the ChatGPT app. So there are a lot of things coming.
>> I used that feature before, and I always feel bad, because it does that every day and I rarely check it. It's like, how much money, how much compute, is burned on something I don't even look at?
>> There's also a lot of idle compute in the world, so don't feel too bad.
>> Okay. Do you think new ideas might be needed? Is it possible that the path to AGI, whatever that is, however we define that — to solve computer use, or more generally to solve biology and chemistry and physics, sort of the Dario definition of AGI, or powerful AI — do you think it's possible that totally new ideas are needed, non-LLM, non-RL ideas? What might they look like? We're now going into philosophy land a little.
>> For something like a singularity to happen, I would say yes. And the new ideas could be architectures or training algorithms, which are fundamental deep learning things, but by their nature pretty hard to predict. But I think we will get very far even without those advances: we might get this software solution, but it might stop at software and not do computer use without more innovation. So a lot of progress will be coming. But if you're going to zoom out: there are still ideas in the next 30 years that are going to look like, that was a major scientific innovation that enabled the next chapter of this, and I don't know if it comes in one year or in 15 years.
>> Yeah. I wonder if the bitter lesson holds true for the next 100 years, and what that looks like.
>> If scaling laws are fundamental in deep learning, I think the bitter lesson will always apply, which is that compute will become more abundant. But even with abundant compute, the methods that have a steeper scaling-law slope or a better offset — think of a 2D plot of performance versus compute — even if there's more compute available, the ones that get 100x out of it will win.
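In rough equation form, the kind of curve being described is loss = offset + k * compute^(-slope). A toy comparison of two hypothetical methods, with all constants invented for illustration:

```python
# Toy scaling-law curves: the method with the steeper slope and
# better offset wins at large compute, even with the same budget.

def loss(compute: float, k: float, slope: float, offset: float) -> float:
    return offset + k * compute ** (-slope)

for c in (1e20, 1e22, 1e24):
    a = loss(c, k=1e3, slope=0.15, offset=1.8)  # method A
    b = loss(c, k=1e3, slope=0.20, offset=1.7)  # method B: steeper, better offset
    print(f"compute {c:.0e}: A={a:.3f}  B={b:.3f}")
```

The gap between the two curves widens as compute grows, which is the sense in which a better method "gets 100x out of" the same hardware.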
>> It might be something like, literally, compute clusters orbiting Earth with solar panels.
>> The problem with that is heat dissipation. You get all the radiation from the sun, and you don't have any air to dissipate heat into. But there is a lot of space to put clusters, there's a lot of solar energy there, and you could figure out the heat dissipation. There is a lot of energy, and there probably could be the engineering will to solve the heat problem. So there could be
>> Is it possible — and we should say that it definitely is possible; how likely is the question — that we're basically going to plateau this year? Not in terms of the system capabilities, but in what the system capabilities actually mean for human civilization. So on the coding front, really nice websites will be built. Very nice autocomplete.
>> Mhm.
>> A very nice way to understand code bases and maybe help debug, but really just a very nice helper on the coding front. It can help research mathematicians do some math. It can help you with shopping. It's a nice helper — Clippy on steroids. What else? It may be a good education tool and all that kind of stuff, but computer use turns out to be extremely difficult to solve. I'm trying to frame the cynical case, where in all these domains there's not a really huge economic impact, and we realize how costly it is to train these systems at every level — both the pre-training and the inference, how costly the inference is, the reasoning, all of that. Is that possible, and how likely is it, do you think?
>> When you look at the models, there are so many obvious things to improve, and it takes long enough to train these models and to do this art, that it'll take us, with the ideas we already have, multiple years to actually saturate in terms of whatever benchmark or performance we are searching for. It might serve very narrow niches — the average ChatGPT user, of the 800 million, might not get a lot of benefit out of this — but it is going to serve different populations by getting better at different things.
>> Well, I think what everybody's chasing now is a general system that's useful to everybody. Okay, so that can plateau, right?
>> I think that dream is actually kind of dying, as you talked about with the specialized models, and multimodal — video generation is often a totally different thing.
>> "That dream is kind of dying" is a big statement, because I don't know if it's dying. If you ask the actual frontier lab people, they're still chasing it, right?
>> I do think they are still rushing to get the next model out, which will be much better — "much" is a relative term — but will be better than the previous one, and I can't see them slowing down. I just think the gains will be made, or felt, more through not only scaling the model. I feel like there's a lot of tech debt. It's like, well, let's just put the better model in there, and a better model, and a better model — and now people are like, okay, let's also at the same time improve everything around it: the engineering of the context, and inference scaling. The big labs will still keep doing that, and now the smaller labs will also catch up, because they are hiring more, and there will be more people using LLMs. It's kind of like a circle: the LLMs also make them more productive. I think what we can expect is this amplification, but not a paradigm change. I don't think that is coming; everything will just be amplified and amplified and amplified, and I can see that continuing for a long time.
>> Yeah. I guess my statement that the dream is dying depends on exactly what you think it's going to be doing. Claude Code is a general model that can do a lot of things, but it depends a lot on integrations and other things. I bet Claude Code could do a fairly good job of doing your email, and the hardest part is figuring out how to give the information to it and how to get it to be able to send your emails and things like that. But I think it goes back to the one-model-to-rule-everything ethos, which is a thing in the cloud that handles your entire digital life and is way smarter than everybody. It's an interesting leap of faith to go from Claude Code to that. In some ways there are some avenues for it, but I do think the rhetoric of the industry is a little bit different.
>> I think the immediate thing we will feel next as a normal person using LLMs will probably be related to something almost trivial, like making figures. Right now, LLMs are terrible at making figures. Is it because we are getting served the cheap models, with less inference compute than what runs behind the scenes? Maybe — there are some cranks you can turn to already get better figures. But if you ask today, I don't know, draw a flowchart of XYZ, it's most of the time terrible, and it is a very simple task for a human. I think it's sometimes almost easier to draw something than to write something.
>> Yeah, the multimodal understanding does feel like something that is odd that it's not better solved.
>> I think we're not saying one actually obvious thing, a gigantic thing that's hard to measure, which is making all of human knowledge accessible to the entire world. One of the things I think is hard to articulate is that there's just a huge difference between Google search and an LLM. I feel like I can basically ask an LLM anything and get an answer, and it's doing less and less hallucination. And that means understanding my own life, figuring out a career trajectory, figuring out how to solve the problems all around me, learning about anything through human history. I feel like nobody's really talking about that, because they immediately take it for granted: this is awesome, that's why everybody's using it, you get answers for stuff. And think about the impact of that across time. This is not just in the United States, it's all across the world — kids throughout the world being able to learn these ideas. The impact that has across time is probably — talking about GDP — it won't be a leap. It'll be: that's how we get to Mars, that's how we build these things, that's how we have a million new OpenAIs, all the innovation that happens from there. And that's just this quiet force that permeates everything, right? Human knowledge.
>> I do agree with you. In a sense, it makes knowledge more accessible. But it also depends on what the topic is. For something like math, in a sense, you can ask it questions and it answers. But if you want to learn a topic from scratch — like we talked about earlier, I think the sweet spot is: there are really good math textbooks where someone laid it out linearly, and that is a proven strategy to learn the topic. It does make sense, if you start from zero, to ramp up with an information-dense text and soak it up. But then you use the LLM to make infinite exercises. You have problems in a certain area, or you are uncertain about certain things; you ask it to generate example problems, you solve them, you have questions, and then maybe you need more background knowledge and you ask it to generate that. But it won't give you anything that is not in the textbook; it's just packaging it differently, if that makes sense. But
then there are areas where it adds value in a more timely sense, where there is no good alternative besides a human doing it on the fly. For example, let's say you're planning to go to Disneyland, and you try to figure out which tickets to buy for which park, when. Well, there is no textbook on that. There is no information-dense resource on that; there's only the sparse internet, and then there is a lot of value in the LLM. You just ask it. You have the constraints: I'm traveling these and these days, I want to go there and there, please figure out what I need, when, from where, and what it costs, and stuff like that. And it is a very customized, on-the-fly package. And this is one of a thousand examples — exercises, personalization essentially, pulling information from the sparse internet, the non-information-dense thing, where no better version exists. It just doesn't exist; you make it from scratch, almost.
>> And if it does exist, it's full of — speaking of Disney World — what would you call it? Ad slop.
>> It's just impossible. Here you go, any city in the world:
>> what are the top 10 things to do?
>> An LLM is just way better to ask than anything on the internet.
>> Well, for now. That's because they're massively subsidized, and they're going to be paid for by ads.
>> It's coming.
>> Oh, I'm hoping there's a very clear indication of what's an ad and what's not an ad in that context. But
>> That's something I mentioned a few years ago. Say you're looking for a new running shoe: well, is it a coincidence that Nike maybe comes up first? Maybe, maybe not. I think there are clear laws around this — you have to be clear about that — but I think that's what everyone fears: the subtle message in there, or something like that. But it also brings us to the topic of ads, which I think is a thing OpenAI tried to launch in 2025, because it's still not making money the other way right now. So, having real ad spots in there. The thing, though, is they couldn't, because there are alternatives without ads, and people would just flock to the other products. And it's just crazy how they're one-upping each other, spending so much money just to get the users.
>> I think so. Take Instagram
ads. I don't use Instagram, but I
understand the appeal of paying a platform to find users who will genuinely like your product. That is the best case of things like Instagram ads. But there are also plenty of cases where advertising is very awful for incentives. A world where the power of AI can integrate with that positive view — I am a person, I have a small business, I want to make the best, I don't know, damn steak knives in the world, and I want to sell them to somebody who needs them — if AI can make that sort of advertising work even better, that's very good for the world, especially with digital infrastructure, because that's how the modern web has been built. But that's not to say that addicting feeds, so that you can show people more content, are a good thing. I think that's even what OpenAI would say: they want to find a way to make the monetization work, outside of ads, while still giving their users agency.
>> And I personally would think that Google is probably going to be better at figuring out how to do this, because they already have ad supply; if they figure out how to turn the demand in their Gemini app into useful ads, they can turn it on. Somebody will figure it out — I don't know if it's this year, but there will be experiments with it.
>> I do think what holds companies back right now is really just that the competition is not doing it. It's more of a reputation thing. I think people are just afraid right now of ruining or losing their reputation, losing users, because it would make headlines if someone launched these ads.
>> Unless they were great. But the first ads won't be great, because it's a hard problem that we don't know how to solve.
>> Yeah, I think the first version of that will likely be something like the timeline on X, where you sometimes have a promoted post in between. It will say "promoted" or something small, and then there will be an image or something. I think right now the problem is who makes the first move. If we go 10 years out, the proposition for ads is that you will make so much money on ads, by having so many users, that you can use it to fund better R&D and make better models, which is why
>> YouTube is dominating the
>> market. Netflix is scared of YouTube. I pay $28 a month for Premium; they make at least $28 a month off of me and many other people, and they're creating such a dominant position in video. So I think that's the proposition: ads can give you a sustained advantage in what you're spending per user. But there's so much money in it right now that somebody starting that flywheel is scary, because it's a long-term bet.
Do you think there'll be some crazy big moves this year business-wise, like Google or Apple acquiring Anthropic or something like that?
>> Dario will never sell, but we are starting to see some types of consolidation, with Groq for $20 billion and Scale AI for almost $30 billion and countless other deals like this. They're structured in a way that is actually detrimental to the Silicon Valley ecosystem: this sort of licensing deal where not everybody gets brought along, rather than a full acquisition that benefits the rank-and-file employees by getting their stock vested. That's a big issue for Silicon Valley culture to address, because the startup ecosystem is the lifeblood. If you join a startup, even if it's not that successful, your startup very well might get acquired at a cheap premium and you'll get paid out for your equity, and these licensing deals are essentially taking the top talent a lot of the time. The Groq deal with Nvidia is rumored to be better for the employees, but it is still this antitrust-avoiding thing. I think this trend of consolidation will continue. Me and many smart people I respect have been expecting consolidation to have happened sooner, but it seems like some of these things are starting to turn. At the same time, you have companies raising ridiculous amounts of money for reasons that I don't like. I'm like, I don't know why you're taking that money. So it's maybe mixed this year, but some consolidation pressure is starting.
>> What kind of surprising consolidation do you think we'll see? You say Anthropic is a never. I mean, Groq is a big one. Groq with a Q, by the way.
>> Yeah, there are just a lot of startups, and there's a very high premium on AI startups, so there can be a lot of acquisitions in the $10 billion range, which is a really big acquisition for a startup that was maybe founded a year ago. Manus AI, this company based in Singapore that Meta bought, was founded about eight months before it had a $2 billion exit. I think there will be some other big multi-billion-dollar acquisitions; people have rumored some of them to Apple. There's a lot of pressure and liquidity in AI. There's pressure on big companies to have outcomes, and I would guess that a big acquisition gives people leeway to then tell the next chapter of that story.
>> I mean, yeah, I guess Cursor. We've been talking about code, and somebody acquires Cursor.
>> They're in such a good position by having so much user data.
>> Yeah.
>> And we talked about continual learning and stuff. They had one of the most interesting two sentences in a blog post, which is that their new Composer model was a fine-tune of one of these large mixture-of-experts models from China. You can know that by asking around for gossip, or because the model sometimes responds in Chinese, which none of the American models do. And they had a blog post where they said they're updating the model weights every 90 minutes based on real-world feedback from people using it, which is the closest thing to real-world RL happening on a model. It's just in one of their blog posts, which is super cool.
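As a rough illustration of the kind of loop being described, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names, the idea of upweighting accepted completions, and the cadence taken from the blog post's "every 90 minutes" description; none of this is Cursor's actual pipeline.

```python
# A minimal, hypothetical sketch of a continual-update loop: periodically
# fine-tune deployed weights on fresh user feedback. All names here are
# illustrative assumptions, not any company's real pipeline.
import time

UPDATE_INTERVAL_S = 90 * 60  # "every 90 minutes", per the blog post

def collect_feedback():
    # Stand-in: pull (prompt, completion, was_accepted) tuples from logs.
    return [("fix this off-by-one", "patch ...", True),
            ("rename this function", "diff ...", False)]

def fine_tune(checkpoint, batch):
    # Stand-in: keep only completions users accepted and take one
    # training step on them (a rejection-sampling-style update).
    accepted = [(p, c) for p, c, ok in batch if ok]
    print(f"training on {len(accepted)} accepted samples")
    return checkpoint + 1  # a real system would write new weights

checkpoint = 0
for _ in range(3):  # in production this would run indefinitely
    checkpoint = fine_tune(checkpoint, collect_feedback())
    time.sleep(0.1)  # stand-in for time.sleep(UPDATE_INTERVAL_S)
```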
>> And by the way, I should say I use Composer a lot, because one of the benefits it has is that it's fast.
>> I need to try it, because everybody says this.
>> And there'll be some IPOs potentially. You think Anthropic, OpenAI, xAI?
>> They can all raise so much money so easily that they don't feel a need to. As long as fundraising is easy, they're not going to IPO, because public markets apply pressure. I think we're seeing in China that the ecosystem is a little different, with both MiniMax and Z.ai filing IPO paperwork, and it will be interesting to see how the Chinese market reacts. I actually would guess it's going to be similarly hypey to the US, so long as all this is going and not based in the reality that they're both losing a ton of money. I wish more of the gigantic American AI startups were public, because it would be very interesting to see how they're spending their money and have more insight, and also just to give people access to investing in them, because they're some of the most formidable companies, the companies of the era. The tradition now is for so many of the big startups in the US to not go public. We're still waiting for Stripe to IPO, and Databricks definitely didn't; they raised like a Series G or something. I just feel like it's kind of a weird equilibrium for the market. I would like to see these companies go public and evolve in the way that a company can.
>> You think 10 years from now some of the frontier model companies, Anthropic, OpenAI, are still around?
>> I definitely don't see it as winner-takes-all, unless there truly is some algorithmic secret that one of them finds that unlocks the flywheel, because the development path is so similar for all of them. Google and OpenAI have all the same products, and Anthropic is more focused, but when you talk to people, it sounds like they're solving a lot of the same problems. And the offerings will spread out. It's a very big cake that's being made that people are going to take money out of.
>> I don't want to trivialize it, but OpenAI and Anthropic are primarily LLM service providers, and some of the other companies, like Google, and xAI linked to X,
>> do other stuff too.
>> So it's very possible, if AI becomes more commodified, that the companies that are just providing LLMs will die.
>> The advantage they have is a lot of users, and I think they will just pivot. Anthropic, I think, pivoted; I don't think they originally planned to work on code, but they found, okay, this is a nice niche, and now we are comfortable in this niche and we push on it. And I can see the same thing happening elsewhere. Let's say, hypothetically speaking, and I'm not sure it will be true, but let's say Google takes all the market share of the general chatbot; maybe OpenAI will then focus on some other subtopic. They have too many users to go away in the foreseeable future, I think.
>> I think Google is always ready to say hold my beer with AI Mode.
>> I think the question is whether the companies can support the valuations. I see the AI companies being looked at in some ways like AWS, Azure, and GCP, which all compete in the same space and are all very successful businesses. There's a chance the API market is so unprofitable that they go up and down the stack to products and hardware; they have so much cash that they can build power plants and data centers, which is a durable advantage now. But there's also a reasonable outcome where these APIs are so valuable and so flexible for developers that they become something like AWS. But AWS and Azure are also going to have these APIs, so five or six players competing in the API market is hard, and maybe that's why they get squeezed out.
>> You mentioned RIP Llama. Is there a path to winning for Meta?
>> I think nobody knows. They're moving a lot. They're signing licensing deals with Black Forest Labs, which is an image-generation company, or Midjourney, or acquiring Manus. So on the product and consumer-facing AI front, it's too early to tell. I think they have some people who are excellent and very motivated, being close to Zuckerberg, so there's still a story to unfold there. Llama is a bit different. Llama was the most focused expression of the organization, and I don't see Llama being supported to that extent. It was a very successful brand for them, so they still might participate in the open ecosystem or continue the Llama brand on a different surface, because people know what Llama is.
>> You think there's a Llama 5?
>> Not an open-weight one. It's interesting. Just to recap a bit: Llama was, I would say, the pioneering open-weight model, and Llama 1, 2, and 3 got a lot of love. Then, just hypothesizing or speculating, I think the leaders at Meta, the upper executives, got really excited about Llama because they saw how popular it was in the community. And I think the problem was trying to, let's say, not monetize the open source, but use the open source to make a bigger splash, to force it almost. It felt forced, developing these very big Llama 4 models to be at the top of the benchmarks. But I don't think the goal of the Llama models was to be on top of the benchmarks, beating other models. The goal was to have a model that people can use, trust, modify, and understand, and that includes having smaller models; they don't have to be the best models. What happened was that the benchmarks suggested the models were better than they were, because I think they had specific models trained on preferences so that they performed well on the benchmark. It's kind of an overfitting thing, forcing it to be the best. And at the same time, they didn't do the small models that people could use; no one could run these big models. So there was kind of a weird thing, and I think it's just because people got too excited about headlines, pushing the frontier.
>> And too much on the benchmaxing side.
>> Yeah, too much.
>> I think it imploded under internal political fighting and misaligned incentives. The researchers want to build the best models, but there's a layer of organization and management that is trying to demonstrate that they do these things. And there are a lot of pieces and rumors about how some horrible technical decision was made, and it just seems like it got bad enough that it all crashed out.
>> But we should also give huge props to Mark Zuckerberg. I think it comes from Mark Zuckerberg, from the top of the leadership, saying open source is important. The fact that that exists means there could be a Llama 5 where they learn the lessons from the benchmaxing and say, we're going to be like gpt-oss and provide a really awesome library of open source. What people say is that there's a debate between Mark and Alexandr Wang, who is very bright but much more against open source. To the extent that he has a lot of influence over the AI org, it seems much less likely, because it seems like Mark brought him in as fresh leadership to direct AI, and if open versus closed is no longer the defining nature of the model, I don't expect that to be a defining argument between Mark and Alex. They're both very bright. But I have a hard time understanding all of it, because Mark wrote this piece in July of 2024, which was probably the best blog post at the time, making the case for open source AI, and then July 2025 came around and it was, we're re-evaluating our relationship with open source. But I think we may also have been a bit too harsh, we as open source developers, the open source community, and that caused some of it. Even though the model was maybe not what everyone hoped for, it got a lot of backlash, and I think that was a bit unfortunate, because I can see that as a company they were hoping for positive headlines; instead of getting no headlines, or not these positive headlines, they got negative headlines, and it all reflected badly on the company. So it's maybe almost a spite reaction: okay, we tried to do something nice, we tried to give you something cool like an open source model, and now you're being negative about us. So in that sense it looks like, well, maybe then we'll change our mind. I don't know.
>> Yeah, that's where the dynamics of discourse on X can lead us astray as a community, because sometimes it feels random which things people decide they like or don't like.
>> And you can see the same thing with Grok 4.1 and Grok Code Fast 1. I don't think, vibe-wise, people love it publicly.
>> Mhm.
>> But a lot of people use it. If you look at Reddit and X, the programming community doesn't really give it praise,
>> but they use it. And the same thing probably with Llama. I don't understand the dynamics of either positive hype or negative hype. I don't understand it.
One of the stories of 2025 is the US feeling the gap left by Llama, which is the rise of all these Chinese open-weight models, to the point where that was the single issue I've spent a lot of energy on in the last five months, trying to do policy work to get the US to invest in this.
>> Tell me the story of Atom.
>> The Atom Project started as me calling it the American DeepSeek Project, which doesn't really work for DC audiences. It's the story of what is the most impactful thing I could do with my career, which is that the Chinese open-weight models are cultivating a lot of power, and there is a lot of demand for building on these open models, especially in enterprises in the US that are very cagey about these Chinese models.
>> Going to Perplexity: the Atom Project, American Truly Open Models, is a US-based initiative to build and host high-quality, genuinely open-weight AI models and supporting infrastructure, explicitly aimed at competing with, and catching up to, China's rapidly advancing open-source AI ecosystem.
>> I think the summary would be one or two sentences. One is the proposition that open models are going to be an engine for AI research, because that is what people start with; therefore, it's important to own them. And the second is that the US should therefore be building the best models, so that the best research happens in the US and US companies capture the value of being the home of where AI research is happening. Without more investment in open models, and we have all the plots on the website showing this, it's all these excellent models from Chinese companies that are cultivating influence in the US, in China, and internationally. And the US is spending way more on AI. The ability to create open models that are half a generation or a generation behind the cutting edge of the closed labs costs on the order of a hundred million dollars, which is a lot of money, but not a lot of money to these companies. So we need a centralizing force of people who want to do this. And I think we got signed engagement from people pretty much across the full stack, including policy.
>> So there has been support from the administration.
>> I don't think anyone technically in government has signed it publicly, but I know that people who have worked on AI policy in both the Biden and Trump administrations are very supportive of trying to promote open source models in the US. For example, AI2 got a grant from the NSF for a hundred million dollars over four years, which is the biggest CS grant the NSF has ever awarded, and it's for AI2 to attempt this. I think it's a starting point.
>> But the best thing happens when there are multiple organizations building models, because they can cross-pollinate ideas and build this ecosystem. I don't think it works if it's just Llama releasing models to the world, because then, as you can see, Llama can go away. The same thing applies for AI2: I can't be the only one building models. And it becomes a lot of time spent talking to people, whether they're in policy or elsewhere. I know Nvidia is very excited about this; I think Jensen Huang has been specifically talking about the urgency of it, and they've done a lot more in 2025, where the Nemotron models are more of a focus. They've started releasing some data along with Nvidia's open models, and very few companies do this, especially of Nvidia's size. So there are signs of progress. And we hear about Reflection AI, where they say their $2 billion fundraise is dedicated to building US open models, and their announcement tweet reads like a blog post read out loud. I think that cultural tide is starting to turn. In July we had four or five DeepSeek-caliber Chinese open-weight models and zero from the US, and that's the moment when I released this and thought, I guess I have to spend energy on this, because nobody else is going to do it.
>> So it takes a lot of people contributing together, and I'm not saying the Atom Project alone is the thing that's moving the ecosystem; it's people like me doing this sort of thing to get the word out.
Do you like the 2025 America's AI Action Plan, which includes open source stuff? The White House AI Action Plan includes a dedicated section titled "Encourage Open-Source and Open-Weight AI," defining such models and arguing they have unique value for innovation and startups.
>> Yeah, the AI Action Plan is a plan, but largely I think it's maybe the most coherent policy document that has come out of the administration, and I hope it largely succeeds. I know people who have worked on the AI Action Plan, and the challenge is taking policy and making it real; I have no idea how to do that as an AI researcher. But largely, a lot of things in it were very real. There's a huge buildout of AI in the country, and there are a lot of issues people are hearing about, from water use on down. We should be able to build things in this country, but we also need to not ruin places in our country in the process of building, and that's worthwhile to spend energy on. That's a role the federal government plays: they set the agenda, and with AI, setting the agenda that open weights should be a first consideration is a large part of what they can do.
>> And then people think about it also for education and talent for these companies, which I think is very important, because otherwise, if there are only closed models, how do you get the next generation of people contributing? At some point you would only be able to learn after you joined a company, but then how do you hire talented people, how do you identify talented people? I think open source is valuable for a lot of things, but even just for educating the population and training the next generation of researchers, because it's the way, or the only way.
>> The way I could have gotten this to go more viral would have been to tell a story of Chinese AI integrating with an authoritarian state, becoming ASI, and taking over the world, and therefore we need our own American models. But it's very intentional that I talk about innovation and science in the US, because I think that's both more realistic as an outcome and a world I would like to manifest.
>> I would say, though, that any open-weight model, I do think, is a valuable model.
>> Yeah. And my argument is that we should be in a leading position. But I think it's worth saying it that simply, because there are still voices in the AI ecosystem that say we should consider banning the release of open models due to the safety risks. And it's worth adding that I think that's effectively impossible without the US building its own great firewall, which is also known to not work that well, because the cost of training these models, whether it's $1 million or $100 million, is attainable for a huge number of people in the world who want to have influence. So these models will be getting trained all over the world, and, while there are safety concerns, we want this information and these tools to flow freely across the world and into the US, so that people can use them and learn from them. Stopping that would be such a restructuring of our internet that it seems impossible.
Do you think maybe, in that case, the big open-weight models from China are actually a good thing, in a sense, for the US companies? You mentioned earlier that they are usually one generation behind in terms of what they release open source versus what they are using. For example, gpt-oss might not be the cutting-edge model, Gemma 3 might not be, but they do that because they know it is safe to release. But then, when these companies see, for example, that there is DeepSeek V3.2, which is really awesome and gets used, and there is no backlash, no security risk, that could in turn encourage them to release better models. Maybe that, in a sense, is a very positive thing.
>> A hundred percent. These Chinese companies have set things into motion that I think would potentially not have happened if they were not all releasing models.
>> I'm almost sure those discussions have been had by leadership.
>> Is there a possible future where the dominant AI models in the world are all open source?
>> It depends on the trajectory of progress that you predict. If you think saturation in progress is coming within a few years, essentially within the window where financial support is still very good, then open models will be so optimized and so much cheaper to run that they will win out. This goes back to open-source ideas: so many more people will be putting money into optimizing the serving of these open-weight common architectures that they will become standards, and then you could have chips dedicated to them, and it will be way cheaper than the custom offerings from the closed companies.
We should say that the AI 2027 report predicts, from a narrative perspective, that there will be a lot of centralization. As the AI system gets smarter and smarter, the national security concerns come to the fore, the labs get centralized and become super secretive, and there's this whole race from a military perspective between China and the United States. And so, amid all these fun conversations we're having about LLMs, the generals and the soldiers will come into the room and be like, "All right, we're now in the Manhattan Project stage of this whole thing."
>> I don't think something like that is even remotely possible. I mean, you can make the same argument for computers, right? You can say, okay, computers are capable and we don't want the general public to get them, or chips, even AI chips. But you see how Huawei makes chips now; it took a few years. I don't think there is a way you can contain knowledge like that. In this day and age it is impossible, like the internet. I don't think this is a possibility.
I don't think this is a possibility
>> on the Manhattan project thing one of my
funny things making Adam is I think that
like a Manhattan project like thing for
open models would actually be pretty
reasonable because it wouldn't cost that
much but I think that that will Um, but
it seems like culturally the companies
are changing. But I agree with Sebastian
and all the stuff that you just said.
It's just like I don't see it happening
nor being helpful. Yeah. I mean the the
motivating force behind the Manhattan
project is there was civilizational
risk. I It's harder to motivate that for
open source models.
>> There's not civilizational risk.
>> On the hardware side, we mentioned Nvidia a bunch of times. Do you think Jensen and Nvidia are going to keep winning?
>> I think they have the downside that they have to iterate a lot and manufacture a lot. They do innovate, but there's always the chance that somebody does something fundamentally different and gets very lucky. The problem, though, is adoption. The moat of Nvidia is probably not just the GPU; it's the CUDA ecosystem, and that has evolved over two decades. Even back when I was a grad student, I was in a lab where we did biophysical simulations, molecular dynamics, and we had a Tesla GPU back then just for the computations; that was 15 years ago now. They built this up over a long time, and that's the moat, I think. It's not the chip itself, although they now have the money to iterate and build and scale; it's really the compatibility. If you're at that scale as a company, why would you go with something risky where it's only a few chips that they can make per year? You go with the big one. But then I do think with LLMs now, it will be easier to design something like CUDA. It took 15 years because it's hard, but now we have LLMs; we can maybe replicate CUDA.
>> And I wonder if there will be a separation of training and inference compute as we stabilize a bit and more and more compute is needed for inference.
>> Mhm.
>> That's supposed to be the point of the Groq acquisition, and that's part of what Vera Rubin is, where they have a new chip with no high-bandwidth memory, or very little, which is one of the most expensive pieces. It's designed for prefill, the part of inference where you essentially do a lot of matrix multiplications; you only need the memory when you're doing the autoregressive generation and you have the KV cache swaps. So they have this new GPU designed for that specific use case, and the cost of ownership per flop is actually way lower.
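To make the prefill-versus-decode distinction concrete, here is a minimal sketch in Python with NumPy. It assumes a single attention head with no masking or projections, so it illustrates the workload shapes rather than any vendor's kernel: prefill is one large matrix multiplication over the whole prompt (compute-bound), while decode issues tiny matmuls that must re-read a growing KV cache every step (memory-bandwidth-bound).

```python
# Minimal sketch: why prefill and decode stress hardware differently.
# Single attention head, no masking or projections (illustrative only).
import numpy as np

d = 64  # head dimension (arbitrary for this sketch)

def attention(q, k, v):
    # q: (m, d), k and v: (n, d) -> output: (m, d)
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Prefill: the whole 512-token prompt is processed in one dense pass,
# dominated by big matrix multiplications (compute-bound).
prompt = np.random.randn(512, d)
k_cache, v_cache = prompt.copy(), prompt.copy()  # cache K/V for decode
_ = attention(prompt, k_cache, v_cache)

# Decode: each new token is a (1, d) query, but every step re-reads the
# entire KV cache, so memory capacity and bandwidth dominate.
for _ in range(16):
    q = np.random.randn(1, d)  # stand-in for the newly generated token
    out = attention(q, k_cache, v_cache)
    k_cache = np.vstack([k_cache, q])    # cache grows by one row per step
    v_cache = np.vstack([v_cache, out])
print(k_cache.shape)  # (528, 64): per-step memory traffic keeps growing
```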
But I think Nvidia's fate still lies in the diffusion of AI. Their biggest clients are still these hyperscale companies: Google obviously can make TPUs, Amazon is making Trainium, Microsoft will try to do its own things.
So long as the pace of AI progress is high, Nvidia's platform is the most flexible and people will want that. But if there's stagnation, there's more time to create bespoke chips.
>> It's interesting that Nvidia is quite active in trying to develop all kinds of different products. They've tried to create areas of commercial value that will use a lot of GPUs.
>> And they keep innovating; they're doing a lot of incredible research.
>> Everyone says the company is super oriented around Jensen and how operationally plugged in he is, and it sounds unlike many other big companies I've heard about. So long as that's the culture, I expect them to keep making progress. He's still in the Steve Jobs era of Apple. So long as that is how it operates, I'm pretty optimistic about their situation, because it is their top-order problem, and I don't know if making these chips for the whole ecosystem is the top goal of all of these other companies. They will do a good job, but it might not be as good a job.
Since you mentioned Jensen, I've been reading a lot about history and about singular figures in history. What do you guys think about the single-man, single-woman view of history? How important are individuals in steering the direction of history in the tech sector? What's Nvidia without Jensen? You mentioned Steve Jobs: what's Apple without Steve Jobs? What's xAI without Elon,
or DeepMind without Demis?
>> People make things happen earlier and faster. Scientifically, many great scientists credit being in the right place at the right time and still making the innovation; eventually someone else would have had the idea. So in that way, Jensen is helping manifest this GPU revolution much faster and with much more focus than it would have without a person there, and this is making the whole AI buildout faster. But I do still think that eventually something like ChatGPT would have happened, and a buildout like this would have happened; it probably just would not have been as fast. I think that's the flavor of it. These individual people are placing bets on something. Some get lucky, some don't. But if you don't have these people at the helm, it will be more diffuse. It's almost like investing in an ETF versus individual stocks: individual stocks might go up or down more heavily than an ETF, which is more balanced and will eventually go up over time. We'll get there. But focus, I think, is the thing. Passion and focus.
>> Isn't there a real case to be made that, without Jensen, there's no reinvigoration of the deep learning revolution?
>> It could have been 20 years later, is the thing I would say.
>> Yeah. Yeah, 20 years.
>> Or another deep learning winter could have come.
>> If GPUs weren't around, that could change history completely, because you could think of all the other technologies that could have come in the meantime, and the focus of human civilization, of Silicon Valley, could have been captured by a different hype. I mean, there's certainly an aspect where the GPU trajectory was all planned, but on the other hand there were also a lot of lucky coincidences, or good intuition, like the investment into, let's say, biophysical simulations. I think it started with video games, and GPUs just happened to be good at linear algebra because video games require a lot of linear algebra, and then you got the biophysical simulations. But still, I don't think the master plan was AI. It just happened that someone, Alex Krizhevsky, took these GPUs and said, hey, let's try to train a neural network on this, and it happened to work really well. And I think it only happened because you could purchase those GPUs.
>> Gaming would have created a demand for faster processors even if Nvidia had gone out of business in the early days.
>> Mhm.
>> That's what I would think. The GPUs would have been different for AlexNet, but GPUs would still have existed at the time of AlexNet and at the time of the Transformer. It was just hard to know whether it would be one company this successful, or multiple smaller companies with worse chips. But I don't think that's a hundred-year delay; it might be a decade delay.
>> Well, it could be a one, two, three, four, five decade delay. I just can't see Intel or AMD doing what Nvidia did.
>> I don't think it would be a company that exists. I think a different company would ride it.
>> Like Silicon Graphics or something.
>> So, yeah, some company that has since died would have done it.
>> But just looking at it, it seems like these singular figures, these leaders, have a huge impact on the trajectory of the world. Obviously there are incredible teams behind them, but having that kind of very singular, almost dogmatic focus
>> is necessary to make progress.
>> Yeah. I mean, even GPT wouldn't exist if there hadn't been a person, Ilya, who pushed for this scaling, right?
>> Yeah, Dario was also deeply involved in that. You read some of the histories of OpenAI, and it almost seems wild thinking about how early these people were, like, "We need to hook up 10,000 GPUs and take all of OpenAI's compute and train one model." There were a lot of people there who didn't want to do that. It's an insane thing to believe in scaling before scaling gave any indication that it was going to materialize.
Again, singular figures.
Speaking of which: 100 years from now, presumably post-singularity, whatever the singularity is, when historians look back at our time, what technological breakthroughs will they really emphasize as the ones that led to the singularity? So far we have Turing to today, 80 years.
>> I think it would still be computing, the umbrella term. I don't necessarily think that even 100 or 200 years from now it would be AI; it could well still be computers. We are just now taking better advantage of computers, but the fact of computing is the thing.
>> So it's basically a Moore's law kind of discussion. Even the details of code and GPUs won't be remembered, nor all this software turmoil. It'll just be, obviously, compute.
>> I generally agree, but can the connectivity of the internet and compute be merged, or are they both distinct?
>> The internet will probably be related to communication; it could be phone, internet, satellite, that stuff. And compute is more the scaling aspect of it.
>> It's possible that the internet is completely forgotten, that it gets wrapped into the phone networks, the communication networks, as just another manifestation of those, and the real breakthrough comes from the increased compute, Moore's law broadly defined.
>> Well, I think that connection of people is very fundamental to it. If you want to find the best person in the world for something, they are somewhere in the world, and being able to have that flow of information matters. The AIs will also rely on this. I've been fixating on, when I said the dream of the one central model was dead, the thing that is evolving: people have many agents for different tasks, people start doing this with different clouds for different tasks, and it's described as many AGIs in the data center, where each one manages something and they talk to each other. That is so reliant on networking and the free flow of information on top of compute. And networking, especially with GPUs, is such a part of scaling up compute; the GPUs and the data centers need to talk to each other.
Will anything about neural networks be remembered? Do you
think there's something very specific and singular to the fact that it's neural networks that's seen as the breakthrough, that you're basically replicating, in a very crude way, the structure of the human brain, the human mind?
>> I think without the human mind we probably wouldn't have neural networks, because it was the inspiration. But on the other hand, it's still so different, digital versus biological, that I think it will probably be grouped as an algorithm
>> that's massively parallelizable on this particular kind of compute.
>> It could well have been something like genetic algorithms, just this parallelized thing. It just happens that this is more efficient and works better, you know.
>> And it very well could be that the LLM, the neural networks as we architect them now, is just a small component of the system that leads to the singularity.
>> If you think of it 100 years out, I think society can be changed more with more compute and intelligence because of autonomy. But look at it this way: what are the things from the industrial revolution that we remember? The engine is probably the equivalent of the computer here, but there are a lot of other physical transformations people are aware of, like the cotton machines, and things that are still known, like air conditioning and refrigerators. Some of these things from AI will still be known; the word "transformer" could very well still be known. I would guess that deep learning is definitely still known, but the transformer might be evolved away from in 100 years, with ASI AI researchers everywhere. But I think deep learning is likely to be a term that is remembered.
>> And I wonder what the air conditioning and the refrigeration of the future that AI brings will be. If we travel
forward 100 years from now, and we transport there right now, what do you think is different? How does the world look different? First of all, do you think there are humans? Do you think there are robots walking around everywhere?
>> I do think specialized robots for certain tasks, for sure.
>> Humanoid form?
>> Maybe half humanoid; we'll see. For certain things, yes, there will be humanoid robots, because the environment is amenable to them, and for certain tasks it might make sense. What's harder to imagine is how we interact with devices and what humans do with them. It will probably not be the cell phone, probably not the laptop. Will it be, you know, implants?
>> I mean, it has to be brain-computer interfaces, right? 100 years from now, given the progress we're seeing, there has to be, unless there's legitimately a complete alteration of how we interact with reality.
>> On the other hand, if you think of cars, cars are older than 100 years, right? And it's still the same interface. We haven't replaced cars with something else; we just made the cars better. It's still a steering wheel, still wheels, you know.
>> I think we'll still carry around a physical brick of compute, because people want some ability to have something private. You might not engage with it as much as a phone, but having something where you can keep private information that is yours, as an interface between you and the rest of the internet, is something I think people will still want. It might not look like an iPhone and it might be used a lot less, but I still expect people to carry things around.
>> Why do you think the smartphone is the embodiment of private? There's a camera on it.
>> Private for you: encrypted messages, encrypted photos. You know what your life is like.
>> I guess this is a question of how optimistic you are about brain-machine interfaces. Is all of that just going to be stored in the cloud, your whole calendar? It's hard to think about processing all the information that we can process visually through brain-machine interfaces presenting something like a calendar to you. It's hard to think about knowing without looking, you know, your email inbox: you signal to a computer and then you just know your email inbox. Is that something the human brain can handle being piped into it non-visually? I don't know exactly how those transformations happen, because humans aren't changing in 100 years.
>> I think agency and community are things that people actually want.
>> Local community.
>> People you are close to: being able to do things with them, and being able to ascribe meaning to your life and to what you do. I don't think human biology is changing away from those on a time scale that we can discuss, and I think UBI does not solve agency. I do expect mass wealth, and I hope it is spread so that the average life does look very different in 100 years, but that's still a lot to happen in 100 years. If you think about countries that are early in their development process of getting access to computing and the internet, building all the infrastructure, and having policy that shares one nation's wealth with another, I think it's an optimistic view to see all of that happening in 100 years while they are still independent entities and not just absorbed into some international order by force.
>> But
there could just be better, more elaborate, more effective social support systems that help alleviate some levels of basic suffering in the world. You know, there's the transformation of society where a lot of jobs are lost in the short term. We have to really remember that each individual job that's lost is a human being who's suffering. When jobs are lost at scale, it's a real tragedy. You can make all kinds of arguments about economics, that it's all going to be okay, that it's good for the GDP, that there are going to be new jobs created. Fundamentally, at the individual level, for that human being, that's real suffering. That's a real personal tragedy, and we have to not forget that as these technologies are being developed. And also, my hope for all the AI slop we're seeing is that there will be a greater and greater premium on the fundamental aspects of the human experience that are in person, the things we all like: seeing each other, talking together in person.
>> The next few years are definitely going to see an increased value on physical goods and events, and even more pressure from slop.
>> So the slop is only starting; the next few years will bring more and more diverse versions of slop.
>> We'll be drowning in slop.
>> So I'm hoping that we, as a society, drown in slop enough to snap out of it and realize it just doesn't matter, we all can't deal with it, and then the physical has a much higher premium on it.
>> I honestly think this is true, and I think we get tired of it; we are already kind of tired of it. The same with art. I don't think art will go away. With physical paintings, there's more value, not just monetary value, but more appreciation for the actual painting than for a photocopy of it. It could be a perfect digital reprint, but there is something about going to a museum and looking at that art and seeing the real thing, thinking, okay, a human made this; it's a craft, and you have appreciation for that. I think the same is true for writing, for talking, for any type of experience. I do, unfortunately, think it will be a fork, where some things will be automated. There are not as many paintings as there were 200 years ago; there are more photographs, more photocopies. But at the same time it won't go away; there will be value in it. The difference will just be the proportion. Personally, I have a hard time reading things where I can obviously see they're AI-generated. I'm like, sorry, there might be really good information in there, but I have a certain nah, not for me.
>> I think eventually they'll fool you, and it'll be on platforms to give ways of verifying or building trust. So you will trust that Lex is not AI-generated, having been here; you have trust in this channel, but it's harder for new people who don't have that trust.
>> Well, that will get interesting, because I think it's fundamentally a solvable problem by having trust in certain outlets that they won't do it, but it's all going to be trust-based. There will be some systems to authorize: this is real, this is not real. There will be some telltale signs where you can obviously tell this is AI-generated and this is not, but some will be so good that it's hard to tell, and then you have to trust. That will get interesting and a bit problematic.
>> The extreme case of this is to watermark all human content, so all photos that we take on our own have some watermark until they are edited or something like that, and the software can manage communications with the device manufacturer to maintain a record of human editing,
>> which is the opposite of the discussion about trying to watermark AI images, where you can make a Google image that has a watermark and then use a different Google tool to remove the watermark.
>> Yeah, it's going to be an arms race, basically. We've been mostly focusing on the positive aspects of AI, but all the capabilities we've been talking about can also be used to destabilize human civilization, with even just relatively dumb AI applied at scale, and then with further and further superintelligent AI systems. Of course, there's the doomer take that's important to consider a little bit as we develop these technologies. What gives you hope about the future of human civilization, about everything we've been talking about? Are we going to be okay?
>> I think we will. I'm definitely a worrier, both about AI and non-AI things, but humans do tend to find a way. That's what humans are built for: to have community and find a way to figure out problems, and that's what has gotten us to this point. I think the opportunity of AI and related technologies is really big, and there are big social and political problems in helping everybody understand that. That's what we're staring at right now: the world is a scary place, AI is a very uncertain thing, and it takes a lot of work that is not necessarily building things. It's telling people and understanding people, which the people building AI are historically not motivated to do or wanting to do. But it is probably doable; it will just take longer than people want. We have to go through that long period of hard AI discussions if we want to have the lasting benefits.
>> Yeah. Through that process, I'm especially excited that we get a chance to better understand ourselves, at the individual level as humans and at the civilization level, and to answer some of the big mysteries, like what is this whole consciousness thing going on here. It seems to be truly special; there's a real miracle in our minds, and AI puts a mirror to ourselves as we get to answer some of the big questions about what this whole thing is.
>> Well, one thing about that, and what I think makes us very different from AI, and why I don't worry about AI taking over, is, like you said, consciousness. We humans decide what we want to do. With AI in its current implementation, and I can't see this changing, you have to tell it what to do, so you still have the agency. It doesn't take the agency from you; it becomes a tool. You can think of it as a tool. You tell it what to do. It will be more automatic than previous tools, and it's certainly more powerful than a hammer; it can figure things out. But it's still you in charge, right? The AI is not in charge. You're in charge. You tell the AI what to do, and it does it for you.
>> So, in the post-singularity, post-apocalyptic war between humans and machines, you're saying humans are worth fighting for?
>> A hundred percent. I mean, this is essentially the movie Terminator they made in the '80s. The only thing I can see going wrong is, of course, if things are explicitly programmed to do something harmful. But in a Terminator type of setup, I actually think humans win.
>> Mhm.
>> I think we're too clever. It's hard to explain how we figure it out, but we do. And we'll probably be using local LLMs, open source LLMs, to help fight the machines.
I apologize for the ridiculousness. Like I said, Nathan already knows I've been a big fan of his for a long time, and I've been a big fan of yours, Sebastian, for a long time too, so it's an honor to finally meet you. Thank you for everything you put out into the world, thank you for the excellent books you're writing, thank you for teaching us, and thank you for talking today. This was fun.
>> Thank you for inviting us here and having this human connection, which is extremely valuable.
>> Thanks for listening to this
conversation with Sebastian Rashka and
Nathan Lambert. To support this podcast,
please check out our sponsors in the
description where you can also find
links to contact me, ask questions, give
feedback, and so on. And now let me
leave you with some words from Albert
Einstein.
It is not that I'm so smart, but I stay
with the questions much longer.
Thank you for listening and hope to see
you next time.