Cursor Team: Future of Programming with AI | Lex Fridman Podcast #447
oFfVt3S51T4 • 2024-10-06
Kind: captions
Language: en
The following is a conversation with the founding members of the Cursor team: Michael Truell, Sualeh Asif, Arvid Lunnemark, and Aman Sanger. Cursor is a code editor based on VS Code that adds a lot of powerful features for AI-assisted coding. It has captivated the attention and excitement of the programming and AI communities, so I thought this was an excellent opportunity to dive deep into the role of AI in programming. This is a super technical conversation that is bigger than just one code editor: it's about the future of programming and, in general, the future of human-AI collaboration in designing and engineering complicated and powerful systems. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Michael, Sualeh, Arvid, and Aman.

All right, this is awesome. We have Michael, Aman, Sualeh, and Arvid here from the Cursor team. First up, big, ridiculous question: what's the point of a code editor?

So, the code editor is largely
the place where you build software, and today, or for a long time, that's meant the place where you text-edit a formal programming language. For people who aren't programmers, the way to think of a code editor is as a really souped-up word processor for programmers. The reason it's souped up is that code has a lot of structure, and so the quote-unquote word processor, the code editor, can actually do a lot for you that word processors in the writing space haven't been able to do for people editing text there. That's everything from giving you visual differentiation of the actual tokens in the code so you can scan it quickly, to letting you navigate around the code base, sort of like you're navigating around the internet with hyperlinks, going to definitions of things you're using, to error checking to catch rudimentary bugs. Traditionally, that's what a code editor has meant, and I think that what a code editor is is going to change a lot over the next ten years, as what it means to build software maybe starts to look a bit different.

I think a code editor should also just be fun.

Yes, that is very important, and it's actually an underrated aspect of how we decide what to build. A lot of the things that we build, we try out, we do an experiment, and then we actually throw them out because they're not fun. A big part of being fun is being fast. A lot of the time, fast is fun.

Yeah, that should be a t-shirt.
Fundamentally, I think one of the things that draws a lot of people to building stuff on computers is this insane iteration speed, where in other disciplines you might be gated by resources, or by the ability even to get a large group together. Coding is this amazing thing where it's you and the computer, and with that alone you can build really cool stuff really quickly.

For people who don't know, Cursor is this super cool new editor that's a fork of VS Code. It would be interesting to get your explanation of your own journey of editors. I think all of you were big fans of VS Code with Copilot. How did you arrive at VS Code, and how did that lead to your journey with Cursor?
Yeah. I think a lot of us, well, all of us were originally Vim users.

Pure Vim?

Pure Vim, yeah. No Neovim, just pure Vim in a terminal. And, at least for myself, it was around the time Copilot came out, so 2021, that I really wanted to try it. So I went into VS Code, the only code editor in which it was available, and even though I really enjoyed using Vim, the experience of Copilot with VS Code was more than good enough to convince me to switch. And that kind of was the default until we started working on Cursor.

Maybe we should explain what Copilot does. It's like a really nice autocomplete: as you start writing a thing, it suggests one or two or three lines of how to complete it. And there's a fun experience in that, you know, like when you have a close friendship and your friend completes your sentences: when it's done well, there's an intimate feeling. There's probably a better word than intimate, but there's a cool feeling of, holy, it gets me. And then there's an unpleasant feeling when it doesn't get you. So there's that kind of friction, but I would say for a lot of people, the feeling that it gets me overpowers the feeling that it doesn't.

And I think actually one of the underrated aspects of GitHub Copilot is that even when it's wrong, it's a little bit annoying, but it's not that bad, because you just type another character and then maybe it gets you, or you type another character and then it gets you. So even when it's wrong, it's not that bad.

Yeah, you can sort of iterate and fix it. I mean, the other underrated part of Copilot for me was that it was just the first real AI product, the first language-model consumer product.

So Copilot was kind of like the first killer app for LLMs.

Yeah, and the beta was out in 2021.

Right, okay. So what's the origin story of Cursor?

So, around
2020, the scaling laws papers came out from OpenAI, and that was a moment where this looked like clear, predictable progress for the field, where even if we didn't have any more ideas, it looked like you could make these models a lot better if you had more compute and more data.

By the way, we'll probably talk for three to four hours on the topic of scaling laws, but just to summarize: it's a paper, and a set of papers and set of ideas, that say bigger might be better for model size and data size in the realm of machine learning.

It's bigger and better, but predictably better.

Okay, that's another topic of conversation, but anyway.

Yeah. So around that time, for some of us, there were a lot of conceptual conversations about what this is going to look like, what the story is going to be for all these different knowledge-worker fields, about how they're going to be made better by this technology getting better. And then I think there were a couple of moments where the theoretical gains predicted in that paper started to feel really concrete, and it started to feel like a moment where you could actually go and, if you wanted to do useful work in AI, not do a PhD; it actually felt like now there was this whole set of systems one could build that were really useful.

I think the first moment we already talked about a little bit, which was playing with the early beta of Copilot: that was awesome and magical. And I think the next big moment, where everything kind of clicked together, was actually getting early access to GPT-4. So, sort of end of 2022 was when we were tinkering with that model, and the step up in capabilities felt enormous. Previous to that, we had been working on a couple of different projects. Because of Copilot, because of scaling laws, because of our prior interest in the technology, we had been tinkering around with tools for programmers, but things that were very specific. So we were building tools for financial professionals who have to work within a Jupyter notebook, or playing around with, can you do static analysis with these models? And then the step up in GPT-4 felt like, look, that really made concrete the theoretical gains we had predicted before. It felt like you could build a lot more just immediately at that point in time.

And also, if we were being consistent, it really felt like this wasn't just going to be a point-solution thing: all of programming was going to flow through these models. It felt like that demanded a different type of programming environment, a different type of programming, and so we set off to build that sort of larger vision.
There's one moment around then that I distinctly remember. My roommate is an IMO gold medalist, and there's a competition in the US called the Putnam, which is sort of the IMO for college students; it's this math competition, and he's exceptionally good. So Shengtong and Aman, I remember, around June of 2022, had this bet on whether by June or July of 2024 a model was going to win a gold medal in the IMO.

IMO is the International Math Olympiad.

Yeah, IMO is the International Math Olympiad. Arvid and I had both also competed in it, so it was sort of personal. And I remember thinking, this is just not going to happen. Even though I sort of believed in progress, I thought, you know, IMO gold? Aman is just delusional. And, to be honest, I was, to be clear, very wrong, but that was maybe the most prescient bet in the group.

So with the new results from DeepMind, it turned out that you were correct.

Well, technically incorrect, but one point away.

Aman was very enthusiastic about this stuff back then. Before, Aman had this scaling-laws t-shirt that he would walk around with, where it had the charts and the formulas on it.

Oh, so you felt the AGI, or you felt the scaling laws.

Yeah. I distinctly remember there was this one conversation I had with Michael where, before, I hadn't thought super deeply and critically about scaling laws, and he kind of posed the question: why isn't scaling all you need, or why isn't scaling going to result in massive gains in progress? And I think I went through the stages of grief: there is anger, denial, and then finally, at the end, just thinking about it, acceptance. And I think I've been quite hopeful and optimistic about progress since.

One thing I'll caveat is that I think it also depends on which domains you're going to see progress in. Math is a great domain, especially formal theorem proving, because you get this fantastic signal of actually verifying whether the thing was correct, and so this means something like RL can work really, really well. And I think you could have systems that are very superhuman in math and still not technically have AGI.

Okay, so can we take
it all the way back to Cursor? What is Cursor? It's a fork of VS Code, and VS Code has been one of the most popular editors for a long time. Everybody fell in love with it; everybody left Vim. I left Emacs for it, sorry. It unified, in some fundamental way, the developer community. And then you look at the space of things, you look at the scaling laws, AI is becoming amazing, and you decided, okay, it's not enough to just write an extension for VS Code, because there are a lot of limitations to that. If AI is going to keep getting better and better and better, we need to really rethink how the AI is going to be part of the editing process. And so you decided to fork VS Code and start to build a lot of the amazing features we'll be able to talk about. What was that decision like? Because there are a lot of extensions to VS Code, including Copilot, that are doing AI-type stuff. What was the decision like to just fork VS Code?

So the decision to do an editor seemed kind of self-evident to us, at least for what we wanted to do and achieve. Because when we started working on the editor, the idea was: these models are going to get much better, their capabilities are going to improve, and it's going to entirely change how you build software, both in that you will have big productivity gains, but also radically, in that how the act of building software happens is going to change a lot. And you're very limited in the control you have over a code editor if you're a plugin to an existing coding environment, and we didn't want to get locked in by those limitations. We wanted to be able to just build the most useful stuff.
Okay, well then the natural question is: you know, VS Code with Copilot is kind of a competitor. How do you win? Is it basically just the speed and the quality of the features?

Yeah, I mean, I think this is a space that is quite interesting, perhaps quite unique, where if you look at previous tech waves, maybe there's kind of one major thing that happened, and it unlocked a new wave of companies. But every single year, every single jump in model capabilities, you now unlock this new wave of features, things that are possible, especially in programming. And so I think in AI programming, being even just a few months ahead, let alone a year ahead, makes your product much, much, much more useful. I think the Cursor a year from now will need to make the Cursor of today look obsolete. And, you know, Microsoft has done a number of fantastic things, but I don't think they're in a great place to really keep innovating and pushing on this in the way that a startup can.

Just rapidly implementing features.

Yeah, and kind of doing the research experimentation necessary to really push the ceiling.

I don't know if I think of it in terms of features as much as I think of it in terms of capabilities for programmers. As, you know, the new o1 model came out, and I'm sure there are going to be more models of different types, like longer context and maybe faster, there are all these crazy ideas that you can try, and hopefully 10% of the crazy ideas will make it into something kind of cool and useful, and we want people to have that sooner.

To rephrase: an underrated fact is that we're making it for ourselves. When we started Cursor, you really felt this frustration that, you know, you could see models getting better, but the Copilot experience had not changed. It was like, man, these guys, the ceiling is getting higher, why are they not making new things? They should be making new things. Where are all the alpha features? There were no alpha features. I'm sure it was selling well, I'm sure it was a great business, but I'm one of these people that really wants to try and use new things, and there was just no new thing for a very long while.

Yeah, it's interesting. I don't know how to put it into words, but when you compare Cursor with Copilot, Copilot pretty quickly started to feel stale for some reason.
Yeah. I think one thing that helps us is that we're sort of doing it all in one, where we're developing the UX and the way you interact with the model at the same time as we're developing how we actually make the model give better answers: how you build up the prompt, or how you find the context, or, for Cursor Tab, how you train the model. So I think that helps us have the same people working on the entire experience, end to end.

Yeah, it's like the person making the UI and the person training the model sit eighteen feet away.

Often the same person, even.

Yeah, often even the same person. So you can create things that are sort of not possible if you're not talking, not experimenting.

And you're using, like you said, Cursor to write Cursor?

Of course. Oh yeah, yeah.

Well, let's talk about some of
these features. Let's talk about the all-knowing, the all-powerful, praise be to the Tab: the, you know, autocomplete on steroids, basically. How does Tab work? What is Tab?

To highlight and summarize it at a high level, I'd say there are two things Cursor is pretty good at right now. There are other things it does, but two things it helps programmers with. One is this idea of looking over your shoulder and being like a really fast colleague who can kind of jump ahead of you and type and figure out what you're going to do next. That was the original idea, kind of the kernel of the idea behind a good autocomplete: predicting what you're going to do next. You can make that concept even more ambitious by not just predicting the characters after your cursor, but actually predicting the next entire change you're going to make: the next diff, the next place you're going to jump to. And the second thing Cursor is pretty good at right now is helping you sometimes jump ahead of the AI and tell it what to do, going from instructions to code. On both of those, we've done a lot of work on making the editing experience for those things ergonomic, and also making those things smart and fast.

One of the things
we really wanted was for the model to be able to edit code for us. That was kind of a wish, and we had multiple attempts at it before we had a good model that could edit code for you. Then, after we had a good model, there was a lot of effort to make the inference fast, to have a good experience.

We've also been starting to incorporate, Michael sort of mentioned this, the ability to jump to different places. That jump to different places, I think, came from a feeling of, you know, once you accept an edit, it should be just really obvious where to go next. It's like, I made this change; the model should just know that the next place to go is eighteen lines down. If you're a Vim user, you could press 18jj or whatever, but why am I even doing this? The model should just know it. So the idea was: you just press Tab, it would go eighteen lines down, show you the next edit, and you would press Tab again. So it's just you, as long as you could keep pressing Tab. The internal competition was: how many Tabs can we make someone press?

Once you have the idea, the thing to think about, more abstractly, is: how are the edits zero-entropy? Once you've expressed your intent, and the edit is such that there are no new bits of information to finish your thought, but you still have to type some characters to make the computer understand what you're actually thinking, then maybe the model should just read your mind, and all the zero-entropy bits should just be tabbed away. That was sort of the abstract framing.

There's this interesting thing
where, if you look at language-model loss on different domains, I believe the bits per byte, which is a kind of character-normalized loss, for code is lower than for natural language, which means that, in general, there are a lot of tokens in code that are super predictable, a lot of characters that are super predictable. And this is, I think, even magnified when you're not just trying to autocomplete code, but predicting what the user is going to do next in their editing of existing code. So the goal of Cursor Tab is: let's eliminate all the low-entropy actions you take inside of the editor. When the intent is effectively determined, let's just jump you forward in time, skip you forward.
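(To make the "bits per byte" comparison concrete, here is a minimal sketch of the normalization being described: converting a per-token cross-entropy loss into bits per byte of raw text, so that losses on code and prose are comparable across different tokenizers. The numbers below are hypothetical, chosen only to illustrate the direction of the claim.)

```python
import math

def bits_per_byte(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    # Cross-entropy is usually reported in nats per token; convert the total
    # to bits, then normalize by the raw byte count of the text so the number
    # no longer depends on the tokenizer's vocabulary.
    total_bits = loss_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes

# Hypothetical measurements: code tends to be more predictable than prose.
code_bpb = bits_per_byte(loss_nats_per_token=0.45, n_tokens=1000, n_bytes=3500)
prose_bpb = bits_per_byte(loss_nats_per_token=0.90, n_tokens=1000, n_bytes=4200)
assert code_bpb < prose_bpb  # lower bits/byte means more predictable text
```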
Well, what's the intuition, and what are the technical details, of how to do next-cursor prediction? That jump is not so intuitive, I think, to people.

Yeah, I can speak to a few of the details of how to make these things work. They need to be incredibly low-latency, so you need to train small models on this task. In particular, they're incredibly prefill-token hungry. What that means is they have these really, really long prompts, where they see a lot of your code, but they're not actually generating that many tokens. The perfect fit for that is a sparse model, meaning an MoE model. That was one breakthrough we made that substantially improved its performance at longer context. The other was a variant of speculative decoding that we built out, called speculative edits. These are two important pieces of what makes it quite high quality and very fast.
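(Speculative edits aren't spelled out in the conversation, but the idea of a "variant of speculative decoding" for editing can be sketched: since most of an edited file is unchanged, the original text itself makes a near-perfect draft, and the model only falls back to real generation where the edit actually happens. This toy version uses stand-in accept/generate callbacks and a same-length edit so draft and target stay aligned; a real implementation would verify whole chunks of draft tokens in a single batched forward pass.)

```python
def speculative_edit(original_tokens, accept_fn, generate_fn):
    """Toy speculative edit: treat the ORIGINAL file as the draft sequence.
    accept_fn(prefix, tok) -> would the model emit `tok` next? (a per-token
    check, standing in for verifying a whole draft chunk in one pass)
    generate_fn(prefix) -> the model's actual next token on a mismatch."""
    out = []
    for draft in original_tokens:
        if accept_fn(out, draft):
            out.append(draft)             # accepted: no decode step needed
        else:
            out.append(generate_fn(out))  # rejected: decode for real
    return out

# Stand-in "model" whose intended edit renames world -> earth (same length,
# so the draft stream and the target stay aligned in this toy).
target = "hello earth"
accept_fn = lambda prefix, tok: len(prefix) < len(target) and target[len(prefix)] == tok
generate_fn = lambda prefix: target[len(prefix)]

result = "".join(speculative_edit(list("hello world"), accept_fn, generate_fn))
assert result == "hello earth"
```

Most draft tokens are accepted for free; only the edited span costs real decode steps, which is why this is fast for edits.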
Okay, so a mixture of experts: the input is huge, the output is small.

Yeah.
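(A sparse mixture-of-experts layer, as mentioned above, routes each token to a small subset of "expert" subnetworks rather than running the whole network. Here is a minimal top-1-routing sketch in NumPy, with made-up sizes; it is an illustration of the general technique, not Cursor's architecture.)

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model = 4, 8                                    # made-up sizes
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts))               # routing projection

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Top-1 sparse MoE: each token is routed to ONE expert, so per-token
    compute stays flat while total parameters grow with n_experts -- a good
    fit for prefill-heavy workloads with many input tokens."""
    choice = (x @ router).argmax(axis=-1)     # one expert index per token
    out = np.empty_like(x)
    for e in range(n_experts):
        mask = choice == e
        if mask.any():
            out[mask] = x[mask] @ experts[e]  # only the chosen expert runs
    return out

tokens = rng.normal(size=(16, d_model))  # a long prompt, few output tokens
assert moe_layer(tokens).shape == tokens.shape
```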
Okay, so what else can you say about how to make it fast? Does caching play a role in this?

Caching plays a huge role. Because you're dealing with this many input tokens, if on every single keystroke you're typing in a given line you had to rerun the model on all those tokens passed in, you're going to, one, significantly degrade latency, and two, kill your GPUs with load. So you need to design the actual prompts used for the model such that they're caching-aware, and then you need to reuse the KV cache across requests, so that you're spending less work, less compute.
uh again what are the things that tab is
supposed to be able to do kind of in the
near term just to like sort of Linger on
that generate code like fill empty
space Also edit code across multiple
lines yeah and then jump to different
locations inside the same file yeah and
then like hopefully jump to different
files also so if you make an edit in one
file and maybe maybe you have to go
maybe you have to go to another file to
finish your thought it should it should
go to the second file also yeah and then
the full generalization is like next
next action prediction like sometimes
you need to run a command in the
terminal and it should be able to
suggest the command based on the code
that you wrote too um or sometimes you
actually need to like it suggest
something but you you it's hard for you
to know if it's correct because you
actually need some more information to
learn like you need to know the type to
be able to verify that it's correct and
so maybe it should actually take you to
a place that's like the definition of
something and then take you back so that
you have all the requisite knowledge to
be able to accept the next completion Al
also providing the human the knowledge
yes right yeah can you integrate like I
just uh gotten to know a guy named Prime
Jen who I believe has an SS you can
order coffee via SSH
oh yeah oh we did that we did that uh so
can that also the model do that like
feed you and like yeah and provide you
with caffeine okay so that's the general
framework yeah and the the magic moment
would be
if it is programming is this weird
discipline where um sometimes the next
five minutes not always but sometimes
the next five minutes of what you're
going to do is actually predictable from
the stuff you've done recently and so
can you get to a world where that next 5
minutes either happens by you
disengaging and it taking you through or
maybe a little bit more of just you
seeing Next Step what it's going to do
and you're like okay that's good that's
good that's good that's good and you can
just sort of tap tap tap through these
big changes as we're talking about this
I should mention like one of the really
cool and noticeable things about cursor
is that there's this whole diff
interface situation going on so like the
model suggests with uh with the red and
the green of like here's how we're going
to modify the code and in the chat
window you can apply and it shows you
the diff and you can accept the diff so
maybe can you speak to whatever
direction of that we'll probably have
like four or five different kinds of
diffs uh so we we have optimized the
diff for for the autocomplete so that
has a different diff interface
than uh then when you're reviewing
larger blocks of code and then we're
trying to optimize uh another diff thing
for when you're doing multiple different
files uh and and sort of at a high level
the difference is for
when you're doing autocomplete it should
be really really fast to
read uh actually it should be really
fast to read in all situations but in
autocomplete it sort of you're you're
really like your eyes focused in one
area you you can't be in too many you
the humans can't look in too many
different places so you're talking about
on the interface side like on the
interface side so it currently has this
box on the side so we have the current
box and if it tries to delete code in
some place and tries to add other code
it tries to show you a box on the you
can maybe show it if we pull it up on
cursor. comom this is what we're talking
about so that it was like three or four
different attempts at trying to make
this this thing work where first the
attempt was like these blue crossed out
line so before it was a box on the side
it used to show you the code to delete
by showing you like uh like Google doc
style you would see like a line through
it then you would see the the new code
that was super distracting and then we
tried many different you know there was
there was sort of deletions there was
trying to Red highlight then the next
iteration of it which is sort of funny
Would you would hold the on Mac the
option button so it would it would sort
of highlight a region of code to show
you that there might be something coming
uh so maybe in this example like the
input and the value uh would get would
all get blue and the blue would to
highlight that the AI had a suggestion
for you uh so instead of directly
showing you the thing it would show you
that the AI it would just hint that the
AI had a suggestion and if you really
wanted to see it you would hold the
option button and then you would see the
new suggestion then if you release the
option button you would then see your
original code mhm so that's by the way
that's pretty nice but you have to know
to hold the option button yeah I by the
way I'm not a Mac User but I got it it
was it was it's a button I guess you
people
it's h you know it's again it's just
it's just nonintuitive I think that's
the that's the key thing and there's a
chance this this is also not the final
version of it I am personally very
excited for
um making a lot of improvements in this
area like uh we we often talk about it
as the verification problem where U
these diffs are great for small edits uh
for large edits or like when it's
multiple files or something it's um
actually
a little bit prohibitive to to review
these diffs and uh uh so there are like
a couple of different ideas here like
one idea that we have is okay you know
like parts of the diffs are important
they have a lot of information and then
parts of the diff um are just very low
entropy they're like exam like the same
thing over and over again and so maybe
you can highlight the important pieces
and then gray out the the not so
important pieces or maybe you can have a
model that uh looks at the the diff and
and sees oh there's a likely bug here I
will like Mark this with a little red
squiggly and say like you should
probably like review this part of the
diff um and ideas in in that vein I
think are exciting yeah that's a really
fascinating space of like ux design
engineering so you're basically trying
to guide the human programmer through
all the things they need to read and
nothing more yeah like optimally yeah
and you want an intelligent model to do
it like ly diffs Al diff algorithms are
they're like Al like they're just like
normal algorithms uh there's no
intelligence uh there's like
intelligence that went into designing
the algorithm but then there there's no
like you don't care if the if it's about
this thing or this thing uh and so you
want a model to to do this so I think
the the the general question is like M
these models are going to get much
smarter as the models get much smarter
uh the the changes they will be able to
propose are much bigger so as the
changes gets bigger and bigger and
bigger the humans have to do more and
more and more verification work it gets
more and more more hard like it's just
you need you need to help them out it
sort of I I don't want to spend all my
time reviewing
code uh can you say a little more across
multiple files div yeah I mean so GitHub
tries to solve this right with code
review when you're doing code review
you're reviewing multiple deaths cross
multiple files but like Arvid said
earlier I think you can do much better
than code review you know code review
kind of sucks like you spend a lot of
time trying to grock this code that's
often quite unfamiliar to you and it
often like doesn't even actually catch
that many bugs and I think you can
signific significantly improve that
review experience using language models
for example using the kinds of tricks
that AR had described of maybe uh
pointing you towards the regions that
matter
um I think also if the code is produced
by these language models uh and it's not
produced by someone else like the code
review experience is designed for both
the reviewer and the person that
produced the code in the case where the
person that produced the code is a
language model you don't have to care
that much about their experience and you
can design the entire thing around the
reviewer such that the reviewer's job is
as fun as easy as productive as possible
um and I think that that feels like the
issue with just kind of naively trying
to make these things look like code
review I think you can be a lot more
creative and and push the boundary and
what's possible just one one idea there
is I think ordering matters generally
when you review a PR you you have this
list of files and you're reviewing them
from top to bottom but actually like you
actually want to understand this part
first because that came like logically
first and then you want understand the
next part and um you don't want to have
to figure out that yourself you want a
model to guide you through the thing and
is the step of creation going to be more
and more natural language is the goal
versus with actual uh I think sometimes
I don't think it's going to be the case
that all of programming will be natural
language and the reason for that is you
know if I'm PR programming with swalla
and swall is at the computer and the
keyboard uh and sometimes if I'm like
driving I want to say to swallet hey
like implement this function and that
that works and then sometimes it's just
so annoying to explain to swalla what I
want him to do and so I actually take
over the keyboard and I show him I I
write like part of the example and then
it makes sense and that's the easiest
way to communicate and so I think that's
also the case for AI like sometimes the
easiest way to communicate with the AI
will be to show an example and then it
goes and does the thing everywhere else
or sometimes if you're making a website
for example the easiest way to show to
the a what you want is not to tell it
what to do but you know drag things
around or draw things um and yeah and
and like maybe eventually we will get to
like brain machine interfaces or
whatever and can of like understand what
you're thinking and so I think natural
language will have a place I think it
will not definitely not be the way most
people program most of the time I'm
really feeling the AGI with this editor
uh it feels like there's a lot of
machine learning going on underneath
tell tell me about some of the ml stuff
that makes it all work recursor really
works via this Ensemble of custom models
that that that we've trained alongside
you know the frontier models that are
fantastic at the reasoning intense
things and so cursor tab for example is
is a great example of where you can
specialize this model to be even better
than even Frontier models if you look at
evls on on the on the task we set it at
the other domain which it's kind of
surprising that it requires custom
models but but it's kind of necessary
and works quite well is in apply
um
so I think these models are like the
frontier models are quite good at
sketching out plans for code and
generating like rough sketches of like
the change but
actually creating diffs is quite hard um
for Frontier models for your training
models um like you try to do this with
Sonnet with o1 any frontier model and it
it really messes up stupid things like
counting line numbers um especially in
super super large files
um and so what we've done to alleviate
this is we let the model kind of sketch
out this rough code block that indicates
what the change will be and we train a
model to then apply that change to the
file and we should say that apply is the
model looks at your code it gives you a
really damn good suggestion of what new
things to do and the seemingly for
humans trivial step of combining the two
you're saying is not so trivial contrary
to popular perception it is not a
deterministic algorithm yeah I I I think
like you see shallow copies of apply um
elsewhere and it just breaks like most
of the time because you think you can
kind of try to do some deterministic
matching and then it fails you know at
least 40% of the time and that just
results in a terrible product
experience um I think in general this
this regime of you are going to get
smarter models and like so one other
thing that apply lets you do is it lets
you use fewer tokens with the most
intelligent models uh this is both
expensive in terms of latency for
generating all these tokens um and cost
so you can give this very very rough
sketch and then have your smaller models
go and implement it because it's a much
easier task to implement this very very
sketched out code and I think that this
this regime will continue where you can
use smarter and smarter models to do the
planning and then maybe the
implementation details uh can be handled
by the less intelligent ones perhaps
you'll have you know maybe o1 maybe
it'll be even more capable models
given an even higher level plan that is
kind of recursively uh applied by Sonnet
and then the apply model maybe we should
we should talk about how to how to make
it fast yeah I feel like fast is always
an interesting detail fast good yeah how
do you make it fast yeah so one big
component of making it it fast is
speculative edits so speculative edits
are a variant of speculative decoding
and maybe it'd be helpful to briefly describe
speculative decoding um with speculative
decoding what you do is you you can kind
of take advantage of the fact that you
know most of the time and I I'll add the
caveat that it would be when you're
memory bound in in language model
generation um if you process multiple
tokens at once um it is faster than
generating one token at a time so this is
like the same reason why if you look at
tokens per second uh with prompt tokens
versus generated tokens it's much much
faster for prompt tokens um so what we
do is instead of using what speculative
decoding normally does which is using a
really small model to predict these
draft tokens that your larger model
would then go in and and verify um with
code edits we have a very strong prior
of what the existing code will look like
and that prior is literally the same
exact code so what you can do is you can just
feed chunks of the original code back
into the into the model um and then the
model will just pretty much agree most
of the time that okay I'm just going to
spit this code back out and so you can
process all of those lines in parallel
and you just do this with sufficiently
many chunks and then eventually you'll
reach a point of disagreement where the
model will now predict text that is
different from the ground truth original
code it'll generate those tokens and
then we kind of will decide after enough
tokens match
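The speculate-and-verify loop being described can be sketched at the line level (a simplified toy, not the real token-level decoder; `speculative_edit` and its resync heuristic are invented for illustration, and the point is the accounting of cheap parallel verification versus expensive sequential generation):

```python
def speculative_edit(original: list[str], target: list[str]) -> dict:
    """Line-level toy of speculative edits: speculate that the model's
    output equals the original file. Agreeing lines are verified in one
    cheap parallel pass; each disagreeing line costs a sequential
    generation step, after which we resume speculating."""
    verified = sequential = 0
    i = j = 0
    while j < len(target):
        if i < len(original) and original[i] == target[j]:
            verified += 1    # speculation verified (parallel, cheap)
            i += 1
            j += 1
        else:
            sequential += 1  # disagreement: model generates this line
            j += 1
            # resync: skip past original lines the edit removed, if any
            while i < len(original) and original[i] not in target[j:]:
                i += 1
    return {"verified_parallel": verified, "generated_sequential": sequential}

before = ["def f(x):", "    y = x + 1", "    return y"]
after = ["def f(x):", "    y = x + 2", "    return y"]
print(speculative_edit(before, after))
# only the one changed line costs a sequential step
```

Because most lines in a typical edit are unchanged, nearly all of the output is accepted through verification, which is why the result looks like a much faster version of the model rewriting the whole file.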
uh the original code to restart
speculating in chunks of code what this
actually ends up looking like is just a
much faster version of normal editing
code so it's just like it looks like a
much faster version of the model
rewriting all the code so just we we can
use the same exact interface that we use
for for diffs but it will just stream
down a lot faster and then and then the
advantage is that while it's streaming
you can just also be reviewing start
reviewing the code exactly before before
it's done so there's no no big loading
screen uh so maybe that that is part of
the part of the advantage so the human
can start reading before the thing is
done I think the interesting riff here
is something like like speculation is a
fairly common idea nowadays it's like
not only in language models I mean
there's obviously speculation in CPUs
and there's there like speculation for
databases and like speculation all over
the place let me ask the sort of the
ridiculous question of uh which llm is
better at coding GPT Claude who wins in
the context of programming and I'm sure
the answer is much more Nuance because
it sounds like every single part of this
involves a different
model yeah I think they there's no model
that Pareto dominates uh others meaning it
is better in all categories that we
think matter the categories being
speed
um ability to edit code ability to
process lots of code long context you
know a couple of other things and kind
of coding
capabilities the one that I'd say right
now is just kind of net best is Sonnet I
think this is a consensus opinion o1
is really interesting and it's really
good at reasoning so if you give it
really hard uh programming interview
style problems or lead code problems it
can do quite quite well on them um but
it doesn't feel like it kind of
understands your rough intent as well as
Sonnet
does like if you look at a lot of the
other frontier models um one qualm I have
is it feels like they're not necessarily
overfit I'm not saying they they train on
benchmarks um but they perform really
well in benchmarks relative to kind of
everything that's kind of in the middle
so if you tried on all these benchmarks
and things that are in the distribution
of the benchmarks they're evaluated on
you know they'll do really well but when
you push them a little bit outside of
that Sonnet's I think the one that
kind of does best at at kind of
maintaining that same capability like
you kind of have the same capability in
The Benchmark as when you try to
instruct it to do anything with coding
what another ridiculous question is the
difference between the normal
programming experience versus what
benchmarks represent like where do
benchmarks fall short do you think when
we're evaluating these models by the way
that's like a really really hard it's
like like critically important detail
like how how different like benchmarks
are versus where is like real coding
where real
coding it's not interview style coding
it's you're you're doing these you know
humans are saying like half broken
English sometimes and sometimes you're
saying like oh do what I did
before sometimes you're saying uh you
know go add this thing and then do this
other thing for me and then make this UI
element and then you know it's it's just
like a lot of things are sort of context
dependent
you really want to like understand the
human and then do do what the human
wants as opposed to sort of this maybe
the the way to put it is sort of
abstractly is uh the interview problems
are
very well
specified they lean a lot on
specification while the human stuff is
less
specified yeah I think that this sort of
question is both complicated by what
um Sualeh just mentioned and then also to
what Aman was getting into is that even
if you like you know there's this
problem of like the skew between what
can you actually model in a benchmark
versus uh real programming and that can
be sometimes hard to encapsulate because
it's like real programming is like very
messy and sometimes things aren't super
well specified what's correct or what
isn't but then uh it's also doubly hard
because of this public Benchmark problem
and that's both because public
benchmarks are sometimes kind of Hill
climbed on then it's like really really
hard to also get the data from the
public benchmarks out of the models and
so for instance like one of the most
popular like agent benchmarks SWE-bench
um is really really contaminated
in the training data of uh these
Foundation models and so if you ask
these foundation models to do a SWE-bench
problem you actually don't give
them the context of a codebase they can
like hallucinate the right file paths
they can hallucinate the right function
names um and so the the it's it's also
just the public aspect of these things
is tricky yeah like in that case it
could be trained on the literal issues
or pull requests themselves and and maybe
the labs will start to do a better job
um or they've already done a good job at
decontaminating those things but they're
not going to emit the actual training
data of the repository itself like these
are all like some of the most popular
python repositories like SymPy is one
example I don't think they're going to
handicap their models on SymPy and all
these popular Python repositories in
order to get uh true evaluation scores
in these benchmarks yeah I think that
given the dearth in benchmarks
um there have been like a few
interesting crutches that uh places that
build systems with these models or build
these models actually use to get a sense
of are they going in the right direction
or not and uh in a lot of places uh
people will actually just have humans
play with the things and give
qualitative feedback on these um like
one or two of the foundation model
companies they they have people who
that's that's a big part of their role
and you know internally we also uh you
know qualitatively assess these models
and actually lean on that a lot in
addition to like private evals that we
have it's like the vibe
the vibe yeah the vibe benchmark
human Benchmark the hum you pull in the
humans to do a Vibe check yeah okay I
mean that's that's kind of what I do
like just like reading online forums and
Reddit and X just like well I don't know
how
to properly load in people's opinions
because they'll say things like I feel
like Claude or gpt's gotten Dumber or
something they'll say I feel like
and then I sometimes feel like that too
but I wonder if it's the model's problem
or mine yeah with Claude there's an
interesting take I heard where I think
AWS has different chips um and I I
suspect they have slightly different
numerics than uh Nvidia gpus and someone
speculated that claud's deg degraded
performance had to do with maybe using
the quantized version that existed on AWS
Bedrock versus uh whatever was running
on on Anthropic's GPUs I interview a
bunch of people that have conspiracy
theories so I'm glad you spoke to this
conspiracy well it's it's not not like
conspiracy theory as much as they're
just they're like they're you know
humans humans are humans and there's
there's these details and you know
you're
doing like this crazy amount of flops
and you know chips are messy and man you
can just have bugs like bugs are it's
it's hard to overstate how how hard bugs
are to avoid what's uh the role of a
good prompt in all this see you mention
that benchmarks have
really uh structured well formulated
prompts what what should a human be
doing to maximize success and what's the
importance of what the humans you wrote
a blog post on you called it prompt
design yeah uh I think it depends on
which model you're using and all of them
are slightly different and they respond
differently to different prompts but um
I think the original GPT-4 uh and the
original sort of breed of models last
year they were quite sensitive to the
prompts and they also had a very small
context window and so we have all of
these pieces of information around the
codebase that would maybe be relevant in
the prompt like you have the docs you
have the files that you add you have the
conversation history and then there's a
problem like how do you decide what you
actually put in the prompt and when you
have a a limited space and even for
today's models even when you have long
context filling out the entire context
window means that it's slower it means
that sometimes a model actually gets
confused and some models get more
confused than others and we have this
one system internally that we call priompt
which helps us with that a little bit um
and I think it was built for the era
before where we had
8,000 uh token context Windows uh and
it's a little bit similar to when you're
making a website you you sort of you you
want it to work on mobile you want it to
work on a desktop screen and you have
this uh Dynamic information which you
don't have for example if you're making
like designing a print magazine you have
like you know exactly where you can put
stuff but when you have a website or
when you have a prompt you have these
inputs and then you need to format them
to always work even if the input is
really big then you might have to cut
something down uh and and and so the
idea was okay like let's take some
inspiration what's the best way to
design websites well um the thing that
we really like is is react and the
declarative approach where you um you
use jsx in in in JavaScript uh and then
you declare this is what I want and I
think this has higher priority or like
this has higher Z index than something
else um and
then you have this rendering engine in
web design it's it's like Chrome and uh
in our case it's a pre renderer uh which
then fits everything onto the page and
and so you declaratively decide what you
want and then it figures out what you
want um and and so we have found that to
be uh quite helpful and I think the role
of it has has sort of shifted over time
um where initially was to fit to these
small context Windows now it's really
useful because you know it helps us with
splitting up the data that goes into the
prompt and the actual rendering of it
and so um it's easier to debug because
you can change the rendering of the
prompt and then try it on Old prompts
because you have the raw data that went
into the prompt and then you can see did
my change actually improve it for for
like this entire eval set so do you
literally prompt with jsx yes yes so it
kind of looks like react there are
components like we have one component
that's a file component and it takes in
like the cursor
like usually there's like one line where
the cursor is in your file and that's
like probably the most important line
because that's the one you're looking at
and so then you can give priorities so
like that line has the highest priority
and then you subtract one for every line
that uh is farther away and then
eventually when it's rendered it figures
out how many lines can I actually fit
and it centers around that thing that's
amazing yeah and you can do like other
fancy things where if you have lots of
code blocks from the entire code base
you could use uh retrieval um and things
like embedding and reranking scores to
add priorities for each of these
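The cursor-centered priority scheme just described can be sketched in a few lines (a toy, not the actual Priompt renderer; `render_file` and its scoring are invented for illustration): the cursor line gets the highest priority, priority decays by one per line of distance, and the renderer keeps the highest-priority lines that fit the budget, emitted in file order.

```python
def render_file(lines: list[str], cursor_line: int, budget: int) -> list[str]:
    """Priority-based prompt rendering for one file component.
    The cursor line has top priority; priority decays by one per line
    of distance. Keep the `budget` highest-priority lines, emitted in
    their original order, so the result centers on the cursor."""
    prioritized = [(-abs(i - cursor_line), i) for i in range(len(lines))]
    kept = sorted(sorted(prioritized, reverse=True)[:budget], key=lambda t: t[1])
    return [lines[i] for _, i in kept]

src = [f"line {i}" for i in range(100)]
print(render_file(src, cursor_line=50, budget=5))
# ['line 48', 'line 49', 'line 50', 'line 51', 'line 52']
```

A fuller renderer would fold retrieval and reranking scores into the same priority space, as mentioned above, so distant but relevant code blocks can outrank nearby boilerplate.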
components so should humans when they
ask questions also use try to use
something like that like would it be
beneficial to write jsx in the in the
problem where the whole idea is should
be loose and messy I I think our goal is
kind of that you should just uh do
whatever is the most natural thing for
you and then our job is to figure out
how do we actually like retrieve the
relevant things so that your thing
actually makes sense well this is sort
of the discussion I had with uh Aravind of
perplexity is like his whole idea is
like you should let the person be as
lazy as they want but like yeah that's a
beautiful thing but I feel like you're
allowed to ask more of programmers right
so like if you say just do what you want
I mean humans are lazy there's a kind of
tension between just being lazy versus
like providing more uh being prompted
almost like the system
pressuring you or inspiring you to be
articulate not in terms of the grammar
of the sentences but in terms of the
depth of thoughts that you convey inside
the uh the problems I think even as a
system gets closer to some level of
perfection often when you ask the model
for something you just are not not
enough intent is conveyed to know what
to do and there are like a few ways to
resolve that intent one is the simple
thing of having model just ask you I'm
not sure how to do these parts based on
your query could you clarify that um I
think the other could be
maybe if there are five or six
possible Generations given the
uncertainty present in your query so far
why don't we just actually show you all
of those and let you pick
them how hard is it to for the model to
choose to speak talk back sort of versus
generate that's a that's hard sort of like
how to deal with the
uncertainty do I do I choose to ask for
more information to reduce the ambiguity
so I mean one of the things we we do is
um it's like a recent addition is try to
suggest files that you can add so and
while you're typing uh one can guess
what the uncertainty is and maybe
suggest that like you know maybe maybe
you're writing your API
and uh we can guess using the
commits uh that you've made previously
in the same file that the client and the
server is super useful and uh there's
like a hard technical problem of how do
you resolve it across all commits which
files are the most important given your
current prompt and we still sort of uh
initial version is rolled out and I'm
sure we can make it much more
accurate uh it's it's it's very
experimental but then the idea is we show
you like do you just want to add this
file this file this file also to tell
you know the model to edit those files
for you uh because if if you're maybe
you're making the API like you should
also edit the client and the server that
is using the API and the other one
resolving the API and so that would be
kind of cool as both there's the phase
where you're writing the prompt and
there's before you even click enter
maybe we can help resolve some of the
uncertainty to what degree do you use uh
agentic approaches how useful are agents
we think agents are really really cool
like I I I think agents is like uh it's
like resembles sort of like a human it's
sort of like the like you can kind of
feel that it like you're getting closer
to AGI because you see a demo where um
it acts as as a human would a