Cursor Team: Future of Programming with AI | Lex Fridman Podcast #447
oFfVt3S51T4 • 2024-10-06
The following is a conversation with the founding members of the Cursor team: Michael Truell, Sualeh Asif, Arvid Lunnemark, and Aman Sanger. Cursor is a code editor based on VS Code that adds a lot of powerful features for AI-assisted coding. It has captivated the attention and excitement of the programming and AI communities, so I thought this an excellent opportunity to dive deep into the role of AI in programming. This is a super technical conversation that is bigger than just one code editor: it's about the future of programming, and in general the future of human-AI collaboration in designing and engineering complicated and powerful systems. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Michael, Sualeh, Arvid, and Aman.

All right, this is awesome. We have Michael, Aman, Sualeh, and Arvid here from the Cursor team. First up, big ridiculous question: what's the point of a code editor?

So the code editor is largely the place where you build software, and today, or for a long time, that's meant the place where you text-edit a formal programming language. For people who aren't programmers, the way to think of a code editor is as a really souped-up word processor for programmers. The reason it's souped up is that code has a lot of structure, and so the quote-unquote word processor, the code editor, can actually do a lot for you that word processors in the writing space haven't been able to do for people editing text. That's everything from giving you visual differentiation of the actual tokens in the code so you can scan it quickly, to letting you navigate around the code base sort of like you're navigating around the internet with hyperlinks, going to definitions of things you're using, to error checking to catch rudimentary bugs. So traditionally, that's what a code editor has meant, and I
think that what a code editor is is going to change a lot over the next 10 years, as what it means to build software starts to look a bit different.

I think a code editor should also just be fun.

Yes, that is very important. That is very important, and it's actually an underrated aspect of how we decide what to build. A lot of the things that we build, we try out, we do an experiment, and then we actually throw them out because they're not fun. A big part of being fun is being fast a lot of the time. Fast is fun.

Yeah, fast is... yeah, that should be a t-shirt. Fundamentally, I think one of the things that draws a lot of people to building stuff on computers is this insane iteration speed, where in other disciplines you might be gated by resources, or by the ability even to get a large group together, and coding is this amazing thing where it's you and the computer, and with that alone you can build really cool stuff really quickly.

So for people who don't know, Cursor is this super cool new editor that's a fork of VS Code. It would be interesting to get your explanation of your own journey of editors. I think all of you were big fans of VS Code with Copilot. How did you arrive at VS Code, and how did that lead to your journey with Cursor?

Yeah, so I think a lot of us, well, all of us were originally Vim users.

Pure Vim.

Pure Vim, yeah. No Neovim, just pure Vim in a terminal. And at least for myself, it was around the time that Copilot came out, so 2021, that I really wanted to try it. So I went into VS Code, the only platform, the only code editor, in which it was available, and even though I really enjoyed using Vim, just the experience of Copilot with VS Code was more than good enough to convince me to switch. And so that kind of was the default until we started working on Cursor.

And maybe we should explain what Copilot does. It's like a
really nice autocomplete. As you start writing a thing, it suggests one or two or three lines of how to complete it. And there's a fun experience in that, you know, like when you have a close friendship and your friend completes your sentences: when it's done well, there's an intimate feeling. There's probably a better word than intimate, but there's a cool feeling of, holy, it gets me. And then there's an unpleasant feeling when it doesn't get you. So there's that kind of friction, but I would say for a lot of people, the feeling that it gets me overpowers that it doesn't.

And I think actually one of the underrated aspects of GitHub Copilot is that even when it's wrong, it's a little bit annoying, but it's not that bad, because you just type another character and then maybe it gets you, or you type another character and then it gets you. So even when it's wrong, it's not that bad.

Yeah, you can sort of iterate and fix it. I mean, the other underrated part of Copilot for me was that it was just the first real AI product, the first language-model consumer product.

So Copilot was kind of the first killer app for LLMs.

Yeah, and the beta was out in 2021, right?

Okay. So what's the origin story of Cursor?

So around 2020, the scaling laws papers came out from OpenAI, and that was a moment where this looked like clear, predictable progress for the field, where even if we didn't have any more ideas, it looked like you could make these models a lot better if you had more compute and more data.

By the way, we'll probably talk for three to four hours on the topic of scaling laws, but just to summarize: it's a paper, and a set of papers and ideas, that say bigger might be better for model size and data size in the realm of machine learning.

It's bigger and better, but predictably better.

Okay, that's another topic of conversation, but anyway.

Yeah, so around that time, for some of us,
there were a lot of conceptual conversations about what this is going to look like, what the story is going to be for all these different knowledge-worker fields, about how they're going to be made better by this technology getting better. And then I think there were a couple of moments where the theoretical gains predicted in that paper started to feel really concrete, and it started to feel like a moment where you could actually go and, without doing a PhD, do useful work in AI. It actually felt like now there was this whole set of systems one could build that were really useful.

I think the first moment we already talked about a little bit, which was playing with the early beta of Copilot: that was awesome and magical. I think the next big moment, where everything kind of clicked together, was actually getting early access to GPT-4. So around the end of 2022 was when we were tinkering with that model, and the step up in capabilities felt enormous. Previous to that, we had been working on a couple of different projects. Because of Copilot, because of scaling laws, because of our prior interest in the technology, we had been tinkering around with tools for programmers, but things that are very specific. So we were building tools for financial professionals who have to work within a Jupyter notebook, or playing around with, can you do static analysis with these models? And then the step up with GPT-4 felt like, look, that really made concrete the theoretical gains we had predicted before. It felt like you could build a lot more just immediately at that point in time. And also, if we were being consistent, it really felt like this wasn't just going to be a point-solution thing: all of programming was going to flow through these models. It felt like that demanded a different type of programming environment, a different type of programming, and so we
set off to build that larger vision around then.

There's one moment I distinctly remember. My roommate is an IMO gold medalist, and there's a competition in the US called the Putnam, which is sort of the IMO for college students; it's this math competition, and he's exceptionally good. Shengtong and Aman, I remember, around June of 2022, had this bet on whether by June or July of 2024 a model would win a gold medal in the IMO.

IMO is the International Mathematical Olympiad.

Yeah, IMO is the International Mathematical Olympiad. And so Arvid and I had both, you know, also competed in it, so it was sort of personal. And I remember thinking, man, this is just not going to happen. Even though I sort of believed in progress, I thought, IMO gold, Aman is just delusional. And to be honest, I mean, I was, to be clear, very wrong, but that was maybe the most prescient bet in the group.

So with the new results from DeepMind, it turned out that you were correct.

Well, technically not. Technically incorrect, but one point away.

Aman was very enthusiastic about this stuff back then. Before, Aman had this scaling-laws t-shirt that he would walk around with, where it had the charts and the formulas on it.

Oh, so you, like, felt the AGI, or you felt the scaling?

Yeah. I distinctly remember there was this one conversation I had with Michael where, before, I hadn't thought super deeply and critically about scaling laws, and he kind of posed the question: why isn't scaling all you need, or why isn't scaling going to result in massive gains in progress? And I think I went through the stages of grief: there is anger, denial, and then finally, at the end, just thinking about it, acceptance. And I think I've been quite hopeful and optimistic about progress since. I think one thing I'll caveat is that I think it also depends on which domains you're going to see
progress in. Math is a great domain, especially formal theorem proving, because you get this fantastic signal of actually verifying if the thing was correct, and so this means something like RL can work really, really well. And I think you could have systems that are perhaps very superhuman in math and still not technically have AGI.

Okay, so can we take it all the way to Cursor? And what is Cursor? It's a fork of VS Code, and VS Code has been one of the most popular editors for a long time. Everybody fell in love with it, everybody left Vim, I left Emacs for it, sorry. So it unified in some fundamental way the developer community. And then you look at the space of things, you look at the scaling laws, AI is becoming amazing, and you decided: okay, it's not enough to just write an extension for VS Code, because there are a lot of limitations to that. If AI is going to keep getting better and better and better, we need to really rethink how the AI is going to be part of the editing process. And so you decided to fork VS Code and start to build a lot of the amazing features we'll be able to talk about. But what was that decision like? Because there are a lot of extensions, including Copilot, for VS Code that are doing AI-type stuff. What was the decision like to just fork VS Code?

So the decision to do an editor seemed kind of self-evident to us, for at least what we wanted to do and achieve. Because when we started working on the editor, the idea was: these models are going to get much better, their capabilities are going to improve, and it's going to entirely change how you build software, both in that you will have big productivity gains, but also radically, in that the act of building software is going to change a lot. And so you're very limited in the control you have over a code editor if you're a plugin to an existing coding environment, and we didn't want to get locked in by those limitations. We wanted to be able to
just build the most useful stuff.

Okay, well then the natural question is, you know, VS Code with Copilot is kind of a competitor, so how do you win? Is it basically just the speed and the quality of the features?

Yeah, I mean, I think this is a space that is quite interesting, perhaps quite unique, where if you look at previous tech waves, maybe there's kind of one major thing that happened and it unlocked a new wave of companies. But here, every single year, every single jump in model capabilities, you now unlock this new wave of features, things that are possible, especially in programming. And so I think in AI programming, being even just a few months ahead, let alone a year ahead, makes your product much, much, much more useful. I think the Cursor a year from now will need to make the Cursor of today look obsolete. And I think, you know, Microsoft has done a number of fantastic things, but I don't think they're in a great place to really keep innovating and pushing on this in the way that a startup can.

Just rapidly implementing features.

Yeah, and kind of doing the research experimentation necessary to really push the ceiling.

I don't know if I think of it in terms of features so much as I think of it in terms of capabilities for programmers. As, you know, the new o1 model came out, and I'm sure there are going to be more models of different types, like longer context and maybe faster, there are all these crazy ideas that you can try, and hopefully 10% of the crazy ideas will make it into something kind of cool and useful, and we want people to have that sooner.

To rephrase: an underrated fact is, we're making it for ourselves. When we started Cursor, you really felt this frustration that, you know, you could see models getting better, but the Copilot experience had not changed. It was like, man, these guys, the ceiling is getting higher, why are they not making new things? They
should be making new things. Where's all the alpha features? There were no alpha features. I'm sure it was selling well, I'm sure it was a great business, but it didn't feel... I'm one of these people that really wants to try and use new things, and there was just no new thing for a very long while.

Yeah, it's interesting. I don't know how you put that into words, but when you compare Cursor with Copilot, Copilot pretty quickly started to feel stale for some reason.

Yeah, I think one thing that helps us is that we're sort of doing it all in one, where we're developing the UX and the way you interact with the model at the same time as we're developing how we actually make the model give better answers. So, how you build up the prompt, or how you find the context, and for Cursor Tab, how you train the model. I think that helps us to have the same people working on the entire experience, end to end.

Yeah, it's like the person making the UI and the person training the model sit, like, 18 feet away.

Often the same person, even.

Yeah, often even the same person. So you can create things that are sort of not possible if you're not talking, not experimenting.

And you're using, like you said, Cursor to write Cursor?

Of course, oh yeah.

Well, let's talk about some of these features. Let's talk about the all-knowing, the all-powerful, praise be to the Tab: the autocomplete on steroids, basically. How does Tab work? What is Tab?

To highlight and summarize it at a high level, I'd say that there are two things that Cursor is pretty good at right now. There are other things that it does, but two things it helps programmers with. One is this idea of looking over your shoulder and being like a really fast colleague who can kind of jump ahead of you and type and figure
out what you're going to do next. And that was the original idea, that was kind of the kernel of the idea behind a good autocomplete: predicting what you're going to do next. You can make that concept even more ambitious by not just predicting the characters after your cursor, but actually predicting the next entire change you're going to make, the next diff, the next place you're going to jump to. And the second thing Cursor is pretty good at right now is helping you sometimes jump ahead of the AI and tell it what to do, going from instructions to code. On both of those, we've done a lot of work on making the editing experience for those things ergonomic, and also making those things smart and fast.

One of the things we really wanted was for the model to be able to edit code for us. That was kind of a wish, and we had multiple attempts at it before we had a good model that could edit code for you. Then, after we had a good model, there has been a lot of effort to make the inference fast, to have a good experience. And we've been starting to incorporate, I mean, Michael sort of mentioned this, the ability to jump to different places. That jump to different places, I think, came from a feeling of: once you accept an edit, man, it should be really obvious where to go next. It's like, I made this change, the model should just know that the next place to go to is, like, 18 lines down. If you're a Vim user, you could press 18jj or whatever, but why am I even doing this? The model should just know it. So the idea was: you just press Tab, it would go 18 lines down, show you the next edit, and you would press Tab again. As long as you could keep pressing Tab. And so the internal competition was: how many tabs can we make them press?

Once you have the idea, the more abstract thing to think about is the sense in which the edits are zero-entropy. Once you've expressed your intent and the edit is there, there are no new bits of information needed to finish your thought, but you still have to type some characters to make the computer understand what you're actually thinking. So maybe the model should just read your mind, and all the zero-entropy bits should just be tabbed away. That was sort of the abstract version.

There's this interesting thing where, if you look at language-model loss on different domains, I believe the bits per byte, which is a kind of character-normalized loss, is lower for code than for language, which means in general there are a lot of tokens in code that are super predictable, a lot of characters that are super predictable. And this is, I think, even magnified when you're not just trying to autocomplete code but predicting what the user is going to do next in their editing of existing code. And so the goal of Cursor Tab is: let's eliminate all the low-entropy actions you take inside the editor. When the intent is effectively determined, let's just jump you forward in time, skip you forward.

Well, what's the intuition, and what are the technical details, of how to do next-cursor prediction? That jump, that's not so intuitive, I think, to people.

Yeah, I think I can speak to a few of the details of how to make these things work. They're incredibly low-latency, so you need to train small models on this task. In particular, they're incredibly prefill-token hungry. What that means is, they have these really, really long prompts, where they see a lot of your code, but they're not actually generating that many tokens. And so the perfect fit for that is using a sparse model, meaning an MoE model. So that was one breakthrough we made that substantially improved its
performance at longer context. The other was a variant of speculative decoding that we built out, called speculative edits. These are two, I think, important pieces of what makes it quite high-quality and very fast.

Okay, so mixture of experts: the input is huge, the output is small.

Yeah.

Okay, so what else can you say about how to make it fast? Does caching play a role in this?

Caching plays a huge role. Because you're dealing with this many input tokens, if on every single keystroke that you're typing in a given line you had to rerun the model on all those tokens passed in, you're just going to, one, significantly degrade latency, and two, kill your GPUs with load. So you need to design the actual prompts used for the model such that they're caching-aware, and then you need to reuse the KV cache across requests, just so that you're spending less work, less compute.

Again, what are the things that Tab is supposed to be able to do in the near term, just to linger on that? Generate code, fill empty space, also edit code across multiple lines?

Yeah, and then jump to different locations inside the same file.

And then hopefully jump to different files also. So if you make an edit in one file, and maybe you have to go to another file to finish your thought, it should go to the second file also.

Yeah, and then the full generalization is next-action prediction. Like, sometimes you need to run a command in the terminal, and it should be able to suggest the command based on the code that you wrote, too. Or sometimes it suggests something, but it's hard for you to know if it's correct, because you actually need some more information. Like, you need to know the type to be able to verify that it's correct. And so maybe it should actually take you to a place that's, like, the definition of something, and then take you
back so that you have all the requisite knowledge to be able to accept the next completion.

Also providing the human the knowledge.

Yes, right. Yeah.

Can you integrate... I just got to know a guy named ThePrimeagen, who I believe has a thing where you can order coffee via SSH.

Oh yeah. We did that, we did that.

So can the model also do that? Feed you, and provide you with caffeine? Okay, so that's the general framework.

Yeah. And the magic moment would be... programming is this weird discipline where sometimes the next five minutes, not always, but sometimes the next five minutes of what you're going to do is actually predictable from the stuff you've done recently. And so can you get to a world where that next five minutes either happens by you disengaging and it taking you through, or maybe a little bit more of just you seeing each next step of what it's going to do, and you're like, okay, that's good, that's good, that's good, and you can just sort of tap, tap, tap through these big changes.

As we're talking about this, I should mention one of the really cool and noticeable things about Cursor is that there's this whole diff interface situation going on. The model suggests, with the red and the green, here's how we're going to modify the code, and in the chat window you can apply, and it shows you the diff, and you can accept the diff. So maybe can you speak to whatever direction of that?

We'll probably have four or five different kinds of diffs. We have optimized the diff for the autocomplete, so that has a different diff interface than when you're reviewing larger blocks of code, and then we're trying to optimize another diff thing for when you're doing multiple different files. At a high level, the difference is, when you're doing autocomplete, it should be really, really fast to read. Actually, it should be really fast to read in all situations, but in autocomplete, your eyes are focused in one area; the human can't look in too many different places.

You're talking about on the interface side?

On the interface side. So it currently has this box on the side. We have the current box, and if it tries to delete code in some place and tries to add other code, it shows you a box on the side.

You can maybe show it if we pull it up on cursor.com.

This is what we're talking about. So there were three or four different attempts at trying to make this thing work. First the attempt was these blue crossed-out lines. Before it was a box on the side, it used to show you the code to delete by showing you, Google Docs style, a line through it; then you would see the new code. That was super distracting. Then we tried many different things: there were deletions, there was trying to red-highlight. Then the next iteration, which is sort of funny: you would hold, on Mac, the Option button, and it would highlight a region of code to show you that there might be something coming. So maybe in this example, the input and the value would all get blue, and the blue would highlight that the AI had a suggestion for you. So instead of directly showing you the thing, it would just hint that the AI had a suggestion, and if you really wanted to see it, you would hold the Option button, and then you would see the new suggestion. Then if you released the Option button, you would see your original code.

So that's, by the way, pretty nice, but you have to know to hold the Option button.

Yeah. By the way, I'm not a Mac user, but I got it. It's a button, I guess, you people have.

You know, it's again just non-intuitive. I think that's the key thing.

And there's a chance this is also not the final version of it. I
am personally very excited for making a lot of improvements in this area. We often talk about it as the verification problem, where these diffs are great for small edits, but for large edits, or when it's multiple files or something, it's actually a little bit prohibitive to review these diffs. So there are a couple of different ideas here. One idea that we have is, okay, parts of the diff are important, they have a lot of information, and then parts of the diff are just very low-entropy, the same thing over and over again. So maybe you can highlight the important pieces and then gray out the not-so-important pieces. Or maybe you can have a model that looks at the diff and sees, oh, there's a likely bug here; I will mark this with a little red squiggly and say, you should probably review this part of the diff. Ideas in that vein, I think, are exciting.

Yeah, that's a really fascinating space of UX design engineering. So you're basically trying to guide the human programmer through all the things they need to read, and nothing more.

Yeah, optimally.

And you want an intelligent model to do it. Currently, diff algorithms are just normal algorithms. There's no intelligence. There's intelligence that went into designing the algorithm, but then there's no sense of whether it's about this thing or that thing, and so you want a model to do this. So I think the general question is: these models are going to get much smarter. As the models get much smarter, the changes they will be able to propose get much bigger, and as the changes get bigger and bigger, the humans have to do more and more verification work. It gets harder and harder. You need to help them out. I don't want to spend all my time reviewing code.
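The "highlight the important pieces, gray out the low-entropy ones" idea can be illustrated with a crude non-ML heuristic. This is just a toy sketch, not Cursor's implementation: the names `triage_diff` and `novelty_threshold` are made up here, and string similarity against the old file stands in for whatever learned signal a real model would use to decide which hunks deserve attention.

```python
import difflib

def triage_diff(old: str, new: str, novelty_threshold: float = 0.6):
    """Split a diff's added lines into 'review' (novel) and 'skim'
    (low-entropy, near-duplicates of existing code) buckets.

    Lines very similar to something already in the old file carry
    little new information, so a reviewer could safely skim them.
    """
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    review, skim = [], []
    for line in difflib.unified_diff(old_lines, new_lines, lineterm=""):
        if not line.startswith("+") or line.startswith("+++"):
            continue  # only triage added lines; skip headers and context
        added = line[1:]
        # Highest similarity between this added line and any old line.
        best = max(
            (difflib.SequenceMatcher(None, added, o).ratio() for o in old_lines),
            default=0.0,
        )
        (skim if best >= novelty_threshold else review).append(added)
    return review, skim

old = "def add(a, b):\n    return a + b\n"
new = "def add(a, b):\n    total = a + b\n    log_call('add', a, b)\n    return total\n"
review, skim = triage_diff(old, new)
```

A real system would replace the similarity score with a model's judgment (and could also flag likely bugs, as described above), but the shape of the interface, partitioning diff hunks by how much attention they need, is the same.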
Can you say a little more about the multi-file diff?

Yeah, I mean, so GitHub tries to solve this, right, with code review. When you're doing code review, you're reviewing multiple diffs across multiple files. But like Arvid said earlier, I think you can do much better than code review. Code review kind of sucks: you spend a lot of time trying to grok this code that's often quite unfamiliar to you, and it often doesn't even actually catch that many bugs. And I think you can significantly improve that review experience using language models, for example using the kinds of tricks Arvid described, of maybe pointing you towards the regions that matter. I think also, if the code is produced by these language models and not by someone else: the code review experience is designed for both the reviewer and the person that produced the code. In the case where the person that produced the code is a language model, you don't have to care that much about their experience, and you can design the entire thing around the reviewer, such that the reviewer's job is as fun, as easy, as productive as possible. And I think that feels like the issue with just kind of naively trying to make these things look like code review. I think you can be a lot more creative and push the boundary of what's possible.

Just one idea there: I think ordering matters. Generally, when you review a PR, you have this list of files and you're reviewing them from top to bottom, but actually you want to understand this part first, because it came logically first, and then you want to understand the next part. And you don't want to have to figure that out yourself; you want a model to guide you through the thing.

And is the step of creation going to be more and more natural language? Is that the goal, versus actual code?

I think sometimes. I don't think it's going to be the case that all of programming will be natural language, and the reason for
that is you know if I'm PR programming with swalla and swall is at the computer and the keyboard uh and sometimes if I'm like driving I want to say to swallet hey like implement this function and that that works and then sometimes it's just so annoying to explain to swalla what I want him to do and so I actually take over the keyboard and I show him I I write like part of the example and then it makes sense and that's the easiest way to communicate and so I think that's also the case for AI like sometimes the easiest way to communicate with the AI will be to show an example and then it goes and does the thing everywhere else or sometimes if you're making a website for example the easiest way to show to the a what you want is not to tell it what to do but you know drag things around or draw things um and yeah and and like maybe eventually we will get to like brain machine interfaces or whatever and can of like understand what you're thinking and so I think natural language will have a place I think it will not definitely not be the way most people program most of the time I'm really feeling the AGI with this editor uh it feels like there's a lot of machine learning going on underneath tell tell me about some of the ml stuff that makes it all work recursor really works via this Ensemble of custom models that that that we've trained alongside you know the frontier models that are fantastic at the reasoning intense things and so cursor tab for example is is a great example of where you can specialize this model to be even better than even Frontier models if you look at evls on on the on the task we set it at the other domain which it's kind of surprising that it requires custom models but but it's kind of necessary and works quite well is in apply um so I think these models are like the frontier models are quite good at sketching out plans for code and generating like rough sketches of like the change but actually creating diffs is quite hard um for Frontier models for 
your training models. You try to do this with Sonnet, with o1, any frontier model, and it really messes up stupid things like counting line numbers, especially in super, super large files. So what we've done to alleviate this is: we let the model sketch out this rough code block that indicates what the change will be, and we train a model to then apply that change to the file.

And we should say that Apply — the model looks at your code, it gives you a really damn good suggestion of what new things to do, and the step of combining the two, which seems trivial for humans, you're saying is not so trivial. Contrary to popular perception, it is not a deterministic algorithm.

Yeah, I think you see shallow copies of Apply elsewhere, and it just breaks most of the time, because you think you can try to do some deterministic matching, and then it fails, you know, at least 40% of the time, and that just results in a terrible product experience. I think, in general, this regime will continue — you are going to get smarter models. And one other thing that Apply lets you do is it lets you use fewer tokens with the most intelligent models. That is expensive both in terms of latency, for generating all those tokens, and in cost. So you can give this very, very rough sketch and then have your smaller models go and implement it, because it's a much easier task to implement this very, very sketched-out code. And I think this regime will continue, where you can use smarter and smarter models to do the planning, and then maybe the implementation details can be handled by the less intelligent ones. Perhaps you'll have, you know, maybe o1, maybe even more capable models, given an even higher-level plan that is kind of recursively applied by Sonnet and then the Apply model.

Maybe we should talk about how to make it fast. Yeah, I feel like fast is always an interesting detail. Fast is good. Yeah, how do you make it fast? Yeah, so one big
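As a toy illustration of why a "shallow copy of apply" built on deterministic matching is brittle (this is a made-up sketch, not Cursor's actual Apply, which is a trained model):

```python
def exact_anchor_match(original: str, anchor: str) -> bool:
    """Deterministic 'apply' attempt: locate a model's sketched change
    in the file by exact line matching. Models routinely reformat
    whitespace or paraphrase code, so this anchoring often fails."""
    return anchor in original.splitlines()

file_text = "def add(a, b):\n    return a + b\n"

# The sketch's anchor line matches the file verbatim, so splicing could proceed.
print(exact_anchor_match(file_text, "def add(a, b):"))  # True

# The model dropped a space in its sketch, so exact matching fails —
# one reason a trained apply model is used instead.
print(exact_anchor_match(file_text, "def add(a,b):"))   # False
```

A real apply step would also need to decide where elided regions ("... existing code ...") begin and end, which is exactly the part that is hard to do deterministically.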
component of making it fast is speculative edits. Speculative edits are a variant of speculative decoding, and maybe it would be helpful to briefly describe speculative decoding. With speculative decoding, what you do is take advantage of the fact that, most of the time — and I'll add the caveat that this holds when you're memory-bound in language model generation — if you process multiple tokens at once, it is faster than generating one token at a time. This is the same reason why, if you look at tokens per second for prompt tokens versus generated tokens, it's much, much faster for prompt tokens.

So what we do is, instead of using what speculative decoding normally does — which is using a really small model to predict draft tokens that your larger model then goes in and verifies — with code edits, we have a very strong prior of what the existing code will look like, and that prior is literally the same exact code. So what you can do is just feed chunks of the original code back into the model, and the model will pretty much agree most of the time: okay, I'm just going to spit this code back out. And so you can process all of those lines in parallel, and you just do this with sufficiently many chunks. Then eventually you'll reach a point of disagreement, where the model will now predict text that is different from the ground-truth original code. It'll generate those tokens, and then, after enough tokens match the original code, we decide to restart speculating in chunks of code.

What this actually ends up looking like is just a much faster version of the model rewriting all the code, so we can use the same exact interface that we use for diffs, but it will just stream down a lot faster. And then the advantage is that, while it's streaming, you can also start reviewing the code
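The loop described above can be sketched roughly as follows. This is a toy model of the idea, not Cursor's implementation: `model_next_token` stands in for a real language model, and a real system would score a whole draft chunk in one batched forward pass rather than this sequential inner loop.

```python
def speculative_edit(original_tokens, model_next_token, chunk=4):
    """Toy sketch of speculative edits: propose chunks of the ORIGINAL
    code as draft tokens, accept every token the model agrees with,
    and decode normally at the first point of disagreement."""
    out, i = [], 0
    while i < len(original_tokens):
        draft = original_tokens[i:i + chunk]
        verified = []
        for tok in draft:  # in practice: one batched verification pass
            if model_next_token(out + verified) == tok:
                verified.append(tok)
            else:
                break
        out.extend(verified)
        if len(verified) < len(draft):
            # Disagreement: emit the model's own token, then resume
            # speculating from the corresponding point in the original.
            out.append(model_next_token(out))
            i += len(verified) + 1
        else:
            i += len(draft)
    return out

# Toy "model": its target output is the original with one token edited.
original = ["def", "f", "(", "x", ")", ":", "ret"]
target   = ["def", "g", "(", "x", ")", ":", "ret"]
model = lambda prefix: target[len(prefix)]

print(speculative_edit(original, model))  # == target
```

Every token that matches the original is accepted "for free," which is why the stream looks like a much faster version of the model rewriting the whole file.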
Exactly — before it's done, so there's no big loading screen. Maybe that is part of the advantage: the human can start reading before the thing is done. I think the interesting riff here is that speculation is a fairly common idea nowadays. It's not only in language models; there's obviously speculation in CPUs, there's speculation in databases, and speculation all over the place.

Let me ask the sort of ridiculous question of which LLM is better at coding. GPT, Claude — who wins in the context of programming? And I'm sure the answer is much more nuanced, because it sounds like every single part of this involves a different model.

Yeah, I think there's no model that Pareto-dominates the others, meaning it is better in all the categories that we think matter — the categories being speed, ability to edit code, ability to process lots of code, long context, you know, a couple of other things, and coding capabilities. The one that I'd say right now is just kind of net best is Sonnet. I think this is a consensus opinion. o1 is really interesting, and it's really good at reasoning, so if you give it really hard programming-interview-style problems, or LeetCode problems, it can do quite well on them, but it doesn't feel like it understands your rough intent as well as Sonnet does.

If you look at a lot of the other frontier models, one qualm I have is — I'm not saying they train on benchmarks, but they perform really well on benchmarks relative to, kind of, everything that's in the middle. So if you try them on all these benchmarks, on things that are in the distribution of the benchmarks they're evaluated on, you know, they'll do really well. But when you push them a little bit outside of that, Sonnet's, I think, the one that does best at maintaining that same capability — you kind of have the same capability in the
benchmark as when you try to instruct it to do anything with coding.

Another ridiculous question: what is the difference between the normal programming experience and what benchmarks represent? Where do benchmarks fall short, do you think, when we're evaluating these models? By the way, that's a really hard — like, critically important — detail: how different benchmarks are from real coding. Real coding is not interview-style coding. Humans are saying, like, half-broken English sometimes, and sometimes you're saying, "Oh, do what I did before," sometimes you're saying, "Go add this thing, and then do this other thing for me, and then make this UI element." A lot of things are context-dependent. You really want to understand the human and then do what the human wants — as opposed to, maybe the way to put it abstractly is: the interview problems are very well-specified, they lean a lot on specification, while the human stuff is less specified.

Yeah, I think this question is complicated both by what Sualeh just mentioned and, to what Aman was getting into, by this problem of the skew between what you can actually model in a benchmark versus real programming. And that can sometimes be hard to encapsulate, because real programming is very messy, and sometimes things aren't super well-specified — what's correct or what isn't. But then it's also doubly hard because of this public-benchmark problem. That's both because public benchmarks are sometimes kind of hill-climbed on, and because it's really, really hard to get the data from the public benchmarks out of the models. For instance, one of the most popular agent benchmarks, SWE-bench, is really, really contaminated in the training data of these
foundation models. So if you ask these foundation models to do a SWE-bench problem, but you actually don't give them the context of a codebase, they can hallucinate the right file paths, they can hallucinate the right function names. And so it's also just the public aspect of these things that's tricky.

Yeah, in that case it could be trained on the literal issues or pull requests themselves. And maybe the labs will start to do a better job, or they've already done a good job, at decontaminating those things, but they're not going to omit the actual training data of the repository itself. These are some of the most popular Python repositories — SymPy is one example. I don't think they're going to handicap their models on SymPy and all these popular Python repositories in order to get true evaluation scores in these benchmarks.

Yeah, I think that, given the flaws of public benchmarks, there have been a few interesting crutches that places that build systems with these models, or build these models, actually use to get a sense of whether they're going in the right direction or not. In a lot of places, people will actually just have humans play with the things and give qualitative feedback on them. One or two of the foundation model companies have people for whom that's a big part of their role, and internally we also qualitatively assess these models and actually lean on that a lot, in addition to private evals that we have.

It's like the vibe. Yeah, the vibe benchmark, human benchmark. You pull in the humans to do a vibe check. Yeah. Okay. I mean, that's kind of what I do — just reading online forums and Reddit and X. Well, I don't know how to properly load in people's opinions, because they'll say things like, "I feel like Claude, or GPT, has gotten dumber," or something. They'll say, "I feel like..." — and then I sometimes feel like that too, but I wonder if it's
the model's problem or mine.

Yeah, with Claude there's an interesting take I heard. I think AWS has different chips, and I suspect they have slightly different numerics than Nvidia GPUs, and someone speculated that Claude's degraded performance had to do with maybe using the quantized version that existed on AWS Bedrock versus whatever was running on Anthropic's GPUs.

I interview a bunch of people that have conspiracy theories, so I'm glad you spoke to this conspiracy. Well, it's not a conspiracy theory so much as — humans are humans, and there are these details, and you're doing this crazy amount of flops, and chips are messy, and, man, you can just have bugs. It's hard to overstate how hard bugs are to avoid.

What's the role of a good prompt in all this? You mentioned that benchmarks have really structured, well-formulated prompts. What should a human be doing to maximize success? And what's the importance of — you wrote a blog post on it, you called it prompt design.

Yeah, I think it depends on which model you're using, and all of them are slightly different and respond differently to different prompts. But I think the original GPT-4, and the original breed of models last year, were quite sensitive to the prompts, and they also had a very small context window. And so we have all of these pieces of information around the codebase that would maybe be relevant in the prompt: you have the docs, you have the files that you add, you have the conversation history. And then there's the problem of how you decide what you actually put in the prompt when you have a limited space. Even for today's models, even when you have long context, filling out the entire context window means that it's slower; it means that sometimes the model actually gets confused, and some models get more confused than others. And we have this
one system internally that we call Priompt, which helps us with that a little bit. And I think it was built for the era before, when we had 8,000-token context windows. It's a little bit similar to when you're making a website: you want it to work on mobile, you want it to work on a desktop screen, and you have this dynamic information, which you don't have if, for example, you're designing a print magazine, where you know exactly where you can put stuff. But when you have a website, or when you have a prompt, you have these inputs, and then you need to format them to always work, even if the input is really big — then you might have to cut something down. And so the idea was: okay, let's take some inspiration — what's the best way to design websites? Well, the thing that we really like is React and the declarative approach, where you use JSX in JavaScript, and then you declare, "This is what I want, and I think this has higher priority, or higher z-index, than something else." And then you have this rendering engine — in web design it's Chrome, and in our case it's our prompt renderer — which then fits everything onto the page. So you declaratively decide what you want, and then it figures out how to render it.

And we have found that to be quite helpful, and I think the role of it has shifted over time: initially it was to fit to these small context windows; now it's really useful because it helps us with splitting up the data that goes into the prompt from the actual rendering of it. And so it's easier to debug, because you can change the rendering of the prompt and then try it on old prompts, since you have the raw data that went into the prompt, and then you can see: did my change actually improve it for this entire eval set?

So do you literally prompt with JSX? Yes, yes. It kind of looks like React. There are components. Like, we have
one component that's a file component, and it takes in the cursor — usually there's one line where the cursor is in your file, and that's probably the most important line, because that's the one you're looking at. So then you can give priorities: that line has the highest priority, and then you subtract one for every line that is farther away. And then, eventually, when it's rendered, it figures out how many lines it can actually fit, and it centers around that thing.

That's amazing. Yeah, and you can do other fancy things: if you have lots of code blocks from the entire codebase, you could use retrieval, and things like embedding and re-ranking scores, to add priorities for each of these components.

So should humans, when they ask questions, also try to use something like that? Would it be beneficial to write JSX in the prompt, or is the whole idea that it should be loose and messy? I think our goal is that you should just do whatever is the most natural thing for you, and then our job is to figure out how we actually retrieve the relevant things so that your thing actually makes sense.

Well, this is sort of the discussion I had with Aravind of Perplexity: his whole idea is that you should let the person be as lazy as they want. Yeah, that's a beautiful thing, but I feel like you're allowed to ask more of programmers, right? So if you say, "Just do what you want" — I mean, humans are lazy. There's a kind of tension between just being lazy versus providing more — being prompted, almost like the system pressuring you, or inspiring you, to be articulate, not in terms of the grammar of the sentences, but in terms of the depth of the thoughts that you convey inside the prompts.

I think even as the system gets closer to some level of perfection, often when you ask the model for something, just not enough intent is conveyed to know what to do. And there are a few ways to resolve that
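As a rough sketch of the priority-based rendering idea described here — in Python rather than Priompt's actual TypeScript/JSX, with made-up function names and a crude word count standing in for a real tokenizer:

```python
def file_component(lines, cursor_line):
    # The cursor's line gets the highest priority (0); each line farther
    # away loses one point, so rendering centers on the cursor.
    return [(-abs(i - cursor_line), line) for i, line in enumerate(lines)]

def render_prompt(items, budget):
    """Greedy renderer: keep the highest-priority (priority, text) items
    that fit the token budget, then emit survivors in original order."""
    keep, used = set(), 0
    for idx, (prio, text) in sorted(enumerate(items), key=lambda p: -p[1][0]):
        cost = len(text.split())  # stand-in for a real token count
        if used + cost <= budget:
            keep.add(idx)
            used += cost
    return "\n".join(text for i, (_, text) in enumerate(items) if i in keep)

lines = ["import os", "def main():", "    x = 1", "    return x"]
items = file_component(lines, cursor_line=2)  # cursor on "    x = 1"
print(render_prompt(items, budget=6))  # keeps the lines nearest the cursor
```

Because the declarative data (the items and their priorities) is separate from the rendering pass, you can change the renderer and re-run it on old prompts, which is the debugging property described above.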
intent. One is the simple thing of having the model just ask you: "I'm not sure how to do these parts, based on your query. Could you clarify that?" The other could be: maybe, if there are five or six possible generations given the uncertainty present in your query so far, why don't we just actually show you all of those and let you pick?

How hard is it for the model to choose to talk back, sort of, versus generate? How do you deal with the uncertainty — do I choose to ask for more information to reduce the ambiguity?

So, one of the things we do — it's a recent addition — is try to suggest files that you can add. While you're typing, one can guess what the uncertainty is, and maybe suggest that — you know, maybe you're writing your API, and we can guess, using the commits that you've made previously in the same file, that the client and the server are super useful. There's a hard technical problem of how you resolve, across all commits, which files are the most important given your current prompt. The initial version is rolled out, and I'm sure we can make it much more accurate — it's very experimental — but the idea is we show you: do you just want to add this file, this file, this file, and also tell the model to edit those files for you? Because if you're making the API, maybe you should also edit the client and the server that are using the API. And so that would be kind of cool: there's the phase where you're writing the prompt, and before you even hit enter, maybe we can help resolve some of the uncertainty.

To what degree do you use agentic approaches? How useful are agents? We think agents are really, really cool. I think agents resemble sort of a human — you can kind of feel that you're getting closer
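A toy sketch of the commit-co-occurrence idea mentioned here — the structure is entirely hypothetical (Cursor's actual system is not public), but it shows the crude baseline of ranking files by how often they changed alongside the one you're editing:

```python
from collections import Counter

def suggest_files(commits, current_file, top_k=2):
    """Rank files by how often they were changed in the same commit as
    the file being edited — a rough proxy for 'if you edit the API,
    you probably also need the client and the server'."""
    co = Counter()
    for changed in commits:  # each commit: the set of file paths it touched
        if current_file in changed:
            co.update(f for f in changed if f != current_file)
    return [f for f, _ in co.most_common(top_k)]

commits = [
    {"api.py", "client.py", "server.py"},
    {"api.py", "client.py"},
    {"readme.md"},
]
print(suggest_files(commits, "api.py"))  # ['client.py', 'server.py']
```

A real system would have to weight recency, commit size, and the current prompt, which is the "hard technical problem" the speaker alludes to.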
to AGI, because you see a demo where it acts as a human would.