Yann Lecun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI | Lex Fridman Podcast #416
5t1vTLU7s40 • 2024-03-07
Yann LeCun: I see the danger of this concentration of power through proprietary AI systems as a much bigger danger than everything else. What works against this is people who think that, for reasons of security, we should keep AI systems under lock and key, because it's too dangerous to put them in the hands of everybody. That would lead to a very bad future, in which all of our information diet is controlled by a small number of companies through proprietary systems.

Lex Fridman: I believe that people are fundamentally good, and so if AI, especially open source AI, can make them smarter, it just empowers the goodness in humans.

Yann LeCun: So I share that feeling. I think people are fundamentally good. And in fact, a lot of doomers are doomers because they don't think that people are fundamentally good.

Lex Fridman: The following is a conversation with Yann LeCun, his third time on this podcast. He is the chief AI scientist at Meta, professor at NYU, Turing Award winner, and one of the seminal figures in the history of artificial intelligence. He and Meta AI have been big proponents of open sourcing AI development, and have been walking the walk by open sourcing many of their biggest models, including Llama 2 and eventually Llama 3. Also, Yann has been an outspoken critic of those people in the AI community who warn about the looming danger and existential threat of AGI. He believes that AGI will be created one day, but it will be good. It will not escape human control, nor will it dominate and kill all humans. At this moment of rapid AI development, this happens to be a somewhat controversial position, and so it's been fun seeing Yann get into a lot of intense and fascinating discussions online, as we do in this very conversation. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Yann LeCun.

You've had some strong statements, technical statements, about the future of artificial intelligence recently, throughout your career actually, but recently as well.
You've said that autoregressive LLMs are not the way we're going to make progress towards superhuman intelligence. These are the large language models like GPT-4, like Llama 2 and 3 soon, and so on. How do they work, and why are they not going to take us all the way?

Yann LeCun: For a number of reasons. The first is that there is a number of characteristics of intelligent behavior. For example, the capacity to understand the world, understand the physical world; the ability to remember and retrieve things, persistent memory; the ability to reason; and the ability to plan. Those are four essential characteristics of intelligent systems or entities: humans, animals. LLMs can do none of those, or they can only do them in a very primitive way. They don't really understand the physical world, they don't really have persistent memory, they can't really reason, and they certainly can't plan. And so if you expect a system to become intelligent without having the possibility of doing those things, you're making a mistake. That is not to say that autoregressive LLMs are not useful. They're certainly useful. That they're not interesting, that we can't build a whole ecosystem of applications around them? Of course we can. But as a path towards human-level intelligence, they're missing essential components.

And then there is another tidbit, or fact, that I think is very interesting. Those LLMs are trained on enormous amounts of text: basically the entirety of all publicly available text on the internet. That's typically on the order of 10^13 tokens. Each token is typically two bytes, so that's 2x10^13 bytes as training data. It would take you or me 170,000 years to just read through this at eight hours a day. So it seems like an enormous amount of knowledge that those systems can accumulate. But then you realize it's really not that much data. If you talk to developmental psychologists, they tell you a four-year-old has been awake for 16,000 hours in his life, and the amount of information that has reached the visual cortex of that child in four years is about 10^15 bytes. You can compute this by estimating that the optic nerve carries about 20 megabytes per second, roughly. So, 10^15 bytes for a four-year-old, versus 2x10^13 bytes for 170,000 years' worth of reading. What it tells you is that through sensory input we see a lot more information than we do through language, and that, despite our intuition, most of what we learn and most of our knowledge is through our observation of and interaction with the real world, not through language. Everything that we learn in the first few years of life, and certainly everything that animals learn, has nothing to do with language.

Lex Fridman: So it would be good to maybe push against some of the intuition behind what you're saying. It is true that there are several orders of magnitude more data coming into the human mind, much faster, and the human mind is able to learn very quickly from that, filter the data very quickly. Somebody might argue, with your comparison between sensory data and language: language is already very compressed. It already contains a lot more information than the bytes it takes to store it, if you compare it to visual data. There's a lot of wisdom in language, there's words and the way we stitch them together; it already contains a lot of information. So is it possible that language alone already has enough wisdom and knowledge in there to be able to, from that language, construct a world model, an understanding of the world, an understanding of the physical world that you're saying LLMs lack?

Yann LeCun: It's a big debate among philosophers, and also cognitive scientists, whether intelligence needs to be grounded in reality. I'm clearly in the camp that yes, intelligence cannot appear without some grounding in some reality. It doesn't need to be physical reality; it could be simulated.
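The data-volume comparison made a moment ago (10^13 tokens of text versus a four-year-old's visual input) can be checked with simple arithmetic. The numbers below are the rough estimates quoted in the conversation, not measured values:

```python
# Sanity check of the text-vs-vision data estimates quoted above.
TOKENS = 1e13                 # publicly available text, in tokens (estimate)
BYTES_PER_TOKEN = 2
text_bytes = TOKENS * BYTES_PER_TOKEN           # 2e13 bytes of training data

AWAKE_HOURS = 16_000          # a four-year-old's waking hours (estimate)
OPTIC_NERVE_BPS = 2e7         # ~20 MB/s down the optic nerve (estimate)
visual_bytes = AWAKE_HOURS * 3600 * OPTIC_NERVE_BPS   # ~1.15e15 bytes

# Does "170,000 years at 8 hours/day" imply a plausible reading pace?
reading_hours = 170_000 * 365 * 8
tokens_per_minute = TOKENS / (reading_hours * 60)

print(f"text: {text_bytes:.1e} B, vision: {visual_bytes:.1e} B")
print(f"vision/text ratio: {visual_bytes / text_bytes:.0f}x")
print(f"implied reading pace: {tokens_per_minute:.0f} tokens/min")
```

The four-year-old has taken in roughly 58 times more raw bytes than the entire text corpus, and the implied reading pace (around 335 tokens per minute) is close to an ordinary reading speed, so the two estimates are at least internally consistent.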
But the environment is just much richer than what you can express in language. Language is a very approximate representation of our percepts and our mental models. There are a lot of tasks we accomplish where we manipulate a mental model of the situation at hand, and that has nothing to do with language: everything that's physical, mechanical, whatever. When we build something, when we accomplish a task like grabbing something, etc., we plan our action sequences, and we do this by essentially imagining the outcome of a sequence of actions. That requires mental models that don't have much to do with language, and I would argue most of our knowledge is derived from that interaction with the physical world. A lot of my colleagues who are more interested in things like computer vision are really in that camp, that AI needs to be embodied, essentially. And then other people coming from the NLP side, or maybe some other motivation, don't necessarily agree with that. And philosophers are split as well.

The complexity of the world is hard to imagine. It's hard to represent all the complexities that we take completely for granted in the real world, that we don't even imagine require intelligence. This is the old Moravec's paradox, from the robotics pioneer Hans Moravec, who said: how is it that with computers it seems to be easy to do high-level, complex tasks like playing chess and solving integrals, whereas the things we take for granted that we do every day, like learning to drive a car or grabbing an object, we can't do with computers? We have LLMs that can pass the bar exam, so they must be smart. But then they can't learn to drive in 20 hours like any 17-year-old. They can't learn to clear out the dinner table and fill up the dishwasher like any 10-year-old can learn in one shot. Why is that? What are we missing? What type of learning or reasoning architecture are we missing that basically prevents us from having level-five self-driving cars and domestic robots?

Lex Fridman: Can a large language model construct a world model that does know how to drive and does know how to fill a dishwasher, but just doesn't know how to deal with visual data at this time? So it can operate in a space of concepts.

Yann LeCun: Yeah, that's what a lot of people are working on. So the short answer is no. The more complex answer is: you can use all kinds of tricks to get an LLM to basically digest visual representations of images, or video, or audio for that matter. A classical way of doing this is: you train a vision system in some way (and we have a number of ways to train vision systems: supervised, semi-supervised, self-supervised, all kinds of different ways) that will turn any image into a high-level representation, basically a list of tokens that are really similar to the kind of tokens a typical LLM takes as input. Then you just feed that to the LLM, in addition to the text, and you just expect the LLM, during training, to be able to use those representations to help make decisions. There's been work along those lines for quite a long time, and now you see those systems: there are LLMs that have some vision extension. But they're basically hacks, in the sense that those things are not trained end to end to really understand the world. They're not trained with video, for example. They don't really understand intuitive physics, at least not at the moment.

Lex Fridman: So you don't think there's something special about intuitive physics, about sort of common-sense reasoning about physical space, about physical reality? That, to you, is a giant leap that LLMs are just not able to make?

Yann LeCun: We're not going to be able to do this with the type of LLMs that we are working with today, and there's a number of reasons for this. But the main reason is: the way LLMs are trained is that you take a piece of text, you remove some of the words in that text, you mask them, you replace them by blank markers, and you train a gigantic neural net to predict the words that are missing. If you build this neural net in a particular way, so that it can only look at words that are to the left of the one it's trying to predict, then what you have is a system that basically is trying to predict the next word in a text. So then you can feed it a text, a prompt, and ask it to predict the next word. It can never predict the next word exactly, so what it's going to do is produce a probability distribution over all the possible words in your dictionary. In fact, it doesn't predict words, it predicts tokens, which are kind of subword units, so it's easy to handle the uncertainty in the prediction there, because there's only a finite number of possible tokens in the dictionary, and you can just compute a distribution over them. Then the system picks a word from that distribution. Of course, there's a higher chance of picking words that have a higher probability within that distribution, so you sample from that distribution to actually produce a word, and then you shift that word into the input. That allows the system to predict the second word. Once you do this, you shift it into the input, etc. That's called autoregressive prediction, which is why those LLMs should be called autoregressive LLMs, but we just call them LLMs. And there is a difference between this kind of process and a process by which, before producing a word, when you and I talk (you and I are bilingual), we think about what we're going to say, and it's relatively independent of the language in which we're going to say it.
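The sampling loop just described can be sketched in a few lines. The "model" here is a stand-in (a fixed random table of next-token logits rather than a trained network), just to show the shift-and-sample mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50                          # toy vocabulary size
# Stand-in for a trained LLM: here the logits depend only on the last token.
logit_table = rng.normal(size=(VOCAB, VOCAB))

def next_token_distribution(context):
    """Probability distribution over the next token (softmax of the logits)."""
    logits = logit_table[context[-1]]
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

def generate(prompt, n_steps):
    tokens = list(prompt)
    for _ in range(n_steps):
        p = next_token_distribution(tokens)
        tok = rng.choice(VOCAB, p=p)    # sample from the distribution
        tokens.append(int(tok))         # shift the sampled token into the input
    return tokens

out = generate([3, 14, 15], n_steps=10)
print(out)
```

Each step commits to one sampled token before the next is even considered; there is no lookahead and no plan, which is exactly the property being criticized here.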
When we talk about, I don't know, let's say a mathematical concept or something, the kind of thinking we're doing, and the answer we're planning to produce, is not linked to whether we're going to say it in French or Russian or English.

Lex Fridman: Chomsky just rolled his eyes, but I understand. So you're saying that there's a bigger abstraction that goes before language and maps onto language?

Yann LeCun: Right. It's certainly true for a lot of the thinking that we do.

Lex Fridman: Is that obvious? Like, you're saying your thinking is the same in French as it is in English?

Yann LeCun: Yeah, pretty much.

Lex Fridman: Pretty much? Or how flexible are you? Like, if there's a probability distribution...

Yann LeCun: Well, it depends what kind of thinking. If it's producing puns, I'm much better in French than English.

Lex Fridman: No, but is there an abstract representation of puns? Like, is your humor abstract? When you tweet, and your tweets are sometimes a little bit spicy, is there an abstract representation in your brain of a tweet before it maps onto English?

Yann LeCun: There is an abstract representation of imagining the reaction of a reader to that text.

Lex Fridman: So you start with laughter and then figure out how to make that happen?

Yann LeCun: Or you figure out a reaction you want to cause, and then figure out how to say it so that it causes that reaction. But that's really close to language. Think about a mathematical concept, or imagining something you want to build out of wood, or something like that. The kind of thinking you're doing has absolutely nothing to do with language, really. It's not like you necessarily have an internal monologue in any particular language. You're imagining mental models of the thing. If I ask you to imagine what this water bottle will look like if I rotate it 90 degrees, that has nothing to do with language. So clearly there is a more abstract level of representation in which we do most of our thinking, and we plan what we're going to say, whether the output is uttered words or muscle actions. We plan our answer before we produce it. And LLMs don't do that. They just produce one word after the other, instinctively, if you want. It's a bit like subconscious actions: you're distracted, you're doing something, you're completely concentrated, and someone comes to you and asks you a question, and you kind of answer the question. You don't have time to think about the answer, but the answer is easy, so you don't need to pay attention. You respond automatically. That's kind of what an LLM does. It doesn't think about its answer, really. It retrieves it, because it's accumulated a lot of knowledge, so it can retrieve some things, but it's going to just spit out one token after the other without planning the answer.

Lex Fridman: But you're making it sound like one-token-at-a-time generation is bound to be simplistic. If the world model is sufficiently sophisticated, then the most likely sequence of tokens it generates, one at a time, is going to be a deeply profound thing.

Yann LeCun: Okay, but then that assumes that those systems actually possess an internal world model.

Lex Fridman: So it really goes to... I think the fundamental question is: can you build a really complete world model? Not complete, but one that has a deep understanding of the world.

Yann LeCun: Yeah. So, can you build this, first of all, by prediction? And the answer is probably yes. Can you build it by predicting words? And the answer is most probably no, because language is very poor, or weak, or low-bandwidth if you want: there's just not enough information there. So building world models means observing the world, and understanding why the world is evolving the way it is. And then the extra component of a world model is something that can predict how the world is going to evolve as a consequence of an action you might take. A world model really is: here is my idea of the state of the world at time t, here is an action I might take; what is the predicted state of the world at time t+1? Now, that state of the world does not need to represent everything about the world. It just needs to represent enough that's relevant for the planning of the action, not necessarily all the details. Now, here is the problem: you're not going to be able to do this with generative models. So, a generative model trained on video: we've tried to do this for ten years. You take a video, show the system a piece of video, and then ask it to predict the remainder of the video. Basically predict what's going to happen, one frame at a time or a group of frames at a time: the same thing as the autoregressive LLMs do, but for video. The idea of a large video model, if you want, has been floating around for a long time, and at FAIR some colleagues and I have been trying to do this for about ten years. And you can't really do the same trick as with LLMs, because, as I said, you can't predict exactly which word is going to follow a sequence of words, but you can predict the distribution over words. Now, if you go to video, what you would have to do is predict the distribution over all possible frames in a video, and we don't really know how to do that properly. We do not know how to represent distributions over high-dimensional continuous spaces in ways that are useful. There lies the main issue. And the reason we can't do this is because the world is incredibly more complicated and richer, in terms of information, than text. Text is discrete; video is high-dimensional and continuous, with a lot of details in it.
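The world-model definition above ("state at time t plus an action gives a predicted state at time t+1") maps onto a simple interface. Everything here is a toy stand-in (a linear dynamics model and a random-search planner), meant only to show how a predictive model supports planning by imagined rollouts:

```python
import numpy as np

rng = np.random.default_rng(1)
STATE_DIM, ACTION_DIM = 4, 2
A = rng.normal(scale=0.5, size=(STATE_DIM, STATE_DIM))  # toy dynamics (assumed)
B = rng.normal(scale=0.5, size=(STATE_DIM, ACTION_DIM))

def world_model(state, action):
    """Predicted next state: s(t+1) = f(s(t), a(t)). Toy linear stand-in."""
    return A @ state + B @ action

def plan(state, goal, horizon=5, n_candidates=200):
    """Pick the action sequence whose *imagined* rollout ends nearest the goal."""
    best, best_cost = None, np.inf
    for _ in range(n_candidates):
        actions = rng.normal(size=(horizon, ACTION_DIM))
        s = state
        for a in actions:                 # roll the model forward in imagination
            s = world_model(s, a)
        cost = np.linalg.norm(s - goal)   # how far the imagined outcome lands
        if cost < best_cost:
            best, best_cost = actions, cost
    return best, best_cost

s0, goal = np.zeros(STATE_DIM), np.ones(STATE_DIM)
actions, cost = plan(s0, goal)
print(f"best imagined final-state error: {cost:.3f}")
```

The planner never acts in the world while deliberating; it only queries the model, which is the "imagining the result of a sequence of actions" described above.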
So if I take a video of this room, and the video is a camera panning around, there is no way I can predict everything that's going to be in the room as I pan around. The system cannot predict what's going to be in the room as the camera is panning. Maybe it's going to predict that this is a room where there's a light and there is a wall, and things like that. It can't predict what the painting on the wall looks like, or what the texture of the couch looks like, certainly not the texture of the carpet. There's no way I can predict all those details. So one possible way to handle this, which we've been working on for a long time, is to have a model that has what's called a latent variable. The latent variable is fed to a neural net, and it's supposed to represent all the information about the world that you don't perceive yet, and that you need to augment the system with for the prediction to do a good job at predicting pixels, including the fine texture of the carpet and the couch and the painting on the wall. That has been a complete failure, essentially. We've tried lots of things. We tried straight neural nets, we tried GANs, we tried VAEs, all kinds of regularized autoencoders. We also tried those kinds of methods to learn good representations of images or video that could then be used as input to, for example, an image classification system, and that also basically failed. All the systems that attempt to predict missing parts of an image or video from a corrupted version of it (take an image or a video, corrupt it or transform it in some way, then try to reconstruct the complete video or image from the corrupted version, and hope that internally the system will develop good representations of images that you can use for object recognition, segmentation, whatever): that has been essentially a complete failure. And it works really well for text. That's the principle that is used for LLMs.

Lex Fridman: So where is the failure exactly? Is it that it's very difficult to form a good representation of an image, a good embedding of all the important information in the image? Is it in terms of the consistency from image to image to image, the images that form the video? If we did a highlight reel of all the ways you failed, what would that look like?

Yann LeCun: Okay, so first of all, I have to tell you exactly what doesn't work, because there is something else that does work. The thing that does not work is training a system to learn representations of images by training it to reconstruct a good image from a corrupted version of it. That's what doesn't work. And we have a whole slew of techniques for this that are variants of denoising autoencoders: something called MAE, developed by some of my colleagues at FAIR, masked autoencoder. It's basically like LLMs or things like that, where you train the system by corrupting text, except you corrupt images. You remove patches from the image, and you train a gigantic neural net to reconstruct it. The features you get are not good, and you know they're not good because if you now take the same architecture but train it supervised, with labeled data, with textual descriptions of images, etc., you do get good representations, and the performance on recognition tasks is much better than with this self-supervised pretraining.

Lex Fridman: So the architecture is good?

Yann LeCun: The architecture is good, the architecture of the encoder is good. But the fact that you train the system to reconstruct images does not lead it to learn good generic features of images, when you train in a self-supervised way.

Lex Fridman: Self-supervised by reconstruction.

Yann LeCun: Yeah, by reconstruction.

Lex Fridman: Okay, so what's the alternative?

Yann LeCun: The alternative is joint embedding.

Lex Fridman: What is joint embedding? What are these architectures that you're so excited about?

Yann LeCun: Okay, so now, instead of training a system to encode the image and then training it to reconstruct the full image from a corrupted version, you take the full image, you take the corrupted or transformed version, and you run them both through encoders, which in general are identical, but not necessarily. Then you train a predictor on top of those encoders to predict the representation of the full input from the representation of the corrupted one. So: joint embedding, because you're taking the full input and the corrupted or transformed version, running them both through encoders, so you get a joint embedding. And then you're saying: can I predict the representation of the full one from the representation of the corrupted one? I call this a JEPA, which means Joint Embedding Predictive Architecture, because it's joint embedding, and there is this predictor that predicts the representation of the good guy from the bad guy. And the big question is: how do you train something like this? Until five or six years ago, we didn't have particularly good answers for how to train those things, except for one, called contrastive learning. The idea of contrastive learning is: you take a pair of images that are, again, an image and a corrupted, degraded, or transformed version of the original one, and you train the predicted representation to be the same as the representation of the original. If you only do this, the system collapses: it basically completely ignores the input and produces representations that are constant. So the contrastive methods avoid this, and those things have been around since the early '90s (I had a paper on this in 1993): you also show pairs of images that you know are different, and then you push their representations away from each other. So you say: not only should representations of things we know are the same be similar.
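A minimal version of the contrastive recipe just described: pull an embedding toward the embedding of its clean counterpart, and push it away from embeddings of unrelated images. The encoder is a toy linear map, and the hinge-style loss is one common variant, not the specific formulation of the 1993 paper:

```python
import numpy as np

rng = np.random.default_rng(2)
D_IN, D_EMB, MARGIN = 8, 4, 1.0
W = rng.normal(scale=0.1, size=(D_EMB, D_IN))   # toy linear encoder

def encode(x):
    return W @ x

def contrastive_loss(x, x_corrupted, x_different):
    """Pull the matched pair together; push the mismatched pair apart up to a margin."""
    z, z_pos, z_neg = encode(x), encode(x_corrupted), encode(x_different)
    pull = np.sum((z - z_pos) ** 2)                       # same content: be close
    push = max(0.0, MARGIN - np.linalg.norm(z - z_neg))   # different: stay apart
    return pull + push

x = rng.normal(size=D_IN)
x_corr = x + 0.05 * rng.normal(size=D_IN)   # mild corruption of the same image
x_diff = rng.normal(size=D_IN)              # an unrelated image
print(f"loss: {contrastive_loss(x, x_corr, x_diff):.3f}")
```

Without the `push` term, the loss is minimized by W = 0, which makes every embedding identical regardless of input: exactly the collapse that negative pairs guard against.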
Representations of things that we know are different should be different. That prevents the collapse, but it has some limitations, and there's a whole bunch of techniques that have appeared over the last six or seven years that can revive this type of method, some of them from FAIR, some of them from Google and other places. But there are limitations to those contrastive methods. What has changed in the last three or four years is that now we have methods that are non-contrastive, so they don't require those negative contrastive samples, images that we know are different. You train them only with images that are different versions or different views of the same thing, and you rely on some other tricks to prevent the system from collapsing. And we have a dozen different methods for this now.

Lex Fridman: So what is the fundamental difference between joint embedding architectures and LLMs? Can JEPA take us to AGI? Whether we should say that you don't like the term AGI... we'll probably argue. I think every single time I've talked to you, we've argued about the G in AGI.

Yann LeCun: Yes.

Lex Fridman: I get it, I get it. We'll probably continue to argue about it. It's great. You like AMI, because you like French, and "ami" is friend in French.

Yann LeCun: Yes, and AMI stands for Advanced Machine Intelligence.

Lex Fridman: Either way, can JEPA take us towards that advanced machine intelligence?

Yann LeCun: Well, it's a first step. So first of all, what's the difference with generative architectures like LLMs? LLMs, or vision systems that are trained by reconstruction, generate the inputs. They generate the original input, non-corrupted, non-transformed, so you have to predict all the pixels, and there is a huge amount of resources spent in the system on actually predicting all those pixels, all the details. In a JEPA, you're not trying to predict all the pixels; you're only trying to predict an abstract representation of the inputs. And that's much easier in many ways. So what the JEPA system is trying to do when it's being trained is extract as much information as possible from the input, but only extract information that is relatively easily predictable. There are a lot of things in the world that we cannot predict. For example, if you have a self-driving car driving down the street or road, there may be trees around the road, and it could be a windy day, so the leaves on the trees are moving in semi-chaotic, random ways that you can't predict, and you don't care; you don't want to predict them. So what you want is for your encoder to basically eliminate all those details. It will tell you there are moving leaves, but it's not going to keep the details of exactly what's going on. And so when you do the prediction in representation space, you're not going to have to predict every single pixel of every leaf. Not only is that a lot simpler, but it also allows the system to essentially learn an abstract representation of the world, where what can be modeled and predicted is preserved, and the rest is viewed as noise and eliminated by the encoder. So it kind of lifts the level of abstraction of the representation. If you think about it, this is something we do absolutely all the time. Whenever we describe a phenomenon, we describe it at a particular level of abstraction. We don't describe every natural phenomenon in terms of quantum field theory; that would be impossible. So we have multiple levels of abstraction to describe what happens in the world, starting from quantum field theory, to atomic theory and molecules, to chemistry, materials, and all the way up to concrete objects in the real world and things like that. We can't just model everything at the lowest level. And that's what the idea of JEPA is.
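The generative-versus-JEPA contrast above can be written as two loss functions. These are schematic stand-ins (linear encoders and decoders, squared-error losses), not any published model; the point is only where the prediction error is measured: in pixel space versus in representation space.

```python
import numpy as np

rng = np.random.default_rng(3)
D_PIX, D_REP = 64, 8
enc = rng.normal(scale=0.1, size=(D_REP, D_PIX))   # toy encoder
dec = rng.normal(scale=0.1, size=(D_PIX, D_REP))   # toy decoder (generative path)
pred = np.eye(D_REP)                               # toy predictor (JEPA path)

x = rng.normal(size=D_PIX)          # "full" input
mask = rng.random(D_PIX) > 0.5
x_corr = np.where(mask, x, 0.0)     # corrupted input: half the "pixels" masked

# Generative objective: reconstruct every pixel of the full input.
generative_loss = np.sum((dec @ (enc @ x_corr) - x) ** 2)

# JEPA objective: predict the *representation* of the full input instead.
jepa_loss = np.sum((pred @ (enc @ x_corr) - enc @ x) ** 2)

print(f"pixel-space targets: {D_PIX}, representation-space targets: {D_REP}")
```

The JEPA branch never has to account for unpredictable pixel detail (the "moving leaves"); the encoder is free to discard it, because the error is measured only on what the encoder keeps.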
It's really about learning abstract representations in a self-supervised manner, and you can do it hierarchically as well. So that, I think, is an essential component of an intelligent system. In language, we can get away without doing this, because language is already, to some level, abstract, and has already eliminated a lot of information that is not predictable. So we can get away without lifting the abstraction level, by directly predicting words.

Lex Fridman: So joint embedding is still generative, but it's generative in this abstract representation space?

Yann LeCun: Yeah.

Lex Fridman: And you're saying we were lazy with language, because we already got the abstract representation for free. And now we have to zoom out and actually think about generally intelligent systems: we have to deal with the full mess of physical reality. And you do have to take this step of jumping from the full, rich, detailed reality to an abstract representation of that reality, based on which you can then reason, and all that kind of stuff.

Yann LeCun: Right. And the thing is, those self-supervised algorithms that learn by prediction, even in representation space, learn more concepts if the input data you feed them is more redundant. The more redundancy there is in the data, the more they're able to capture some internal structure of it. And there is way more redundancy and structure in perceptual input, sensory input like vision, than there is in text, which is not nearly as redundant. This is back to the question you were asking a few minutes ago: language might represent more information, really, because it's already compressed. You're right about that, but that means it's also less redundant, and so self-supervision will not work as well.

Lex Fridman: Is it possible to join the self-supervised training on visual data and the self-supervised training on language data? There is a huge amount of knowledge, even though you talk down about those 10^13 tokens.
those 10 to the 13 tokens represent the entirety a large fraction of what US humans have figured out both the talk on Reddit and the contents of all the books and the Articles and the full spectrum of human uh intellectual creation so is it possible to join those two together well eventually yes but I think uh if we do this too early we run the risk of being tempted to cheat and in fact that's what people are doing at the moment with vision language model we're basically cheating we are using uh language as a crutch to help the deficiencies of our uh Vision systems to kind of learn good representations from uh images and video and uh the problem with this is that we might you know improve our uh visual language system a bit I mean our language models by you know feeding them image but we're not going to get to the level of even the intelligence or level of understanding of the world of a cat or dog which doesn't have language you know they don't have language and they understand the world much better than any llm they can plan really complex actions and sort of imagine the result of a bunch of actions how do we get machines to learn that before we combine that with language obviously if we combine this with language this is going to be a winner um but but before that we have to focus on like how do we get systems to learn how the world works so this kind of joint embedding predictive architecture for you that's going to be able to learn something like Common Sense something like what a cat uses to predict how to mess with its owner most optimally by knocking over a thing that's that's the Hope in fact the techniques we're using are non-contrastive uh so not only is the architecture non generative the learning procedures we're using are non contrastive we have two two sets of techniques one set is based on distillation and there's a number of uh methods that use this principle uh one by Deep Mind Bol a couple by by Fair one one called uh VRA and another one called 
I-JEPA. And VICReg, I should say, is not actually a distillation method, but I-JEPA and BYOL certainly are. And there's another one also called DINO, also produced at FAIR. The idea of those things is that you take the full input, let's say an image, and you run it through an encoder, which produces a representation. Then you corrupt that input, or transform it, and run it through essentially what amounts to the same encoder, with some minor differences. Then you train a predictor, sometimes the predictor is very simple, sometimes it doesn't exist, but you train the predictor to predict the representation of the first, uncorrupted input from the corrupted input. But you only train the second branch: you only train the part of the network that is fed with the corrupted input. The other network, you don't train. But since they share the same weights, when you modify the first one, it also modifies the second one. And with various tricks, you can prevent the system from collapsing, the kind of collapse I was explaining before, where the system basically ignores the input. So that works very well. The two techniques we've developed at FAIR, DINO and I-JEPA, work really well for that. So what kind of data are we talking about here? There are several scenarios. One scenario is that you take an image and corrupt it by changing the cropping, for example, changing the size a little bit, maybe changing the orientation, blurring it, changing the colors, doing all kinds of horrible things to it. But basic horrible things? Basic horrible things that degrade the quality a little bit and change the framing, crop the image. And in some cases, in the case of I-JEPA, you don't need to do any of this: you just mask some parts of it. You basically remove some regions, like a big block, essentially, and then run it through the encoders and train the entire system, including the predictor, to predict the representation of the
good, uncorrupted one from the representation of the corrupted one. So that's I-JEPA. It doesn't need to know that it's an image, for example, because the only thing it needs to know is how to do the masking, whereas with DINO you need to know it's an image, because you need to do things like geometric transformations and blurring, things that are really image-specific. A more recent version of this that we have is called V-JEPA. It's basically the same idea as I-JEPA, except it's applied to video. So now you take a whole video and you mask a whole chunk of it, and what we mask is actually kind of a temporal tube: a whole segment of each frame, over the entire video. And that tube is statically positioned throughout the frames? Literally a straight tube? The tube, yeah, is typically 16 frames or something, and we mask the same region over the entire 16 frames. It's a different one for every video, obviously. And then, again, you train that system to predict the representation of the full video from the partially masked video. That works really well. It's the first system we have that learns good representations of video, so that when you feed those representations to a supervised classifier head, it can tell you what action is taking place in the video with pretty good accuracy. It's the first time we've gotten something of that quality. So that's a good test that a good representation is formed. That means there's something to this. Yeah. We also have preliminary results that seem to indicate that the representation allows our system to tell whether a video is physically possible or completely impossible, because some object disappeared, or an object suddenly jumped from one location to another, or changed shape, or something. So it's able to capture some physics-based constraints about the reality represented in the video, about the appearance and disappearance of objects.
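The non-contrastive distillation recipe described here (in the spirit of BYOL, DINO, and I-JEPA) can be sketched as a toy training loop. This is only an illustrative stand-in, not FAIR's implementation: the "encoders" are single linear maps, the predictor is taken to be the identity, the masking region is fixed, and the names (`W_online`, `W_target`, `corrupt`) are my own. The essential structure is there, though: the corrupted branch is trained to predict the clean branch's representation, gradients flow only through the corrupted branch, and the clean-branch weights follow by exponential moving average.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_REP = 16, 8

# Toy "encoders": single linear maps (real systems use deep networks).
W_online = rng.normal(scale=0.1, size=(D_REP, D_IN))  # trained branch
W_target = W_online.copy()                            # EMA copy; never trained directly

def corrupt(x):
    """I-JEPA-style corruption: zero out a contiguous block of the input."""
    xc = x.copy()
    xc[4:12] = 0.0  # mask a fixed region, for simplicity
    return xc

x = rng.normal(size=D_IN)   # one "image" as a flat vector
xc = corrupt(x)

losses = []
for _ in range(200):
    z_target = W_target @ x  # clean representation; treated as a constant (stop-gradient)
    z_pred = W_online @ xc   # corrupted branch tries to predict the clean representation
    err = z_pred - z_target
    losses.append(float(np.mean(err ** 2)))
    # Gradient step on the online (corrupted) branch only.
    W_online -= 0.01 * np.outer(2.0 * err / D_REP, xc)
    # Target branch follows slowly via exponential moving average.
    W_target = 0.999 * W_target + 0.001 * W_online
```

The stop-gradient on the target branch, together with the slow EMA update, is one of the "various tricks" that keeps a setup like this from collapsing to a representation that ignores the input.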
Yeah, that's really cool. Okay, but can this actually get us to the kind of world model that understands enough about the world to be able to drive a car? Possibly. It's going to take a while before we get to that point, but there are systems already that are based on this idea. What you need for this is a slightly modified version, where you imagine that you have a complete video, and what you do to this video is either translate it in time towards the future, so you only see the beginning of the video but not the latter part that is in the original, or you just mask the second half of the video, for example. Then you train a JEPA system of the type I described to predict the representation of the full video from the shifted one, but you also feed the predictor with an action. For example, the wheel is turned 10 degrees to the right, or something. So if it's a dash cam in a car and you know the angle of the wheel, you should be able to predict, to some extent, what's going to happen to what you see. You're not going to be able to predict all the details of objects that appear in the view, obviously, but at an abstract representation level you can probably predict what's going to happen. So now what you have is an internal model that says: here is my idea of the state of the world at time t, here is an action I'm taking, and here is a prediction of the state of the world at time t+1, t plus delta t, t plus two seconds, whatever it is. If you have a model of this type, you can use it for planning. So now you can do what LLMs cannot do, which is plan what you're going to do so as to arrive at a particular outcome, or satisfy a particular objective. So you can have a number of objectives.
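The planning loop described here, roll an action-conditioned world model forward and pick the action sequence whose predicted final state best satisfies the objective, can be sketched with a toy one-dimensional "world". Everything in this sketch is an illustrative assumption: the dynamics function, the cost, and the random-shooting search are stand-ins; in the actual proposal the state would be a learned abstract representation and `world_model` a trained predictor.

```python
import numpy as np

def world_model(s, a):
    """Hypothetical learned dynamics: a push of `a` cm moves the bottle 0.8*a cm."""
    return s + 0.8 * a

def rollout(s0, actions):
    """Predict the final state after applying a sequence of actions."""
    s = s0
    for a in actions:
        s = world_model(s, a)
    return s

def plan(s0, goal, horizon=3, n_candidates=500):
    """Planning by random shooting: sample candidate action sequences, roll each
    through the world model, and keep the one whose predicted final state is
    closest to the goal. This is inference-time search, not learning."""
    rng = np.random.default_rng(0)
    best_actions, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = rng.uniform(-5.0, 5.0, size=horizon)
        cost = abs(rollout(s0, actions) - goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

best_actions, best_cost = plan(s0=0.0, goal=10.0)
```

Replacing the random search with a smarter optimizer over the action sequence gives the classical model-predictive-control setup discussed next.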
I can predict that if I have an object like this and I open my hand, it's going to fall. And if I push it with a particular force on the table, it's going to move. If I push the table itself, it's probably not going to move with the same force. We have this internal model of the world in our minds, which allows us to plan sequences of actions to arrive at a particular goal. So now, if you have this world model, you can imagine a sequence of actions, predict what the outcome of that sequence is going to be, measure to what extent the final state satisfies a particular objective, like moving the bottle to the left side of the table, and then plan a sequence of actions that will minimize this objective. At runtime, we're not talking about learning, we're talking about inference time. So this is planning, really. And in optimal control, this is a very classical thing: it's called model predictive control. You have a model of the system you want to control that can predict the sequence of states corresponding to a sequence of commands, and you plan a sequence of commands so that, according to your world model, the end state of the system will satisfy an objective that you fix. This is the way rocket trajectories have been planned since computers have been around, so since the early 60s, essentially. So, yes, for model predictive control. But you also often talk about hierarchical planning. Can hierarchical planning emerge from this somehow? Well, no. You will have to build a specific architecture to allow for hierarchical planning. Hierarchical planning is absolutely necessary if you want to plan complex actions. If I want to go from, let's say, New York to Paris, this is the example I use all the time, and I'm sitting in my office at NYU, the objective that I need to minimize is my distance to Paris, at a high level, a very abstract representation of my location. I would have to decompose this into
two subgoals. The first one is: go to the airport. The second one is: catch a plane to Paris. Okay, so my subgoal is now going to the airport, and my objective function is my distance to the airport. How do I go to the airport? I have to go into the street and hail a taxi, which you can do in New York. Okay, now I have another subgoal: go down onto the street. That means going to the elevator, going down the elevator, and walking out onto the street. How do I go to the elevator? I have to stand up from my chair, open the door of my office, go to the elevator, and push the button. How do I get up from my chair? You can imagine going down all the way to basically what amounts to millisecond-by-millisecond muscle control. And obviously you're not going to plan your entire trip from New York to Paris in terms of millisecond-by-millisecond muscle control. First, that would be incredibly expensive, but it would also be completely impossible, because you don't know all the conditions of what's going to happen: how long it's going to take to catch a taxi, or to get to the airport with traffic. You would have to know exactly the condition of everything to be able to do this planning, and you don't have that information. So you have to do hierarchical planning, so that you can start acting and then replan as you go. And nobody really knows how to do this in AI. Nobody knows how to train a system to learn the appropriate multiple levels of representation so that hierarchical planning works. Does something like that already emerge? Like, can you use a state-of-the-art LLM to get you from New York to Paris by doing exactly the kind of detailed set of questions that you just did? Which is: can you give me a list of 10 steps I need to do to get from New York to Paris? And then, for each of those steps, can you give me a list of 10 steps for how I make that step happen? And for each of those steps, can you give me a list of 10
steps to make each one of those happen, until you're moving your individual muscles? Or maybe not muscles, but whatever you can actually act upon using your mind. Right. So there are a lot of questions implied by this. The first thing is, LLMs will be able to answer some of those questions, down to some level of abstraction, under the condition that they've been trained with similar scenarios in their training set. They would be able to answer all those questions, but some of the answers may be hallucinated, meaning non-factual. Yeah, true. I mean, they will probably produce some answer, except they're not going to be able to really produce the millisecond-by-millisecond muscle control of how you stand up from your chair. So, down to some level of abstraction where you can describe things by words, they might be able to give you a plan, but only under the condition that they've been trained to produce those kinds of plans. They're not going to be able to plan for situations they've never encountered before; they're basically going to have to regurgitate the template that they've been trained on. But just for the example of New York to Paris, where is it going to start getting into trouble? At which layer of abstraction? Because I can imagine almost every single part of that an LLM will be able to answer somewhat accurately, especially when you're talking about New York and Paris, major cities. Certainly an LLM would be able to solve that problem if you fine-tune it for it. So I can't say that an LLM cannot do this. It can do this if you train it for it, there's no question, down to a certain level, where things can be formulated in terms of words. But if you want to go down to how you climb down the stairs, or just stand up from your chair, in terms of words, you can't do it. That's one of the reasons you need experience of the physical world, which is much higher bandwidth than what you can express in words, in human language.
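The recursive "list of steps, then sub-steps" probe can be written down directly. The `fake_llm` below is a hypothetical stand-in, just a lookup table of plans it has "seen in training"; tasks outside the table cannot be decomposed further, which is a crude analogue of the point that an LLM's plans bottom out where its training templates (and words themselves) run out.

```python
# Plans the stand-in "LLM" has been trained on; anything else it cannot decompose.
FAKE_PLANS = {
    "go from New York to Paris": ["go to the airport", "catch a plane to Paris"],
    "go to the airport": ["go down to the street", "hail a taxi"],
    "go down to the street": ["walk to the elevator", "take it down", "walk out"],
}

def fake_llm(task):
    """Hypothetical planner: returns sub-steps only for tasks seen 'in training'."""
    return FAKE_PLANS.get(task)

def decompose(task, depth=0, max_depth=3):
    """Recursively ask for sub-steps; stop where the model has no plan (a leaf)."""
    steps = fake_llm(task) if depth < max_depth else None
    if steps is None:
        return [task]  # below this level, words (and templates) run out
    plan = []
    for step in steps:
        plan.extend(decompose(step, depth + 1, max_depth))
    return plan

leaf_plan = decompose("go from New York to Paris")
```

Note that the recursion stops long before anything like muscle control: the leaves are still word-level actions, which is exactly the limitation being discussed.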
So, everything we've been talking about with the joint embedding space: is it possible that that's what we need for the interaction with physical reality, on the robotics front, and then the LLMs are the thing that sits on top of it for the bigger reasoning, like the fact that I need to book a plane ticket and I need to know how to go to the websites, and so on? Sure. And a lot of the plans that people know about, that are relatively high-level, are actually learned. Most people don't invent plans by themselves. We have some ability to do this, of course, obviously, but most plans that people use are plans that they've been trained on: they've seen other people use those plans, or they've been told how to do things. You can't invent plans from nothing. Take a person who's never heard of airplanes and ask them how to go from New York to Paris; they're probably not going to be able to deconstruct the whole plan unless they've seen examples of that before. So certainly LLMs are going to be able to do this. But then, how do you link this to the low level of actions that need to be done, with things like JEPA that basically lift the abstraction level of the representation without attempting to reconstruct every detail of the situation? That's what we need JEPAs for. I would love to linger on your skepticism around autoregressive LLMs. One way I would like to test that skepticism is: everything you say makes a lot of sense, but if I apply everything you said today, and in general, to, I don't know, ten years ago, maybe a little bit less, let's say three years ago, I wouldn't have been able to predict the success of LLMs. So does it make sense to you that autoregressive LLMs are able to be so damn good? Yes. Can
you explain your intuition? Because if I were to take your wisdom and intuition at face value, I would say there's no way autoregressive LLMs, one token at a time, would be able to do the kinds of things they're doing. No, there's one thing that autoregressive LLMs, or LLMs in general, not just the autoregressive ones, but including the BERT-style bidirectional ones, are exploiting, and it's self-supervised learning. And I've been a very strong advocate of self-supervised learning for many years. So those things are an incredibly impressive demonstration that self-supervised learning actually works. The idea didn't start with BERT, but it was really a good demonstration of this. The idea that you take a piece of text, you corrupt it, and then you train some gigantic neural net to reconstruct the parts that are missing, that has produced an enormous amount of benefits. It allowed us to create systems that understand language, systems that can translate hundreds of languages in any direction, systems that are multilingual, a single system that can be trained to understand hundreds of languages and translate in any direction, and produce summaries, and then answer questions and produce text. And then there's a special case of it, which is the autoregressive trick, where you constrain the system to not elaborate a representation of the text from looking at the entire text, but to only predict a word from the words that come before. You do this by constraining the architecture of the network, and that's how you can build an autoregressive LLM from it. So there was a surprise many years ago with what's called decoder-only LLMs, systems of this type that are just trying to produce words from the previous ones, and the fact that when you scale them up and train them on a lot of data and make them really big, they tend to really understand more about language.
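The "autoregressive trick", predicting each word only from the words that come before it, can be made concrete with the smallest possible language model: a bigram counter. This is just a toy to show the direction of prediction; a real decoder-only LLM enforces the same constraint with a causal attention mask over a deep transformer, and the corpus here is an invented example.

```python
from collections import Counter, defaultdict

# A tiny "training set"; real models see ~10^13 tokens.
corpus = "the cat sat on the mat the cat ran".split()

# Count next-word statistics: each word is predicted only from its predecessor,
# never from words that come after it -- the autoregressive constraint.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    """Most likely next word given only the preceding word."""
    return counts[word].most_common(1)[0][0]

def generate(start, n):
    """Produce text one token at a time, each conditioned on what came before."""
    out = [start]
    for _ in range(n):
        out.append(predict_next(out[-1]))
    return out
```

The model can only regurgitate continuations its counts have seen, which makes the earlier point about training-set templates visible at miniature scale.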
That was kind of a surprise, and that surprise occurred quite a while back, with work from Google, Meta, OpenAI, etc., going back to the GPT kind of work, generative pre-trained transformers. Do you mean like GPT-2? Like, is there a certain place where you start to realize scaling might actually keep giving us an emergent benefit? Yeah, I mean, there was work from various places, but if you want to place it in the GPT timeline, that would be around GPT-2, yeah. Well, just because you said it, you're so charismatic, and you said so many words, but self-supervised learning, yes. But again, the same intuition you're applying to saying that autoregressive LLMs cannot have a deep understanding of the world: if we just apply that same intuition, does it make sense to you that they're able to form enough of a representation of the world to be damn convincing, essentially passing the original Turing test with flying colors? Well, we're fooled by their fluency. We just assume that if a system is fluent in manipulating language, then it has all the characteristics of human intelligence. But that impression is false. We're really fooled by it. What do you think Alan Turing would say, without understanding anything, just hanging out with it? Alan Turing would decide that the Turing test is a really bad test. Okay. This is what the AI community decided many years ago, that the Turing test was a really bad test of intelligence. What would Hans Moravec say about the large language models? Hans Moravec would say that Moravec's paradox still applies. Okay, okay. You don't think he would be really impressed? No, of course everybody would be impressed. But it's not a question of being impressed or not; it's a question of knowing what the limits of those systems are.
There again, they are impressive; they can do a lot of useful tasks