Is V-JEPA the End of the LLM Era? Yann LeCun's New Vision for AI
kjsw2JTn7jY • 2026-01-01
Transcript preview
Okay, so let's just jump right in. For the last few years, the world of AI has been all about one thing, right? The large language model. But today, we are looking at something that asks a pretty wild question: what if that entire approach, the very foundation of models like ChatGPT and Gemini, is actually a dead end? Meta's chief AI scientist, the Turing Award winner Yann LeCun, just co-authored a paper that feels less like a small update and more like a quiet revolution. We're not just talking about a better model here. We're talking about a fundamentally different way to build intelligence itself, a way that cares more about understanding the world than it does about just predicting the next word. And honestly, this could change everything we thought we knew about the path to AGI.

And that really gets us to the heart of this whole deep dive. Are we standing on the edge of a massive shift? I mean, LLMs have given us some incredible, almost magical stuff. They can write code, draft legal documents, even whip up poetry. But they also have huge, fundamental weaknesses. They hallucinate. They have zero common sense. And they just don't have a real grasp of the physical world. In a way, they're brilliant mimics. So what if the future isn't about making the mimicry more perfect? What if it's about building something that doesn't have to mimic at all, because it actually understands? What if there's a new kind of AI coming that's not just an upgrade but a completely different road altogether?

This one quote is the bedrock of everything we're about to talk about. For years, while the whole world was obsessed with how well LLMs could talk, Yann LeCun has been hammering this exact point. His argument is that we've become fascinated with the exhaust fumes, the language, while completely ignoring the engine: a true internal model of how reality actually works. You know, think about how a baby learns. A baby doesn't learn about gravity by reading a textbook. It learns by dropping its spoon off the high chair a hundred times, watching what happens, and building an intuitive, non-verbal model of physics. The language to describe it comes way, way later. LeCun says that for AI to take that next big leap, it has to do the same thing: understand first, talk second.

So here's how we're going to break this down. First, we'll frame this new thing as a direct challenge to the LLM throne. Then we're going to spend some real time on the key difference between thinking and generating. After that, we'll go inside the mind of this new model, V-JEPA. We'll look at the actual data to see if it really is smaller, faster, and smarter. Then we'll zoom out to the big picture, the grand vision of a world model, because that's the real end goal here. And finally, just to keep it real, we'll look at some of the critiques and talk about where this tech is at today.

All right, let's kick it off. Section one: it's time to officially meet the challenger in this story, V-JEPA. And it's so important to get this right. This is not just another horse in the LLM race. This is a model built on a whole different philosophy, running in a completely different race. So yeah, the name is a total mouthful: Vision-Language Joint Embedding Predictive Architecture. But let's just unpack it real quick. Vision-language: okay, it connects what it sees with words. Joint embedding: that just means it learns to put images and text in the same idea space. Predictive architecture: that's about how it learns.
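To make that last phrase a bit more concrete, here is a minimal, hypothetical sketch of the joint-embedding-predictive training idea: predict the embedding of the hidden part of an input from the embedding of the visible part, so the loss lives in meaning space rather than pixel space. Every module, name, and size below is invented for illustration; this is not the paper's code.

```python
# Minimal sketch of a JEPA-style training step (illustrative, not Meta's code).
# Idea: instead of reconstructing pixels, predict the *embedding* of the
# hidden part of the input from the embedding of the visible part.
import torch
import torch.nn as nn

DIM = 256  # embedding size; arbitrary for this sketch

context_encoder = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))
target_encoder = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))
predictor = nn.Linear(DIM, DIM)  # maps context embedding -> predicted target embedding

def jepa_step(visible, hidden):
    """One training step. `visible` and `hidden` stand in for features of the
    two halves of an input (e.g. unmasked vs. masked video patches)."""
    ctx = context_encoder(visible)
    with torch.no_grad():               # target embedding is held fixed here
        tgt = target_encoder(hidden)    # (real systems often use an EMA copy)
    pred = predictor(ctx)
    # The loss lives in embedding space, not pixel space: nothing is generated.
    return ((pred - tgt) ** 2).mean()

loss = jepa_step(torch.randn(8, DIM), torch.randn(8, DIM))
loss.backward()
```

The design choice to notice is the loss: at no point does the model decode anything back into pixels or words during training.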
But honestly, the most important words here aren't even in the name. They're "non-generative." That is the key that unlocks this whole thing. See, unlike a model that's trained to just guess the next pixel or the next word, V-JEPA is trained to predict a more abstract idea of the content. Its job is to build a rich internal concept, a real understanding of what it's seeing. The words? They're just a label you can stick on that understanding.

Okay, on to section two. And this is really the core of it all. To get what V-JEPA is doing, we have to really dig into this fundamental split in AI philosophy: the difference between generating an answer and actually forming an understanding. This is the paradigm shift we're talking about. Now, what's so wild here is the process itself. Generative AI literally has to talk to think. Imagine you ask it a complex question. It predicts the most likely first word, let's say "the." Then, based on your question and that word "the," it predicts the next most likely word, maybe "answer." It just keeps going like this, word by word, token by token, until it decides it's done. It's kind of like building a bridge one plank at a time without seeing the other side of the canyon. It just trusts that its rules for placing the next plank will get it there. It doesn't actually know the full answer until it's finished saying it.

V-JEPA's approach is completely different. It looks at a whole video or image and predicts a single, holistic meaning vector. Think of it like a coordinate in a giant multi-dimensional thought space. That one vector is the understanding. It has all the rich info about the scene packed into it. The model thinks first, in silence, and then translating that thought into human language is a totally separate, optional step.

And this analogy just nails the difference. Using a generative AI is like brainstorming with someone who thinks out loud. They're exploring the idea as they talk, and sometimes they go down a weird path and say something that makes no sense. That's what we call a hallucination. It's discovery through speaking. But talking to a non-generative model is more like asking an expert for their opinion. The expert has already seen the situation, processed everything internally, and come to a stable conclusion. They aren't figuring it out on the fly. They already know. They're just waiting for you to ask for the summary. The confidence, the stability, the whole vibe, it's worlds apart. One is a stream of consciousness. The other is a settled conclusion.

So here's the bottom line. This isn't some minor technical tweak. It is a fundamental shift in how AI reasons. For years, we've been building systems that reason in tokens. Their whole world is made of little bits of language, and their thought is just the statistical connection between those bits. The JEPA approach suggests a future where AI reasons in meaning, in a deeper, more abstract space. In this new world, language isn't the thought itself. It becomes what it is for us: a user interface for a much deeper, non-linguistic understanding of the world.
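If those two styles of "thinking" still feel abstract, here is a toy contrast in code. Both models are random stand-ins; the point is purely the control flow: a generative model only has its answer once it has finished emitting it, token by token, while the embedding-predictive model produces one vector in a single pass, and decoding it into words is a separate step.

```python
# Toy contrast between the two modes of "thinking" described above.
# Both "models" are random stubs; only the control flow matters here.
import random

VOCAB = ["the", "answer", "is", "simple", "<eos>"]

def next_token_stub(prompt, so_far):
    """Stand-in for an LLM's next-token prediction."""
    return random.choice(VOCAB)

def generative_answer(prompt):
    # The generative loop: the model only "knows" its answer once it has
    # finished saying it, one token at a time.
    tokens = []
    while True:
        tok = next_token_stub(prompt, tokens)
        if tok == "<eos>" or len(tokens) > 20:
            break
        tokens.append(tok)
    return " ".join(tokens)

def embedding_answer_stub(video):
    # The JEPA-style alternative: one pass yields a single meaning vector
    # (the "thought"); turning it into words is a separate, optional step.
    meaning_vector = [random.random() for _ in range(8)]
    return meaning_vector

print(generative_answer("what is happening?"))
print(embedding_answer_stub("clip.mp4"))
```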
Okay. So, what does this new way of thinking in meaning vectors actually look like in practice? Well, in section three, we're going to try to peek inside the mind of V-JEPA as it watches a video, and we'll see how its ability to track meaning over time gives it a way more stable and coherent view of the world.

So first, let's look at the old way of doing things. A standard reactive vision model watches a video like a person with extreme short-term memory loss. It looks at frame one. Its pattern matchers go off and it shouts "hand." Then the next frame comes. It totally forgets the first one and shouts "bottle." It has no context, no memory of what happened a tenth of a second ago. Its output is just this jumpy, chaotic stream of guesses. It's not understanding an action. It's just reacting to a series of snapshots. This is why those older systems were so easy to fool, and why their descriptions felt so random. Here's a perfect analogy for it: the old model is like a cheap CCTV motion detector that just yells out the name of whatever object it thinks it sees every time a pixel changes. It's just noise.

V-JEPA, on the other hand, is built to act more like a person. When you watch a short video clip, you don't narrate every single millisecond. You don't say "hand moving, fingers extending, cylinder approaching." You just watch patiently. You put the information together over a few seconds, and then you come to a clear, high-level conclusion: ah, okay, he's picking something up. That ability to wait, watch, and synthesize is the key to moving from just seeing to actually understanding.

So how does it actually do this? Well, the model doesn't just spit out a final label. Internally, you can kind of picture its thought process as a cloud of possibilities. When a new action starts, it might have a bunch of initial, low-confidence guesses. In the demos, you see these as flickering red dots. Those are the model's first impressions. But as it sees more frames, it gathers more evidence. It sees the hand keep moving. The fingers close around the object. The object starts to lift. And as that evidence piles up, those scattered red dots of possibility start to merge and drift toward a single, stable point in that meaning space. Once its confidence is high enough, it locks in. That's the blue dot. The blue dot is a stabilized understanding, the moment the AI says, "Okay, I'm now pretty sure the action picking up a canister just happened." This gives it a real sense of time: of a before, a during, and an after. There's a rough sketch of this stabilization idea right after this section.

And hey, if you're geeking out over this breakdown of how these AI models work, and you think the shift from just pattern matching to real understanding is as cool as I do, this is exactly the kind of stuff we get into every week. So, to make sure you don't miss our next deep dive into the tech that's literally shaping our world, take a second and hit that subscribe button. It seriously helps us out, and it keeps you ahead of the curve.
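As promised, here is one hedged way to picture that red-dots-to-blue-dot behavior: average noisy per-frame guesses into a running belief, and only "lock in" once the belief stops moving. The frame encoder is a stub and the threshold is invented; nothing here comes from the paper, it just mirrors the evidence-accumulation story above.

```python
# Sketch of evidence accumulation over frames (illustrative assumptions only).
import numpy as np

def embed_frame_stub(frame):
    """Stand-in for a per-frame encoder: a noisy view of the frame's meaning."""
    return frame + np.random.normal(scale=0.3, size=frame.shape)

def watch(frames, lock_threshold=0.05):
    """Accumulate per-frame guesses into a running belief; commit once stable."""
    total, prev_belief, belief = None, None, None
    for n, frame in enumerate(frames, start=1):
        z = embed_frame_stub(frame)                  # a "red dot": one noisy guess
        total = z if total is None else total + z
        belief = total / n                           # running mean of all evidence
        if prev_belief is not None:
            drift = np.linalg.norm(belief - prev_belief)
            if drift < lock_threshold:               # belief has stopped moving
                return belief, n                     # the "blue dot": locked in
        prev_belief = belief
    return belief, None                              # never stabilized

true_meaning = np.ones(8)                # pretend every frame depicts one action
frames = [true_meaning] * 50
belief, locked_at = watch(frames)
print("locked in at frame:", locked_at)
```

Early on, each new frame moves the belief a lot (the flickering dots); as evidence piles up, each frame moves it less and less, until the drift falls under the threshold and the model commits.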
All right, let's get into section four. Now, everything we've talked about so far is a really cool theory, but in engineering and machine learning, theory is cheap. What matters are the results. So does this elegant, more humanlike way of understanding actually perform any better? Let's look at the numbers and see if it really is smaller, faster, and smarter. This is the million-dollar question, isn't it? A different way of doing things is interesting for researchers, but for the rest of us, what really matters is performance. Does thinking in meaning actually give you better, more accurate results than the old way of just thinking in words? Does this elegance actually translate to being more effective?

Well, the paper gives us some pretty clear answers. Let's take a look at the scoreboard here. The paper runs tests comparing V-JEPA to big, powerful models like CLIP. The tasks are called zero-shot or few-shot learning, which is a really important test of an AI's ability to generalize. Basically, can it describe or classify a video of something it's never been explicitly trained on before? And the results are pretty stark. On things like video captioning and classification, the older models learn really slowly. V-JEPA, on the other hand, just pulls way ahead, learning a lot more from the same amount of data and getting to a higher quality of understanding much, much faster. This is pretty strong proof that its internal meaning vectors are a more efficient way to learn about the world than just connecting pixels to words.

Now, here's the part that's really going to blow your mind. Usually, when a new model comes out and crushes the old ones, you expect it to be some giant, energy-guzzling monster. So how big is the V-JEPA model they used in these tests? It's about 1.6 billion parameters. And that number is just incredible, because it gets these better results with roughly half the number of trainable parameters of a lot of the traditional models it's up against. In the world of machine learning, that is the holy grail. It's like building a new car engine that's twice as powerful but gets twice the gas mileage. Getting way better performance from a model that's smaller, more efficient, and cheaper to train and run: that's a huge, huge win. It really suggests that this whole approach isn't just different. It's fundamentally better at getting at the essence of the data.

So this brings us to the really big idea in section five. V-JEPA's crazy efficiency and performance? That's not the end goal. It's a means to an end. This was never just about making better video classifiers. The ultimate goal here is something way more ambitious, something world-changing: creating a true world model. Yann LeCun puts the state of AI in perfect perspective with this quote. We've built AIs that are amazing in the abstract world of language. They can pass the bar exam, which is all about manipulating text. And yet we have completely failed to build AIs that can function in the physical world. A robot that can reliably clear your dinner table without breaking a plate is still science fiction. A truly self-driving car that can learn to drive with the speed and intuition of a teenager? We are not there yet. And the reason for that gap is that these physical tasks need a deep, predictive understanding of cause and effect, of physics, of how the world works. They need common sense, and that's something language models just don't have.

And that, that is the ultimate goal of all of this research. The vision is to create an AI that learns an intuitive model of physics and causality just by watching the world. Not by memorizing equations from a textbook, but by watching thousands of hours of video and learning that things fall down, that liquids spill, that you can't walk through walls. It's about building a model that can predict not the next word in a sentence, but the next few frames of a video. An AI that can mentally play out what happens next is an AI that can plan, reason, and act safely in the real world. This is the missing piece for robotics and for true autonomy.
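That phrase, "mentally play out what happens next," is the whole mechanism, so here is a hedged sketch of it under toy assumptions: a made-up linear latent dynamics function stands in for a learned world model, and a simple random-shooting planner imagines rollouts in that latent space before picking an action. Nothing here is from Meta's work; it just shows why a predictive latent model makes planning possible at all.

```python
# Sketch: planning by imagination in a latent world model (toy assumptions).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8)) * 0.1        # stand-in latent dynamics
B = rng.normal(size=(8, 2)) * 0.5        # stand-in effect of actions

def predict_next(state, action):
    """World-model step in latent space: s' = f(s, a). Learned, in practice."""
    return state + A @ state + B @ action

def plan(state, goal, horizon=5, candidates=256):
    """Random-shooting planner: imagine rollouts, keep the best first action."""
    best_cost, best_action = np.inf, None
    for _ in range(candidates):
        actions = rng.normal(size=(horizon, 2))
        s = state
        for a in actions:
            s = predict_next(s, a)        # mental simulation, no real-world step
        cost = np.linalg.norm(s - goal)   # how far the imagined future is from the goal
        if cost < best_cost:
            best_cost, best_action = cost, actions[0]
    return best_action

state, goal = np.zeros(8), np.ones(8)
print("first action to take:", plan(state, goal))
```

The point of the sketch: every candidate future is evaluated purely inside the model's head, which is exactly what a language-only system has no machinery for.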
This quote from Sonia Joseph, one of the Meta AI researchers, really gets to why this is so hard and why the JEPA approach is so promising. A useful world model doesn't need to be a perfect physics simulator on a supercomputer. I mean, it's impossible to simulate every atom in a room just to predict if a cup will fall. We humans don't do that. We work with a simplified, intuitive model of physics. We get concepts like gravity and momentum at just the right level of abstraction to make good predictions. The hope is that by training JEPAs to predict abstract ideas instead of raw pixels, they can also learn to find this efficient, abstract level of understanding, capturing the important parts of physics without getting lost in the details.

But let's bring it back down to earth for our last section. As exciting as all this sounds, V-JEPA is not a magic wand. It is not a finished product, and it definitely has its flaws. To really get the full picture, we need to look at the criticisms and understand where this model is at today on the long road to smarter AI. So, when Meta put out those demo videos, a really fair criticism started popping up on places like Reddit. People would pause the videos and point out that, hey, the real-time text descriptions were often just wrong or made no sense. And that's a totally valid observation of how it performs right now. But focusing only on that kind of misses the whole point of the research. The rebuttal here is super important. This paper is not a product launch. It's a proof of concept. The goal wasn't to release a model that's 100% accurate on day one. The goal was to show that a non-generative, predictive approach can learn more efficiently and build better representations of the world than the other guys. It's about proving that this direction is a more promising path for the future, even if this first step is a little wobbly.

And bringing you these balanced perspectives, showing you the incredible, game-changing potential while also being honest about the real-world limitations: that's what we're all about here. We think the smartest take comes from understanding both the hype and the reality. If you appreciate that kind of nuance when we talk about tech, hitting that subscribe button is the absolute best way to support what we do and make sure you get the full, honest picture on these complex topics.

So, we end on this final, huge question: is V-JEPA, and this whole JEPA philosophy, a turning point in the story of AI? Is the best path to smarter, more robust, common-sense AI not through building even bigger language models that are better at faking human text, but instead through building smaller, more efficient models that get better at developing a genuine, predictive understanding of the world? This research argues that the answer is yes. It says we need to stop teaching AI to be clever talkers and start teaching them to be keen observers of reality. And even if this first step is flawed, it might just be the most important step taken in years. It points the entire field toward a new, and maybe, just maybe, a much better destination. The real question is: are they right? What do you think?