Why the Future of LLMs is 8x Faster & Smarter | Deep Dive into SSMs & Nested Learning
ehNWO8v4CG0 • 2025-12-02
Welcome to the explainer. You know that engine that powers pretty much all of modern AI? Well, it's starting to sputter. Today, we're going to look at that foundational tech, the cracks that are starting to show, and what might, just might, come next. Wow. Okay, so this is a pretty shocking statement, right? It comes from Llion Jones, one of the co-authors of that game-changing 2017 paper, "Attention Is All You Need," you know, the paper that literally invented the transformer and kicked off this entire AI boom. So yeah, let's dive into this. To really get his frustration and what it means for the future of AI, we kind of have to understand the incredible, world-changing impact his invention had in the first place. So let's start with act one of our story: the undisputed dominance of this one architecture born from that 2017 paper, an architecture that just completely took over the entire field. I mean, before the transformer, progress in AI was slow, incremental. But after, it was exponential. This was the breakthrough that made things like large language models (LLMs) even possible. It really became the fundamental building block for, well, everything we think of as modern AI. It's the DNA inside models like GPT-4 and Gemini. It unlocked these weird and powerful emergent capabilities, these unexpected skills that just kind of pop up when the model gets complex enough. And of course, this drove billions in investment and made the transformer the unquestioned default way to build AI. And its success was built on this really simple, powerful idea: the bigger and deeper you build the model, the smarter it gets. But what if that core idea has a hidden, critical flaw? And that brings us to act two, the conflict. This is where the cracks start to appear in the king's armor, and not from the outside but from deep within the math of the model itself.
So the central promise was always that stacking more and more layers, you know, making the model deeper, would make it way more powerful. But some really recent research is flipping that whole idea on its head, showing that after a certain point, deeper actually becomes weaker. A recent paper with the very provocative title "Attention Is Not All You Need" identified a fundamental problem with this deep stacking approach. They call it rank collapse. And honestly, the best way to think of it is like a game of telephone. You know how the first message is perfectly clear, but as it passes from person to person, or in this case from layer to layer, it gets all distorted and simplified? By the time that information travels through a ton of layers, the message gets so garbled that the signal is just lost in the noise. The deepest parts of the network basically stop learning anything new. Their output becomes no better than just a random guess. And check this out. The data actually proves it. Researchers tested how well information paths of different lengths performed. And as you can see, the short paths, crossing just one or two layers, super effective. But look at the long paths, the ones crossing six layers: the accuracy just plummets, down to a level that is barely better than a coin flip. I mean, this is a massive finding. It suggests there's a hard mathematical limit to that bigger-is-better approach. But this technical flaw is only half the story. The other problem, the one that really frustrates Llion Jones, is a human one. Which brings us right back to the inventor, Llion Jones, and why he basically thinks the entire field has lost its way. He argues that this flood of money and talent into AI hasn't sparked creativity; it's actually killed it. The immense pressure from investors demanding quick returns has forced everyone to just double down on the one thing they know works, the transformer. And that's stifled any real fundamental innovation.
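The "game of telephone" effect described here can actually be demonstrated in a few lines of code. The sketch below repeatedly applies a bare self-attention step, with no residual connections, no MLP, and, for simplicity, no value matrix, to a random token matrix, and measures how far the result is from a rank-1 matrix whose rows are all identical. All the names and constants are invented for illustration; this is a toy reproduction of the collapse phenomenon, not the paper's actual experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 32  # number of tokens, hidden dimension

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rank1_residual(X):
    """Relative distance from X to the rank-1 matrix that repeats X's mean row."""
    return np.linalg.norm(X - X.mean(axis=0, keepdims=True)) / np.linalg.norm(X)

X = rng.normal(size=(n, d))                              # random "token" matrix
Wq, Wk = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(2))

history = [rank1_residual(X)]
for layer in range(12):
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    A = softmax(scores)   # row-stochastic attention matrix: each row sums to 1
    X = A @ X             # pure attention step: no skip connection, no MLP
    history.append(rank1_residual(X))

print(f"residual after layer 0:  {history[0]:.3f}")
print(f"residual after layer 12: {history[-1]:.3e}")
```

Because every attention step replaces each token with a convex combination of the others, the differences between rows shrink layer after layer, and the residual collapses toward zero: exactly the "deeper becomes weaker" signal loss the paper describes (the skip connections and MLPs in real transformers are what slow this down).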
And he is not alone in feeling this way. You go online, you look at discussions among researchers, and they're all echoing this exact same thing. You see phrases like "the myopic vision of industry," or that it's become "a race to the bottom" focused on a shinier product, not a smarter model. One developer put it perfectly. He said, "The field feels stuck, just fine-tuning the same 2017 paper like it's the Bible." And this right here illustrates the problem perfectly. The entire focus of the research community isn't on inventing something new; it's on finding more efficient ways to patch the old model. They're trying to fix the engine, not design a whole new one. So, this is the critical question, right? If the king is flawed and the kingdom is afraid of change, where does the next revolution even come from? Well, this leads us to act three: a potential new path forward, one that learns from the mistakes of the past. So, frustrated by all this stagnation, some researchers are now looking for inspiration in the most complex and efficient learning machine we know of: the human brain. Okay, so one of the most promising new ideas is called nested learning. Instead of treating an AI as one giant, single network, it reimagines it as a system of smaller, interconnected modules. And each module learns at a different speed, kind of mimicking how our brains turn short-term experiences into long-term knowledge. And by doing that, it tries to solve a huge problem in AI called catastrophic forgetting. Now, here's a comparison that really shows the difference. Current models are static. Their knowledge is frozen after training. And when they learn a new task, they often forget the old ones. Nested learning, on the other hand, aims to create systems that can learn continuously, consolidating memories and building a real spectrum of short- and long-term knowledge, just like we do.
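To make the "modules learning at different speeds" idea concrete, here is a deliberately tiny sketch: a fast parameter that chases whatever task is currently being observed, and a slow parameter that gradually consolidates what the fast one has learned. The variable names, learning rates, and the whole one-dimensional setup are invented for illustration; this is a caricature of the multi-timescale principle, not the actual nested learning algorithm.

```python
import numpy as np

# Two "modules" learning the same scalar quantity at different timescales.
# (Illustrative sketch only; real nested learning nests full optimizers.)
rng = np.random.default_rng(1)

fast, slow = 0.0, 0.0
FAST_LR = 0.5       # fast module: big updates, tracks the current task
CONSOLIDATE = 0.02  # slow module: slowly absorbs the fast module's state

def train_on(target, steps):
    """Feed noisy observations of a task and update both modules."""
    global fast, slow
    for _ in range(steps):
        sample = target + rng.normal(scale=0.1)  # noisy observation
        fast += FAST_LR * (sample - fast)        # adapts within a few steps
        slow += CONSOLIDATE * (fast - slow)      # consolidates over many steps

train_on(target=1.0, steps=200)    # long exposure to task A
after_a = (fast, slow)
train_on(target=-1.0, steps=10)    # brief exposure to task B
after_b = (fast, slow)

print(f"after task A:            fast={after_a[0]:+.2f}  slow={after_a[1]:+.2f}")
print(f"after 10 steps of task B: fast={after_b[0]:+.2f}  slow={after_b[1]:+.2f}")
```

After the brief switch to task B, the fast parameter has already flipped to the new task, while the slow parameter still carries task A, so the system as a whole hasn't catastrophically forgotten it. A single-speed learner with the fast learning rate would have overwritten task A entirely in those same ten steps.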
And to prove this isn't just some theory, researchers actually went and built a whole new architecture from the ground up based on these ideas. They call it Hope. And the results are incredibly promising. When they tested it on a bunch of common-sense reasoning tasks, the Hope architecture consistently and significantly beat a standard transformer model of about the same size. This is a fundamental shift. It's the difference between building static tools that we have to constantly throw away and replace, and creating truly dynamic systems that can adapt, evolve, and improve all on their own over time. So, what are the key takeaways here? Well, first, the transformer, the king of modern AI, has real mathematical limits. Second, the industry's obsession with it has created a research bottleneck. And third, these new ideas inspired by neuroscience, like nested learning, are showing a potential way out. The era of just making AI bigger might be ending. And the era of making it smarter in totally new ways could be just beginning. This shift from just scaling up to truly scaling smarter could very well be the thing that defines the next decade of artificial intelligence. Thanks for joining us for this explainer.