Train LLMs for $5? DeepSeek’s mHC Breakthrough & The Blueberry 88M Project
9EEUbjf7Oig • 2026-01-08
Welcome to the explainer. Today we're
getting into the weeds on a totally new
way to build large language models. And
this whole thing is inspired by a really
ambitious mission to create a world-class, top-tier AI that is completely
open-source for everyone. So here's the
plan. We're going to start with the
quest that kicked this whole thing off.
Then we'll get a handle on the basic
tech that powers pretty much every LLM
out there. After that, we'll uncover a
brilliant, but as you'll see, deeply
flawed new idea. Then comes the really
cool part, the elegant mathematical fix.
We'll see how they engineered it to work
at a massive scale and then we'll zoom
out and look at what this all means for
the future of how we build AI. Okay, so
this all starts with the Open Superintelligence Lab. Now, these folks have
one single incredibly audacious goal.
Build one of the world's top 10 large
language models and then just give it
away. Keep it fully open. You know, this
isn't just about making another chatbot.
This is about democratizing the absolute
bleeding edge of artificial
intelligence. And trust me, they are not
messing around. Just look at their
public road map. It is an all-out sprint
to catch up to and then blow past the
biggest models out there with the goal
of hitting the top 10 by the end of
2027. Now, that raises a huge question,
right? What kind of insane radical new
technology do you need to even try
something like that? You can't just do
what everyone else is doing. You need a
fundamental edge. And that is where our
story really gets interesting. So to
really get what's so revolutionary here,
we've got to go back to basics for a
second. We need to talk about the
fundamental plumbing that's inside
almost every single large neural network
today. It's called the residual
connection. And honestly, it completely
changed the game for training
ridiculously deep networks. Here's the
best way to think about it. Picture some
data, like a word from a sentence, going
into a layer of the network to get
processed. While that's happening, a
perfect untouched copy of that original
word takes an express lane right around
the processing block. At the other end,
the new processed version and the
original version just get added
together. Now, this is absolutely
critical. It means that even if a layer
learns nothing useful, the model doesn't
get dumber. The original signal just
passes straight through. It's the
ultimate safety net, and it's what lets
us build these models that are hundreds
of layers deep without them just falling apart.
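A minimal sketch of that idea in Python, purely illustrative rather than anything from the video, where layer stands in for any attention or MLP sub-block:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                               # toy hidden size
W = rng.normal(size=(d, d)) * 0.1   # weights for the stand-in layer

def layer(x):
    # stand-in for an attention or MLP block
    return np.tanh(x @ W)

def residual_block(x):
    # the "express lane": the untouched input is added back onto the processed
    # output, so even a layer that learns nothing useful can't erase the signal
    return x + layer(x)

x = rng.normal(size=(d,))
print(residual_block(x).shape)  # (8,)
```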
And that leads to a really important distinction we have to make.
You know, most of the progress you hear
about things like new attention
mechanisms or a mixture of experts,
that's all in what we call micro design.
It's like inventing a better fuel
injector, a more efficient part for the
engine. But the story we're telling
today, it's all about macro design. It's
a radical change to the car's entire
chassis. We're not just tuning up the
engine. We are fundamentally rethinking
how information flows through the entire
system. So with this focus on the big
picture on macro design, some
researchers at ByteDance came up with a
wild idea back in 2025. They called it
hyper connections or HC. They looked at
that single express lane we just talked
about and they asked a really simple
question. Why just one? Why not build a
multi-lane superhighway? Right? The core
idea is brilliant. You take the data for
a token and you expand it out into let's
say four parallel streams. The hope is
that each stream might specialize. You
know, maybe stream one gets really good
at grammar. Stream two holds the
long-range context of the conversation.
Stream three focuses on math. Now,
here's the genius part. Right before you
get to a really expensive part of the
model, like the attention layer, you use
a smart function to squish all four
streams down into one. You do all the
heavy lifting on that single condensed
stream and then you expand it back out
into four. So, you get the information
capacity of a four-lane highway, but
with the computational traffic of a
single-lane road. It's super clever.
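A rough sketch of that squish-and-expand pattern in Python, a simplified illustration rather than ByteDance's actual implementation; the mixing weights w_in and w_out are stand-ins for the learned hyper-connection weights, and block stands in for an attention or MLP layer:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 4                            # hidden size, number of parallel streams
W = rng.normal(size=(d, d)) * 0.1      # weights for the stand-in block

w_in = np.full(n, 1.0 / n)             # how much each stream feeds the block (illustrative)
w_out = np.full(n, 1.0)                # how the block's output is written back to each stream

def block(x):
    # stand-in for the expensive part (attention / MLP)
    return np.tanh(x @ W)

def hyper_connection_step(streams):        # streams: (n, d)
    x = w_in @ streams                     # squish four lanes down to one vector
    y = block(x)                           # heavy compute happens once, on the single lane
    return streams + np.outer(w_out, y)    # expand back out into the four lanes

streams = np.tile(rng.normal(size=d), (n, 1))
print(hyper_connection_step(streams).shape)  # (4, 8)
```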
And on the surface, this sounds amazing,
right? It's a fantastic way to cram more
memory and more information into the
model. It's a genuinely clever idea. But
there's a catch. There's a hidden flaw
that you only see when you really,
really scale it up. So, what happens
when you try to build a 100-story skyscraper with this shiny new design?
Well, this happens. It completely
breaks. I mean, it spectacularly fails.
What you're looking at is the training
progress of a massive 27 billion
parameter model using this design. And
for a while, everything looks great.
It's actually learning faster than the
standard models. But then, right around
the 12,000th training step, bam, the
loss just goes through the roof. The
whole learning process becomes totally
unstable and the model's performance
just collapses into absolute chaos. So,
what on earth is going on here? Well,
the problem is that we lost that
beautiful safety net from the original
design. The way the information is
getting mixed between the four streams
at each layer isn't controlled. So the
signals are getting amplified layer
after layer after layer. It's like
you're turning up the volume knob just a
tiny, tiny bit, but you do it a hundred times
in a row. By the time you get to the
top, the signal is absolutely deafening.
This chart shows that the signal gain, which
should be one, has rocketed to 3,000.
It's 3,000 times louder than it should
be. At that point, it's not a signal
anymore. It's just noise. The model is
basically just listening to static.
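A quick back-of-the-envelope makes the compounding concrete; the per-layer number here is purely illustrative, not a figure from the paper:

```python
# a small amplification at every layer, compounded over a deep stack
gain_per_layer = 1.085     # +8.5% per layer (illustrative, not a measured value)
num_layers = 100
print(gain_per_layer ** num_layers)  # ~3.5e3, the same order as the ~3,000x in the chart
```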
This is just a classic large-scale
engineering problem. You have a
brilliant idea on paper that just
completely shatters when it meets the
brutal reality of a truly massive
system. Now, how they fixed it is, in my
opinion, even more brilliant than the
original idea. And if you want to see
how researchers solve these kinds of
huge engineering puzzles, make sure
you're subscribed for more of these
explainers. Okay, so this is where
DeepSeek AI comes into the picture with an
incredibly elegant solution. They call
it manifold-constrained hyper-connections, or mHC. And they didn't throw out the superhighway idea. They knew it was
powerful. They just they added a traffic
controller, a very specific
mathematically perfect traffic
controller. And that traffic controller
has a really fancy name, the doubly
stochastic matrix. Now, I know that
sounds super complicated, but the idea
behind it is actually incredibly simple
and powerful. It's just a grid of
numbers where every single row adds up
to one and every single column also adds
up to one. And that one simple rule
guarantees that when you use it to mix
the information between your streams,
you cannot create or destroy signal
energy. You can only redistribute it.
The exploding signal problem? Gone. It's a perfect fix.
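A tiny sketch of why that works, using a hand-made doubly stochastic matrix and toy numbers rather than anything from the paper; because every column sums to one, the total signal can never grow, no matter how many times you mix:

```python
import numpy as np

# a hand-made doubly stochastic matrix: every row and every column sums to 1
A = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])

signal = np.array([3.0, 0.0, 0.0])   # all the energy starts in one stream
for _ in range(100):                 # mix between streams at 100 consecutive layers
    signal = A @ signal

print(signal.sum())  # 3.0 -- conserved; no 3,000x blow-up, only redistribution
```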
So how do you actually force the model to learn a matrix with
this special property? Well, you use
this beautiful classic algorithm from
the 1960s called Sinkhorn-Knopp. It's
actually pretty simple. You take your
matrix and you force all the rows to add
up to one. Of course, that messes up the
columns. So then you force all the
columns to add up to one, which messes
up the rows again, but a little less
this time. And you just keep doing that
back and forth, back and forth. And
amazingly, it is mathematically
guaranteed to eventually settle on a
perfect doubly stochastic matrix.
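What that back-and-forth looks like in code, as a generic Sinkhorn-Knopp sketch rather than DeepSeek's actual implementation:

```python
import numpy as np

def sinkhorn_knopp(M, iters=50):
    # alternately normalize rows and columns of a positive matrix;
    # the iteration converges toward a doubly stochastic matrix
    M = np.asarray(M, dtype=float)
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)  # force every row to sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # force every column to sum to 1
    return M

rng = np.random.default_rng(2)
A = sinkhorn_knopp(rng.random((4, 4)) + 0.1)  # +0.1 keeps entries strictly positive
print(A.sum(axis=1))  # ~[1. 1. 1. 1.]
print(A.sum(axis=0))  # ~[1. 1. 1. 1.]
```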
Okay, let's make this super concrete. Let's
say we only have two streams. Stream one
has a really strong signal. Let's call
it 100 comma 100. Stream 2 is totally
empty. 0 comma 0. Now we apply our
special mixing matrix. To get the new
stream 1, we take 90% of the old stream
one and 10% from the old empty stream 2.
To get the new stream 2, we do the
opposite. 10% from the strong one, 90%
from the empty one. And look what
happens. The total signal 200 is
perfectly conserved. It's just been
redistributed. It's a perfect stable,
completely controlled leak of
information between the lanes.
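That worked example, checked in a couple of lines of Python:

```python
import numpy as np

A = np.array([[0.9, 0.1],
              [0.1, 0.9]])             # rows and columns each sum to 1

streams = np.array([[100.0, 100.0],    # stream 1: strong signal
                    [  0.0,   0.0]])   # stream 2: empty

mixed = A @ streams
print(mixed)                           # [[90. 90.] [10. 10.]]
print(streams.sum(), mixed.sum())      # 200.0 200.0 -- total signal conserved
```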
Okay, so the math checks out. The theory is
beautiful. We have a stable way to get
all the benefits of this superhighway
without the model blowing up. But, you
know, there's always a but. Theory is
one thing. Making this stuff actually
run efficiently on thousands of GPUs is
a whole other beast. And this new design
introduces a new problem. It's called
the memory wall. All this extra data
from the multiple streams has to be
shuffled around constantly and that can
create a massive traffic jam that slows
down the whole training process. And
this table really shows you why it's
such a big deal. With a standard model,
you're reading and writing a certain
amount of data for every token. With
hyperconnections where you've expanded
that data by four times, look at those
formulas. The amount of data being moved
around just skyrockets. This threatens
to make the model so slow to train that
it's completely impractical, no matter
how clever the math is.
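A crude back-of-the-envelope with made-up sizes, not figures from the paper, just to show that four residual streams mean roughly four times the residual read/write traffic per token, per layer, before any optimization:

```python
# rough residual-stream traffic per token, per layer (made-up example sizes)
d = 4096               # hidden size
n = 4                  # residual streams with hyper-connections
bytes_per_value = 2    # bf16

standard = 2 * d * bytes_per_value        # read + write one residual stream
hyper = 2 * n * d * bytes_per_value       # read + write four residual streams
print(standard, hyper, hyper / standard)  # 16384 65536 4.0
```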
But, and this is where it gets really impressive. The
DeepSeek team aren't just brilliant
theorists. They are worldclass
engineers. They attacked this problem
with everything they had. Doing things
like kernel fusion to reduce memory
trips, recomputing values on the fly
instead of storing them. Just all sorts
of clever low-level optimizations.
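As an illustration of that recompute-instead-of-store idea, here is the generic, off-the-shelf version using PyTorch's activation checkpointing; this is not DeepSeek's fused kernels, just the standard utility that recomputes activations during the backward pass:

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.GELU(),
    torch.nn.Linear(2048, 512),
)

x = torch.randn(4, 512, requires_grad=True)
# activations inside `layer` are not stored; they are recomputed during the
# backward pass, trading a little extra compute for much less memory traffic
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
```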
And the final result? This incredibly
complex new architecture adds only a
6.7% time overhead during training.
That's it. It's an absolute engineering
marvel. Okay, so let's recap. We have a
stable theory. We have some hardcore
engineering that makes it run
efficiently. But the million-dollar
question is still on the table. Does it
actually work any better? Does it make
the model smarter? And the answer,
thankfully, is a resounding yes. This is
the proof in the pudding. On a 27
billion parameter model, mHC doesn't
just beat the standard design, it also
outperforms the original unstable
version even before it blew up. And
what's really telling is where it gets
better. On tasks that require complex
reasoning, like BBH or tough reading
comprehension, it shows really
significant gains. This tells us that
the stable principled way of mixing
information isn't just about preventing
explosions. It's actively leading to a
more intelligent model. And this is
where we loop all the way back to the
beginning, back to the Open Superintelligence Lab. A fundamental
breakthrough like this is exactly what a
team like that needs. It's not some
small incremental tweak. It is a
foundational change to the architecture
that lets them scale better and get more
performance. It's what turns their
ambitious goal from a wild dream into a
plausible engineering reality. The
authors of the paper wrap it all up with
this really powerful thought. For the
past decade, we've basically been
focused on micro design, right? Just
upgrading the engine. mHC is a really
compelling argument that we need to be
paying just as much attention to the
macro design to actually redesigning the
entire chassis of the car. It opens up a
whole new front for innovation. And that
leaves us with a pretty provocative
question to end on. mHC proves that the
basic blueprints we've been using for
years aren't set in stone. We can
fundamentally change how these massive
networks are put together. So, if we can
do that, what other assumptions are we
making that we should challenge? What
totally new crazy looking structures
will we build next? If you enjoyed this
deep dive into the architecture of AI,
make sure you subscribe for more
explainers that break down the complex
science that is shaping our future.