Transcript
8WqNFDJFXxk • Beyond LLMs: The Rise of World Models and Spatial Intelligence
Kind: captions
Language: en
Okay, so you've heard all the hype about
AGI, right? Artificial general
intelligence. And yeah, the headlines
are all about these giant language
models. But what if I told you that
behind the scenes something way more
fundamental is happening? The biggest AI
labs are all starting to realize
something huge. You can't learn true
intelligence just from text. Nope. It
has to be grounded in the real world. So
today, we're going deep on what might
just be the final critical piece of the
AGI puzzle, world models. This is all
about teaching machines to get physics,
space, and cause and effect, not by
reading a textbook, but by actually
experiencing it. All right, so here's
what we've got on tap for this deep
dive. We're going to start with why AI
has this major grounding problem. Then
we'll nail down what exactly a world
model is. After that, we'll get into the
two competing ideas on how to actually
build one. Then, we'll check out the big
players in the game and what they're up
to. We'll also see how this whole thing
is scaling up from a tiny sandbox to a
full-on crystal ball for the whole
planet. And finally, we'll talk about
the massive data gold rush that's
driving all of this. Okay, let's kick
things off. To really get this, you have
to understand a massive limitation in
the AI we have today. You know, the big
goal for a lot of people is AGI, an AI
that can think, plan, and act in the
world just like a person. But to pull
that off, it needs a deep, almost gut
level understanding of the physical
world. Here's the problem. An AI trained
only on text, like a large language
model, well, it can read the word
gravity. It can even spit out Newton's
laws of motion. But it has absolutely no
idea what it feels like for an apple to
fall from a tree. It's learned the word,
but not the reality behind it. And this
huge disconnect, this gap between
knowing the symbol and understanding the
substance. That's what we call AI's
grounding problem. And this quote from
the AI pioneer Dr. Fei-Fei Li, it just
hits the nail on the head. She says,
"LMs are eloquent but inexperienced,
knowledgeable but ungrounded." I mean,
think about that. They've totally
mastered our language, our art, all our
abstract ideas, but they're just words
in the dark. It's like this. An LLM can
scan 10,000 recipes and write you the
most perfect, mouthwatering description
of a soufflé you've ever read. It knows all
the ingredients, the temperatures, the
chemistry, but it has zero clue how to
actually crack an egg or whisk it just
right or feel the heat coming off an
oven. See, it has all the knowledge but
none of the experience. That's the
grounding problem in a nutshell. This
really gets to the heart of it, right?
LLMs are just masters of symbols, of
words. But when you talk about spatial
reasoning, that intuitive physics we do
every single second without even
thinking about it, they just fall flat.
They're, to use a term from the
internet, total word cells. But they're
not shape rotators. They can guess what
a 3D object is, but they don't truly grok
it. You know, they don't get its
three-dimensionality. Now, think about
how we learn from the second we're born.
We're basically little scientists. As a
toddler, you don't learn about gravity
from some book. Nope. You learn by
dropping your spoon off your high chair
over and over and over again and
watching it fall. You learn about
momentum and friction by taking those
first wobbly steps and falling down. Our
entire intelligence is built on this
foundation of what we see, hear, and
touch. It's that deep physical
understanding of the world that today's
AI is completely missing. So, that leads
us to the billion-dollar question that's
driving pretty much the entire AI
industry right now. How do we fix this?
How do we finally give AI that true
spatial intelligence? And the answer
everyone is chasing from Google and
OpenAI to Nvidia and Runway is the world
model. All right, section two. So, what
exactly is a world model? And I want to
be clear here, this is not just another
piece of marketing jargon. It's a very
specific and incredibly powerful idea
for building the engine of spatial
intelligence that AI so desperately
needs. Okay, at its heart, a world model
is basically an AI's own internal
simulation of the world. Think of it
like its imagination. It's like giving
the AI its own private video game
engine. But, and this is the critical
part, in a game like Grand Theft Auto,
human developers have spent years hand
coding all the physics, right? They
write lines of code that say if a car
hits a wall this fast, it should crumple
like this. A world model does the
complete opposite. It's not given any
rules. It has to figure out the rules
for itself just by watching our world
through tons and tons of video data. It
learns that glass shatters, water
splashes, and balls bounce. Not because
someone programmed it to know that, but
because it's seen it happen millions of
times. Now, building on Dr. Fei-Fei Li's
foundational research, a true world
model has to have three key things going
for it. First, it has to be generative.
This means it can't just recognize the
world. It has to be able to create new,
believable scenes from scratch. And
those scenes have to obey the laws of
physics. So, if it generates a glass
falling off a table, that glass better
fall down, not up. And it needs to
shatter in a realistic way. Second, it
has to be natively multimodal. This is
huge. It needs to seamlessly blend
different types of data, video, audio,
text, even 3D maps just like we do. This
is a total game-changer for robotics. You
need to be able to give a robot a 2D map
and a simple command like go to the
kitchen and have it figure out how to
turn that into physical actions. And
third, it's got to be interactive. A
static 3D picture of a city? That's not
a world model. It needs to be a living
simulation. The model has to understand
how to simulate cars driving around, the
weather changing, and people interacting
with each other in that space.
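Just to make those three properties concrete, here's a rough Python sketch of what a world model's interface could look like. Every class and method name here is hypothetical, invented purely for illustration; it's not any real product's API.

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Observation:
    frame: Any   # the next video frame the agent would see
    audio: Any   # the accompanying sound
    depth: Any   # rough 3D structure of the scene

class WorldModel:
    """Hypothetical interface sketch; names invented for illustration."""

    def generate(self, prompt: str) -> Observation:
        # Generative: create a new, physically plausible scene from scratch.
        raise NotImplementedError

    def condition(self, text: Optional[str] = None, map_2d: Any = None) -> None:
        # Multimodal: ground the simulation in text, 2D maps, video, audio, etc.
        raise NotImplementedError

    def step(self, action: Any) -> Observation:
        # Interactive: apply an action (drive, walk, grab) and simulate what happens next.
        raise NotImplementedError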
And this is the simple yet brilliant idea at
the core of how these models actually
learn. The comparison to LLMs is
perfect. A model like GPT-4 got so good
at language by doing one simple thing
over and over. Predict the next word.
Well, world models do the exact same
thing, but for reality itself. By
forcing the model to get insanely good
at predicting what's going to happen in
the next few frames of a video, it has
no choice but to learn the underlying
physics of that reality. It has to learn
that if something goes behind a pillar,
it doesn't just disappear. That's object
permanence. It has to learn that if you
throw a ball, it's going to follow a
certain arc. That's an intuitive grasp
of gravity. All these complex physical
ideas just emerge, not because they were
programmed in, but because the model had
to learn them to get good at its one
job: predict what happens next.
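If you want to picture that objective as code, here's a minimal PyTorch-style sketch, assuming a toy stand-in model and tiny fake frames; this isn't any real lab's architecture, just the "predict the next frame and penalize the error" idea.

import torch
import torch.nn as nn

# Toy stand-in: maps 8 past frames (16x16 RGB) to one predicted next frame.
model = nn.Sequential(nn.Flatten(), nn.Linear(8 * 16 * 16 * 3, 16 * 16 * 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def training_step(clip):
    # clip: tensor of shape (num_frames, height, width, channels)
    past, target = clip[:-1], clip[-1]            # context frames and the frame to predict
    prediction = model(past.unsqueeze(0)).view(target.shape)
    loss = loss_fn(prediction, target)            # error on "what happens next"
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Fake clip just to show the shapes; real systems train on enormous video corpora.
dummy_clip = torch.rand(9, 16, 16, 3)
print(training_step(dummy_clip))

The point is that the only supervision signal is the next frame itself, so any physics the model learns has to emerge from getting that prediction right.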
Okay, so this is where it gets really
interesting. While pretty much everyone
agrees that world models are the future,
there's a huge and fascinating debate on
the best way to actually build one. And
this isn't just some nerdy technical
squabble. It's a deep philosophical
split about what a simulation even is.
You can basically break the whole field
into two camps. So on one side, you've
got the explicit 3D folks. Their goal is
to take video or images and create a
mathematically perfect tangible 3D
asset. You know, like a polygon mesh
you'd see in a video game or something
newer like a Gaussian splat, which is
basically a cloud of smart colored dots
in space. The point is the final product
is a real 3D object you can drop into
pro tools like Blender or Nvidia's
Omniverse.
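To give a feel for what "explicit 3D" means in data terms, here's a tiny hypothetical Python sketch of the kind of record a Gaussian-splat scene boils down to. The field names are illustrative, not any particular tool's file format.

from dataclasses import dataclass

@dataclass
class GaussianSplat:
    # One "smart colored dot": an explicit, editable piece of 3D geometry.
    position: tuple   # (x, y, z) center in world space
    scale: tuple      # how stretched the blob is along each axis
    rotation: tuple   # orientation as a quaternion (w, x, y, z)
    color: tuple      # RGB
    opacity: float    # how solid vs. see-through the blob is

# A scene is just millions of these; because every blob has real coordinates,
# you can hand the whole thing to tools like Blender or Omniverse and move a camera around it.
scene = [GaussianSplat((0.0, 1.0, 2.0), (0.1, 0.1, 0.1), (1, 0, 0, 0), (0.8, 0.2, 0.2), 0.9)]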
Then you have the pixel generation
camp, and their philosophy is
completely different. They basically
argue, why bother with all that hard
work of creating a perfect 3D model? To
the AI, it's all just pixels anyway. So,
their goal is to skip that 3D step
completely and just focus on generating
the next believable frame of a video.
It's really the difference between being
a digital sculptor and being a neural
network that's acting like a real-time
movie director. So, let's dig into where
this explicit 3D approach really shines.
Its biggest advantage is precision and
control. For instance, if you're Nvidia
trying to train a robot to pick up an
apple, you need to simulate that scene
with absolute accuracy. The robot has to
know the exact 3D coordinates, the
weight, the friction of that apple. A
precise 3D model gives you that. Same
thing in Hollywood for virtual
production where you've got a real actor
standing in front of a giant LED screen.
That digital background has to have a
correct and stable 3D geometry so the
camera can move around it naturally.
This whole approach is about plugging
into existing professional pipelines
that need that kind of mathematical
perfection. Okay. Now, for the pixel
generation camp, their superpower is
scalability. And that scalability comes
from the data they can use. See, there's
only so much clean, labeled 3D data out
there, but there are trillions of hours
of messy, unstructured video on the
internet. And this approach can learn
from all of it. Take training an AI for
a retail simulation. With the 3D
approach, you'd have to pay expensive 3D
artists to painstakingly model and
animate an angry customer character.
With the pixel approach, you just train
a model on thousands of hours of real
footage of customer freakouts and
generate that scenario whenever you
want. This makes the simulation
infinitely more varied, way cheaper, and
it can capture all those subtle human
behaviors that are almost impossible to
animate by hand. By the way, if you're
finding this breakdown helpful, make
sure you hit subscribe so you don't miss
our future deep dives. All right,
keeping those two big ideas in mind, we
can now start to map out who's doing
what. Let's take a look at the key
players in this race to simulate reality
because the path they've chosen really
tells you everything about what they're
trying to achieve. Luma AI is the
perfect case study here. They first got
famous in the explicit 3D world with
their amazing work on something called
neural radiance fields, or NeRFs. A NeRF
is basically a smart neural network that
can create a full 3D scene from just a
handful of 2D pictures. But even though
they were killing it, Luma recently made
a huge pivot. They now believe that
direct video generation, the pixel
approach, is the far more scalable path
to AGI. And their logic is simple but
powerful. The amount of 3D data in the
world is a tiny puddle, but the amount
of video data is a massive ocean. For a
leader in the 3D space to make a move
like that, that's a huge signal to the
rest of the industry about where things
are probably headed. Then you've got
Runway ML, who are definitely one of the
leaders in the pixel generation camp.
Their grand vision is straight out of
Star Trek. They want to give every
creator a personal holodeck. They're
building what's called an autoregressive
model, which sounds
complicated, but it just means the model
generates one frame of video. Then it
looks at the frame it just made and uses
that to generate the next one and the
one after that. It creates this
continuous interactive experience. The
analogy is perfect. It's like streaming
a video game, but there's no game engine
like Unreal running on the server. It's
just a giant neural network dreaming up
your reality in real time, all based on
your prompts. Their focus is all about
getting this into the hands of creators
and entertainers.
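As a rough sketch of what autoregressive generation means here: the loop below makes one frame, then feeds its own output back in as context for the next. The generate_next_frame function is a made-up stand-in, not Runway's actual model.

def generate_next_frame(context_frames, prompt):
    # Stand-in for the real model: takes everything generated so far plus the
    # user's prompt and returns one new frame.
    return f"frame conditioned on {len(context_frames)} previous frames + '{prompt}'"

def stream_world(prompt, num_frames=5):
    frames = []
    for _ in range(num_frames):
        # Each new frame is predicted from the frames the model itself just made,
        # which is what keeps the "world" consistent from one moment to the next.
        frames.append(generate_next_frame(frames, prompt))
    return frames

for f in stream_world("a rainy street in Tokyo at night"):
    print(f)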
And then there is Tesla, which is a really fascinating
example of a hybrid approach that's
solving a mission-critical problem today.
For their self-driving simulations, they
start with a solid explicit 3D
foundation. They use techniques like
Gaussian splatting to build a precise 3D
model of a road from their car's camera
footage. That gives them a stable,
geometrically correct stage, but then
they use generative pixel level AI to be
the director of the scene. They can take
a real video of a drive on a sunny day
and tell the AI, "Okay, run it again,
but make it a blizzard." Or they can add
a virtual pedestrian that suddenly steps
into the road, or even simulate
dangerous crashes you could never stage
in real life. It really gives them the
best of both worlds, the precision of 3D
and the infinite variety of generative
AI. So, let's just take a beat and sum
up this whole competitive landscape.
You've got players like Runway ML and
Google with its Genie model, and they
are all in on pixel generation, focusing
on content and interactive worlds. Then
you have Luma, who made that famous
pivot away from 3D to pixels, all with
the grand goal of building AGI. In the
hybrid corner, you have Tesla taking a
super practical approach using both
methods to solve the incredibly hard
problem of self-driving cars. And then
you have some really interesting
research coming out of places like
ByteDance, where they're trying to teach a
model to watch a 2D video of a person
walking and figure out the 3D path of
their arms and legs. I mean, think about
how valuable that could be for robotics.
Okay, now let's zoom out way out. So
far, we've been talking about
simulations on a pretty small scale. You
know, a single kitchen, a city block.
But the real ambition here, it doesn't
stop at the city limits. The ultimate
vision is to scale these world models up
to simulate the entire planet. And this
isn't science fiction. This escalation
of scale is happening as we speak. Right
now, a lot of the cutting edge stuff is
at the micro level, simulating one robot
in one room. The near future, which
companies like Tesla are already deep
into, is scaling up to complex systems
like an entire city for self-driving
cars. The next logical step, which we're
seeing from giants like Google with
projects like AlphaEarth, is to go
beyond just man-made stuff and simulate
global natural systems. And that leads
to the ultimate goal, the holy grail of
this whole field, to go from just
simulation to actual prediction. To
build a model so good it can not only
copy our world but forecast its future.
And Nvidia's Earth-2 project is maybe
the most mind-blowing example of this
ambition. They are literally building a
digital twin of planet Earth. The idea
is to create a planetary scale world
model and just pour in every bit of data
we have: satellite images, weather
station data, ocean sensor readings,
everything to create a true crystal ball
for our climate. Just imagine being able
to predict a hurricane's path with
perfect accuracy weeks in advance, or
modeling the exact impact of a new
policy on deforestation in the Amazon,
or helping farmers predict crop yields
as the climate changes. This takes
simulation way beyond robotics and into
the realm of planetary management. The
implications for science and the economy
are just staggering. So, okay, the
vision is clear. The ambition is huge
and the computing power from companies
like Nvidia is getting more powerful by
the day. So what's the holdup? What's
the final bottleneck? What is the one
thing stopping us from building these
incredible digital realities tomorrow?
Well, the answer is simple and it's the
oldest problem in machine learning.
You've got to get the right data. And
this chart, it shows the problem
perfectly. For those big planetary scale
models, we are practically drowning in
data from thousands of satellites.
Great. For the general pixel generation
models, we have the huge but messy ocean
of video on places like YouTube. Okay.
But when you get to the data you need to
train an AI that has a body, like a
humanoid robot that needs to walk around
our world, the well is almost bone dry.
We are desperately short on what's
called egocentric data, or first-person
point-of-view data. And this data is so
valuable because it doesn't just show
what the world looks like. It shows how
an agent's own actions, the movement of
their hands, their head, changes what
they see. It is the data of interaction,
and without it, a robot can never really
learn how to act. This shortage of data
has basically started a modern-day gold
rush. You have these massive corporate
projects like Meta's Project Aria, which
plans to use AR glasses to capture
first-person data from millions of
people all over the world. Then you've
got scrappy startups literally paying
people to wear cameras on their heads
and just do everyday stuff like cook
dinner or clean the house all just to
get this priceless training data. The
whole industry knows that a giant
diverse data set of first-person
interaction is the key to unlocking
embodied AI and the race is on to see
who can build the biggest and the best.
And that brings us to our final and
honestly a pretty profound question
about where all this is headed. The team
at Luma AI put it perfectly in a recent
blog post. Reality is the data set for
AGI. I mean, just think about that. If
that's true, if the only way to build
true artificial intelligence is by
building the most accurate simulation of
our world, then the defining question of
the next decade is this. Who is going to
own the best copy of reality? The game
is no longer just about who has the
smartest algorithm or the most computer
chips. It's a race to capture, own, and
train on the most complete data set of
our shared physical world. I really hope
this deep dive gave you a much clearer
picture of this incredible race to build
world models. If you want to keep up
with the technologies that are literally
building our future, you know what to
do. Hit that subscribe button for more
explainers that break down the cutting
edge of tech. Thanks for watching.