The Strange Math That Predicts (Almost) Anything
KZeIEiBrT_w • 2025-07-25
How many times do you need to shuffle a
deck of cards to make them truly random?
How much uranium does it take to build a
nuclear bomb?
How can you predict the next word in a
sentence? And how does Google know which
page you're actually searching for?
Well, the reason we know the answer to
all of these questions is because of a
strange math feud in Russia that took
place over a hundred years ago.
In 1905, socialist groups all across Russia rose up against the Tsar, the ruler of the empire. They demanded complete political reform or, failing that, that he step down from power entirely.
This divided the nation into two. So on
one side you got the Tsarists, right?
They wanted to defend the status quo and
keep the tsar in power. But then on the
other side you had the socialists who
wanted this complete political reform.
And this division was so bad that it
crept into every part of society to the
point where even mathematicians started
picking sides. On the side of the Tsar was Pavel Nekrasov, unofficially called the Tsar of Probability. Nekrasov was a deeply religious and powerful man, and he used his status to argue that math could be used to explain free will and the will of God. His intellectual nemesis on the socialist side was Andrei Markov, also known as Andrei the Furious.
Markov was an atheist, and he had no patience for people who were being unrigorous, something he considered Nekrasov to be, because in his eyes, math had nothing to do with free will or religion. So he publicly criticized Nekrasov's work, listing it among the abuses of mathematics. Their feud
centered on the main idea people had
used to do probability for the last 200
years. And we can illustrate this with a
simple coin flip. When I flip the coin
10 times, I get six times heads and four
times tails, which is obviously not the
50/50 you'd expect. But if I keep
flipping the coin, then at first the
ratio jumps all over the place. But
after a large number of flips, we see
that it slowly settles down and
approaches 50/50.
And in this case, after 100 flips, we
end up on 51 heads and 49 tails, which
is almost exactly what you would expect.
This behavior that the average outcome
gets closer and closer to the expected
value as you run more and more
independent trials is known as the law
of large numbers. It was first proven by Jacob Bernoulli in 1713, and it was the key concept at the heart of probability theory right up until Markov and Nekrasov.
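The law of large numbers is easy to watch in action. Here's a minimal sketch (the function name is my own) that flips a simulated fair coin and tracks the running fraction of heads:

```python
import random

def running_heads_ratio(n_flips: int, seed: int = 0) -> list[float]:
    """Flip a fair coin n_flips times and record the running fraction of heads."""
    rng = random.Random(seed)
    heads = 0
    ratios = []
    for i in range(1, n_flips + 1):
        heads += rng.random() < 0.5  # heads with probability 1/2
        ratios.append(heads / i)
    return ratios

ratios = running_heads_ratio(100_000)
# Early on the ratio jumps all over the place, but it settles near 0.5:
print(ratios[9], ratios[99], ratios[-1])
```
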
But Bernoulli only proved that it worked for
independent events like a fair coin flip
or when you ask people to guess how much
they think an item is worth where one
event doesn't influence the others. But
now imagine that instead of asking each
person to submit their guess
individually, you ask people to shout
out their answer in public. Well, in
this case, the first person might think
it's an extraordinarily valuable item
and say it's worth around $2,000.
But now all the other people in the room
are influenced by this value and so
their guesses have become dependent. And
now the average doesn't converge to the
true value but instead it clusters
around a higher amount. And so for 200
years probability had relied on this key
assumption that you need independence to
observe the law of large numbers. And
this was the idea that sparked Nekrasov and Markov's feud. See, Nekrasov agreed with Bernoulli that you need independence
to get the law of large numbers. But he
took it one step further. He said, "If
you see the law of large numbers, you
can infer that the underlying events
must be independent." Take this table of
Belgian marriages from 1841 to 1845.
Now, you see that every year the average
is about 29,000. And so, it seems like
the values converge and therefore that
they follow the law of large numbers.
And when Nekrasov looked at other
social statistics like crime rates and
birth rates, he noticed a similar
pattern. But now think about where all
this data is coming from. It's coming
from decisions to get married, decisions
to commit crimes, and decisions to have
babies, at least for the most part. So
Nekrasov reasoned that because these
statistics follow the law of large
numbers, the decisions causing them must
be independent. In other words, he
argued that they must be acts of free
will. So to him, free will wasn't just
something philosophical. It was
something you could measure. It was
scientific.
But to Markov, Nekrasov was delusional.
He thought it was absurd to link
mathematical independence to free will.
So Markov set out to prove that dependent
events could also follow the law of
large numbers and that you can still do
probability with dependent events.
To do this, he needed something where
one event clearly depended on what came
before. And he got the idea that this is
what happens in text. Whether your next letter is a consonant or a vowel depends heavily on what the current letter is. So to test this, Markov turned to a poem at the heart of Russian literature: Eugene Onegin by Alexander Pushkin.
He took the first 20,000 letters of the
poem, stripped out all punctuation and
spaces, and pushed them together into
one long string of characters. He
counted the letters and found that 43%
were vowels and 57% were consonants.
Then Markov broke the string into overlapping pairs. That gave him four possible combinations: vowel-vowel, consonant-consonant, vowel-consonant, or consonant-vowel.
Now, if the letters were independent, the probability of a vowel-vowel pair would just be the probability of a vowel twice, which is about 0.18, or an 18% chance.
But when Markov actually counted, he found vowel-vowel pairs only show up 6% of the time, way less than if they were independent. And when he checked the other pairs, he found that all actual values differed greatly from what the independent case would predict. So Markov had shown that the letters were dependent.
And to beat Nekrasov, all he needed to
do now was show that these letters still
follow the law of large numbers. So he
created a prediction machine of sorts.
He started by drawing two circles, one
for a vowel and one for a consonant.
These were his states. Now, say you're
at a vowel, then the next letter could
either be a vowel or a consonant. So he
drew two arrows to represent these
transitions.
But what are these transition
probabilities? Well, Markov knew that if you pick a random starting point, there is a 43% chance that it'll be a vowel. He also knew that vowel-vowel pairs occur about 6% of the time. So to find the probability of going from a vowel to another vowel, he divided 0.06 by 0.43 to find a transition probability of about 13%. And since there is a 100% chance that another letter comes next, all the arrows going from the same state need to add up to one. So the chance of going to a consonant is 1 minus 0.13, or 87%.
He repeated this process for the
consonants to complete his predictive
machine. So let's see how it works.
We'll start at a vowel. Next, we
generate a random number between 0 and
1. If it's below 0.13, we get another vowel. If it's above, we get a consonant. We got 0.78, so we get a consonant. Then we generate another number and check if it's above or below 0.67. We got 0.21, so we get a vowel. Now, we can keep doing this and keep track of the ratio of vowels to consonants. At first, the ratio jumps all over the place, but after a while, it converges to a steady value: 43% vowels and 57% consonants, the exact split Markov had counted by hand.
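Markov's two-state machine is simple enough to rebuild in a few lines. This sketch uses the transition probabilities quoted above (vowel to vowel 0.13; consonant to vowel 0.67, a row inferred from the thresholds in the example) and checks that the long-run ratio still converges:

```python
import random

# Transition probabilities from Markov's counts: from a vowel, the next
# letter is a vowel with probability 0.13; from a consonant, 0.67 (this
# second row is inferred from the thresholds used in the example above).
P_VOWEL_NEXT = {"vowel": 0.13, "consonant": 0.67}

def vowel_fraction(n_steps: int, seed: int = 1) -> float:
    """Run the two-state chain and return the long-run fraction of vowels."""
    rng = random.Random(seed)
    state, vowels = "vowel", 0
    for _ in range(n_steps):
        state = "vowel" if rng.random() < P_VOWEL_NEXT[state] else "consonant"
        vowels += state == "vowel"
    return vowels / n_steps

print(vowel_fraction(200_000))  # settles near 0.43 despite the dependence
```

Even though every step depends on the one before it, the running ratio obeys the law of large numbers, which is exactly Markov's point.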
So Markov had built a dependent system, a
literal chain of events. And he showed
that it still followed the law of large
numbers, which meant that observing
convergence in social statistics didn't
prove that the underlying decisions were
independent. In other words, those
statistics don't prove free will at all.
Markov had shattered Nekrasov's argument, and he knew it. So he ended his paper with one final dig at his rival: thus, free will is not necessary to do probability. In fact, independence isn't even necessary to do probability with this Markov chain, as it came to be known.
He found a way to do probability with
dependent events. This should have been
a huge breakthrough because in the real
world almost everything is dependent on
something else. I mean, the weather
tomorrow depends on the conditions
today. How a disease spreads depends on
who's infected right now. And the
behavior of particles depends on the
behavior of particles around them. Many
of these processes could be modeled
using Markov chains.
>> Do people think it was like a mic-drop moment, like, oh, Nekrasov is out, Markov's the man? Or did people not really notice, or was it obscure?
>> I feel like people didn't really notice. It wasn't a really big thing.
And Markov himself seemingly didn't care much about how it might be applied to practical events. He wrote, "I am concerned only with questions of pure analysis. I refer to the question of applicability with indifference."
Little did he know that this new form of
probability theory would soon play a
major role in one of the most important
developments of the 20th century.
On the morning of the 16th of July 1945,
the United States detonated the Gadget, the world's first nuclear bomb.
The 6 kg plutonium bomb created an
explosion that was equivalent to nearly
25,000 tons of TNT. This was the
culmination of the top-secret Manhattan Project, a three-year-long effort by some of the smartest people alive, including J. Robert Oppenheimer, John von Neumann, and a little-known mathematician named Stanislaw Ulam.
Even after the war ended, Ulam continued
trying to figure out how neutrons behave
inside a nuclear bomb. Now, a nuclear
bomb works something like this. Say you
have a core of uranium 235. Then when a
neutron hits a U235 nucleus, the nucleus
splits, releasing energy and crucially
two or three more neutrons. If on
average those new neutrons go on to hit
and split more than one other U235
nucleus, you get a runaway chain
reaction. So you have a nuclear bomb,
but uranium 235, the fissile fuel needed
for the bombs, was really hard to get.
So one of the key questions was just how
much of it do you need to build a bomb?
And this is why Ulam wanted to
understand how the neutrons behave.
But then in January of 1946, everything
came to a halt.
Ulam was struck by a sudden and severe
case of encephalitis, an inflammation of
the brain that nearly killed him. His
recovery was long and slow with Ulam
spending most of his time in bed. To
pass the time, he played a simple card game: solitaire. But as he played
countless games, winning some, losing
others, one question kept nagging at
him: what are the chances that a randomly shuffled game of solitaire could be won?
It was a deceivingly difficult problem
to solve. Ulam played with all 52 cards,
where each arrangement created a unique
game. So the total number of possible games was 52 factorial, or about 8 × 10^67.
So solving this analytically was
hopeless.
But then Ulam had a flash of insight.
What if I just play hundreds of games
and count how many could be won? That would give him some sort of statistical approximation of the answer.
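That flash of insight is the whole method. Simulating full solitaire would take a page, so as a stand-in this sketch estimates a different card probability the same way: the chance that a shuffled deck leaves no card in its original position. That answer is known exactly (about 1/e ≈ 0.368), so you can see the sampled estimate land on it.

```python
import random

def no_fixed_points(deck_size: int, rng: random.Random) -> bool:
    """One 'game': shuffle a deck and check no card landed in its original slot."""
    deck = list(range(deck_size))
    rng.shuffle(deck)
    return all(card != pos for pos, card in enumerate(deck))

def monte_carlo_estimate(n_games: int, seed: int = 2) -> float:
    """Ulam's idea: play many random games and report the fraction 'won'."""
    rng = random.Random(seed)
    wins = sum(no_fixed_points(52, rng) for _ in range(n_games))
    return wins / n_games

print(monte_carlo_estimate(100_000))  # close to the exact answer, 1/e ≈ 0.368
```

A few hundred thousand random trials replace a combinatorial calculation over 52 factorial arrangements.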
Back at Los Alamos, the remaining
scientists grappled with much harder
problems than solitaire, like figuring
out how neutrons behave inside a nuclear
core.
In a nuclear core, there are trillions
and trillions of neutrons, all interacting with their surroundings. So, the number
of possible outcomes is immense, and
computing it directly seemed impossible.
But when Ulam returned to work, he had a
sudden revelation. What if we could
simulate these systems by generating
lots of random outcomes like I did with
Solitaire? He shared this idea with
von Neumann, who immediately recognized its
power but also spotted a key problem.
See, in solitaire, each game is
independent. How the cards are dealt in one game has no effect on the next. But
neutrons aren't like that. A neutron's
behavior depends on where it is and what
it has done before.
So you couldn't just sample random
outcomes like in solitaire. Instead you
needed to model a whole chain of events
where each step influenced the next.
What von Neumann realized is that you needed a Markov chain. So they made one, and a much simplified version of it works something like this. Now the
starting state is just a neutron
traveling through the core and from
there three things can happen. It can
scatter off an atom and keep traveling.
So that gives you an arrow going back to
itself. It can leave the system or get
absorbed by a non-fissile material, in which case it no longer takes part in the chain reaction, and so it ends its Markov chain. Or it can strike another uranium 235 atom, triggering a fission
event and releasing two or three more
neutrons that then start their own
chains.
But in this chain, the transition
probabilities aren't fixed. They depend
on things like the neutron's position,
velocity, and energy, as well as the
overall configuration and mass of
uranium.
So, a fast-moving neutron might have a 30% chance to scatter, a 50% chance to be absorbed or leave, and a 20% chance to cause fission. But a slower moving
neutron would have different
probabilities.
Next, they ran this chain on the world's
first electronic computer, the ENIAC.
The computer started by randomly generating a neutron's starting conditions and stepped through the chain to keep track of how many neutrons were produced on average per run, known as the multiplication factor K. So if on average one neutron produces another two neutrons, then K is equal to two, and if on average every two neutrons produce three neutrons, then K is equal to 3/2, and so on. Then after stepping
through the full chain for a specified
number of steps, we collect the average
K value and record that number in a
histogram. This process was then
repeated hundreds of times and the results tallied up, giving you a statistical distribution of the outcome.
If you find that in most cases K is less
than one, the reaction dies down. If
it's equal to one, there's a self-
sustaining chain reaction, but it
doesn't grow. And if K is larger than
one, the reaction grows exponentially
one, the reaction grows exponentially, and you've got a bomb.
With it, von Neumann and Ulam had a
statistical way to figure out how many
neutrons were produced without having to
do any exact calculations. In other
words, they could approximate
differential equations that were too
hard to solve analytically.
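A heavily simplified version of that simulation fits in a few lines. This sketch uses the illustrative probabilities from the example above (30% scatter, 50% absorbed or leave, 20% fission) and estimates K by averaging the offspring of many simulated neutrons; the real Los Alamos runs also tracked position, velocity, and energy.

```python
import random

# Illustrative probabilities from the example above (made up for a
# fast-moving neutron, not real cross-section data):
P_SCATTER, P_ABSORB, P_FISSION = 0.30, 0.50, 0.20

def neutrons_produced(rng: random.Random) -> int:
    """Follow one neutron through the chain until it is absorbed or fissions."""
    while True:
        r = rng.random()
        if r < P_SCATTER:
            continue                   # scatter off an atom: stay in the chain
        elif r < P_SCATTER + P_ABSORB:
            return 0                   # absorbed or leaves: the chain ends
        else:
            return rng.choice([2, 3])  # fission: two or three new neutrons

def estimate_k(n_runs: int, seed: int = 3) -> float:
    """Monte Carlo estimate of the multiplication factor K."""
    rng = random.Random(seed)
    return sum(neutrons_produced(rng) for _ in range(n_runs)) / n_runs

k = estimate_k(100_000)
print(k)  # below 1 with these numbers, so this core would not sustain a reaction
```

With these made-up probabilities a neutron eventually fissions with chance 0.2/0.7 and then yields 2.5 neutrons on average, so K ≈ 0.71: subcritical.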
All that was needed was a name for the
new method. Now, Ulam's uncle was a
gambler, and the random sampling and
high stakes reminded Ulam of the Monte
Carlo Casino in Monaco, and the name
stuck. The Monte Carlo method was born.
The method was so successful that it
didn't stay secret for long. By the end
of 1948, scientists at another lab,
Argonne, near Chicago, used it to study
nuclear reactor designs, and from there,
the idea spread quickly. Ulam later
remarked, "It is still an unending
source of surprise for me to see how a
few scribbles on a blackboard could
change the course of human affairs."
And it wouldn't be the last time a
Markov chain-based method changed the
course of human affairs.
In 1993, the internet was opened to the public, and soon it exploded. By the mid-1990s, thousands of new pages appeared
every day and that number was only
growing.
This created a new kind of problem. I
mean, how do you find anything in this
everexpanding sea of information?
In 1994, two Stanford PhD students,
Jerry Yang and David Filo, founded the
search engine Yahoo to address this
issue.
But they needed money. So, a year later,
they arranged to meet with Japanese
billionaire Masayoshi Son, also known as the Bill Gates of Japan.
They were looking to raise $5 million
for their new startup. But Son has other plans.
He offers to invest a full $100 million
instead. That's 20 times more than what
the founders asked for. So, Jerry Yang
declines, saying, "We don't need that
much." But Sun disagrees. Jerry,
everyone needs $100 million.
Before the founders get a chance to
respond, Son jumps in again and asks,
"Who are your biggest competitors?"
"Excite and Lycos," the pair respond. Son orders his associate to write those
names down. And then he says, "If you
don't let me invest in Yahoo, I will
invest in one of them and I'll kill
you."
See, Son had realized something. None of
the leading search engines at the time
had any superior technology. They didn't
have a technological advantage over the
others. They all just ranked pages by
how often a search term appears on a
given page. So, the battle for the
number one search engine would be
decided by who could attract the most
users, who could spend the most on
marketing.
>> Lycos, go get it.
>> Get Lycos or get lost. This is Revolution.
[Music]
Yahoo.
>> And marketing required a lot of money.
Money that Son had, so he could decide
who won the war. Yahoo's founders
realized they were left with no real
choice but to accept Son's investment.
So here we are right in the middle of
Yahoo. And within 4 years, Yahoo became
the most popular site on the planet. In
the time it takes to say this sentence,
Yahoo will answer 79,000 information
requests worldwide. The two men are now
worth $120 million each.
>> Yahoo.
>> But Yahoo had a critical weakness.
See, Yahoo's keyword search was easy to
trick. To get your page ranked highly,
you could just repeat keywords hundreds
of times, hidden with white text on a
white background.
One thing they didn't have in those
early days was a notion of quality of
the result. So they had a notion of
relevance, saying, "Does this document
talk about the thing that you're
interested in?" But there wasn't really
a notion of which ones are better.
>> What they really needed was a way to
rank pages by both relevance and
quality. But how do you measure the
quality of a web page? Well, to
understand that, we need to borrow an
idea from libraries. So, I'm old enough
that library books used to have a paper card in them, stamped with all the due dates of when they were due back. You
took a book and if it had a lot of
those, you said, "Oh, this is probably a
good book." And if it didn't have any,
you said, "Well, maybe this isn't the
best book." Stamps acted like
endorsements. The more stamps, the
better the book must be. And the same
idea can be applied to the web. Over at
Stanford, two PhD students, Sergey Brin
and Larry Page, were working on this
exact problem. Brin and Page realized that each link to a page can be thought of as an endorsement, and the more links a page sends out, the less valuable each vote becomes. So what they realized is that we can model the web as a Markov chain.
To see how this works, imagine a toy
internet with just four web pages. Call
them Amy, Ben, Chris, and Dan. These are
our states. Typically, one web page
links to others, allowing you to move
between them. These are our transitions.
In this setup, Amy only links to Ben, so
there's a 100% chance of going from Amy
to Ben. Ben links to Amy, Chris, and
Dan, so there's a 33% chance of going to
any of those pages. And we can fill out
the other transition probabilities in
the same way. So now we can run this
Markov chain and see what happens.
Imagine you're a surfer on this web. You
start on a random page, say Amy, and you
keep running the machine and keep track
of the percentage of time you spend on
each page. Over time, the ratio settles
and the scores give us some measure of
the relative importance of these pages.
You spend the most time on Ben, so Ben
is ranked first, followed by Amy, then
Dan, and lastly Chris. It might seem
like there's an easy way to beat the system: just make 100 pages all linking to your website. Now you get 100 full votes, and you'll always rank on top. But that is not the case. While during the first few steps they might make your page seem important, none of the other websites link to them. So over many steps, their contributions don't matter. You might have many links, but
they're not quality links, so they don't
affect the algorithm.
But there is still one problem, though.
Not all pages are connected. In networks like this one, a random surfer can get stuck in a loop, never reaching the rest of the web. So to fix this, we can set a rule that 85% of the time, our random surfer just follows a link like normal, but for about 15% of the time, they just jump to a page at random.
damping factor makes sure that we
explore all possible parts of the web
without ever getting stuck.
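The random-surfer procedure above can be sketched directly. The video only spells out Amy's and Ben's links, so the Chris and Dan rows below are assumptions for illustration; with the 85/15 damping rule, the visit fractions settle into a ranking:

```python
import random

# Toy web from the example above. Only Amy's and Ben's links are given in
# the video; the Chris and Dan rows are assumptions for illustration.
LINKS = {
    "Amy":   ["Ben"],
    "Ben":   ["Amy", "Chris", "Dan"],
    "Chris": ["Dan"],
    "Dan":   ["Amy", "Ben"],
}
DAMPING = 0.85  # follow a link 85% of the time, teleport 15% of the time

def random_surfer(n_steps: int, seed: int = 4) -> dict[str, float]:
    """Walk the link graph and return the fraction of time spent on each page."""
    rng = random.Random(seed)
    pages = list(LINKS)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)
    for _ in range(n_steps):
        if rng.random() < DAMPING:
            page = rng.choice(LINKS[page])  # follow an outgoing link
        else:
            page = rng.choice(pages)        # damping: jump to a random page
        visits[page] += 1
    return {p: v / n_steps for p, v in visits.items()}

scores = random_surfer(200_000)
print(max(scores, key=scores.get))  # Ben collects the most link "votes"
```

The exact scores depend on the assumed Chris and Dan links; the point is only that the time fractions converge and give a ranking.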
By using Markov chains, Page and Brin had built a better search engine, and they called it PageRank
>> because it's talking about how web pages react with each other, and also because the founder's name is Larry Page, so he snuck that in.
>> With PageRank, they got much better
search results, often getting you to the
site you were looking for in one go.
Although to some, this sounded like a terrible idea.
>> Others said, "Oh, well, you're telling me you get a search that will get the right result on the first answer? I don't want that, because if it takes them three or four searches to get the right answer, then I have three or four chances to show ads. And if you get them the answer right away, I'm just going to lose them. So, you know, I don't see why better search is better."
>> But Page and Brin disagreed. They were
convinced that if their product was far
superior, then people would flock to it.
I would say it actually is a democracy that works. If all pages were equal, anybody can manufacture as many pages as they want. I can set up a billion pages on my server tomorrow. We shouldn't treat them all as equal. Just looking at the data out of curiosity, we found that we had
technology to do a better job of search
and we realized how impactful having
great search can be.
>> And so in 1998, they launched their new
search engine to take on Yahoo.
Initially, they called it BackRub, after the backlinks it analyzed. But then they realized that maybe that's not the most attractive name. Now, their ambitions were big: to essentially index all the pages on the internet, and they needed a name equally as big. So, they thought of the largest number they could think of, 10 to the power 100, a googol. But then,
when trying to register their domain,
they accidentally misspelled it. And so,
Google was born.
Over
the next four years, Google overthrew
Yahoo to become the most used search
engine.
>> Everyone who knows the internet almost
certainly knows Google.
>> Googling is like oxygen to teenagers.
And today, Alphabet, which is Google's
parent company, is worth around $2
trillion.
>> When Google makes even the slightest
change in its algorithms, it can have
huge effects. Google Google.
>> They're on fire. And the reason why
they're on fire is because they're
focused. And they're more focused than
Yahoo who does search. They're more
focused than Microsoft who does search
with Bing. Yahoo has lots of traffic.
They always have. They have some really
great properties, but I don't think
Yahoo is the go-to place, you know. And
at the heart of this trillion-dollar algorithm is a Markov chain, which only looks at the current state to predict what's going to happen next.
But in the 1940s, Claude Shannon, the
father of information theory, started
asking a different question. He went
back to Markov's original idea of
predicting text, but instead of just
using vowels and consonants, he focused
on individual letters. And he wondered,
what if instead of looking at only the
last letter as a predictor, I look at
the last two? Well, with that, he got
text that looked like this. Now, it doesn't make much sense, but there are some recognizable words, like "way," "of," "off," and "the."
But Shannon was convinced he could do better. So next, instead of looking at letters, he wondered: what if I use entire words as predictors? That gave him sentences like this: "The head and in frontal attack on an English writer that the character of this point is therefore another method for the letters that the time of whoever told the problem for an unexpected."
Now, clearly this doesn't make any
sense, but Shannon did notice that
sequences of four words or so generally
did make sense. For instance, "attack on an English writer" kind of makes sense.
So, Shannon learned that you can make
better and better predictions about what
the next word is going to be by taking
into account more and more of the
previous words. It's kind of like what
Gmail does when it predicts what you're
going to type next. And this is no
coincidence. The algorithms that make these predictions are based on Markov chains.
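A word-level predictor like Shannon's is really just a dictionary of "what came after this word." This toy sketch (the tiny corpus is made up) trains on a string and samples one word at a time, looking only at the current word:

```python
import random
from collections import defaultdict

def build_chain(text: str) -> dict[str, list[str]]:
    """Map each word to the list of words that follow it in the training text."""
    words = text.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start: str, n_words: int, seed: int = 5) -> str:
    """Sample a sentence one word at a time, looking only at the current word."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n_words - 1):
        followers = chain.get(out[-1])
        if not followers:
            break  # dead end: this word never appeared mid-text
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "the cat sat on the mat and the cat saw the dog on the mat"
chain = build_chain(corpus)
print(generate(chain, "the", 8))
```

Frequent pairs in the corpus become frequent transitions, which is why short runs of output look plausible while long runs drift into nonsense.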
>> They're not necessarily using letters, you know, they use what they call tokens, some of which are letters, some of which are words, marks of punctuation, whatever. So, it's a bigger set than just the alphabet. The game is simply: we have this string of tokens that, you know, might be 30 long, and we're asking, what are the odds that the next token is this or this or this?
But today's large language models don't
treat all those tokens equally because
unlike simple Markov chains, they also
use something called attention which
tells the model what to pay attention
to. So in the phrase the structure of
the cell, the model can use previous
context like blood and mitochondria to
know the cell most likely refers to
biology rather than a prison cell and it
uses that to tune its prediction. But as
large language models become more
widespread, one concern is that the text
they produce ends up on the internet and
that becomes training data for future
models.
>> When you start doing that, the game is very soon over. You come in this
case to a very dull stable state. It
just says the same thing over and over
and over again forever. The language
models are vulnerable to this process.
>> And any system like this where we have a
feedback loop will become hard to model using Markov chains. Take global
warming for instance. As we increase the
amount of carbon dioxide in the air, the
average temperature of the earth
increases. But as the temperature
increases, the atmosphere can hold more
water vapor, which is an incredibly
powerful greenhouse gas. And with more
water vapor, the temperature increases
further, allowing for even more water
vapor. So you get this positive feedback
loop which makes it hard to predict
what's going to happen next. So there are some systems where Markov chains don't work. But for many other dependent
systems, they offer a way of doing
probability.
But what's fascinating is that all these
systems have extremely long histories. I
mean, you could trace back all the
letters in a text, trace back all the
interactions of what a neutron did, or
trace back the weather for weeks. But
the beautiful thing Markov and others
found is that for many of these systems,
you can ignore almost all of that. You
can just look at the current state and
forget about the rest. That makes these
systems memoryless. And it's this memoryless property that makes Markov chains so powerful, because it's what allows you to take these extremely complex systems, simplify them a lot, and still make meaningful predictions.
As one paper put it, problem solving is
often a matter of cooking up an appropriate Markov chain.
>> It's kind of ridiculous to me that this
basic fact of mathematics would come out
of a fight like that which, you know, really had nothing to do with it. But all the evidence suggests that it really was this determination to show up Nekrasov that led Markov to do the work.
>> But there's one question we still
haven't answered when playing solitaire.
How did Ulam know his cards were
perfectly shuffled? I mean, how many
shuffles does it take to get a
completely random arrangement of cards?
>> If you have a deck of cards, you need to
shuffle it, right?
>> Okay. How often if you're shuffling like
you know you split it in half and then
you do the
>> How often do you have to shuffle it to
make it completely random?
>> Two.
>> Two. I'm going with 26.
>> Four times.
>> I don't know.
>> 52 times.
>> Okay. Okay. It's not a bad guess.
>> Seven.
>> It is seven. Really?
>> Yeah. So, you can think of card shuffling as a Markov chain, where each deck arrangement is a state and then
each shuffle is a step. And so for a
deck of 52 cards, if you riffle shuffle
it seven times, then every arrangement
of the deck is about equally likely. So
it's basically random.
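One way to see the seven-riffles claim is to follow a single card through the chain. This sketch implements a standard riffle model (a Gilbert-Shannon-Reeds-style shuffle: binomial cut, then interleave with probability proportional to each half's size, which is an assumption about how the video's shuffles work) and tracks where the original top card ends up:

```python
import random

def riffle(deck: list[int], rng: random.Random) -> list[int]:
    """One riffle in the Gilbert-Shannon-Reeds model: binomial cut, random interleave."""
    cut = sum(rng.random() < 0.5 for _ in deck)  # where the deck is split
    left, right = deck[:cut], deck[cut:]
    out, i, j = [], 0, 0
    while i < len(left) or j < len(right):
        # drop the next card from a half with probability proportional to its size
        if rng.random() * (len(left) - i + len(right) - j) < len(left) - i:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out

def top_card_spread(n_shuffles: int, n_trials: int = 5_000, seed: int = 6) -> list[float]:
    """Distribution of the original top card's position after n_shuffles riffles."""
    rng = random.Random(seed)
    counts = [0] * 52
    for _ in range(n_trials):
        deck = list(range(52))
        for _ in range(n_shuffles):
            deck = riffle(deck, rng)
        counts[deck.index(0)] += 1
    return [c / n_trials for c in counts]

# After one riffle the original top card is still very likely near the top;
# after seven, every position is close to the uniform 1/52 ≈ 0.019.
print(max(top_card_spread(1)), max(top_card_spread(7)))
```

Tracking one card only hints at the full result; the seven-shuffle bound comes from measuring the distance to uniform over all 52-factorial arrangements.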
But I can shuffle like that. So for me,
what I do is I do it like this. How many
times do you think you have to shuffle
like this to get it random?
What do you think? And perhaps more
importantly, how would you go about
working it out? Well, that's where
today's sponsor, Brilliant, comes in.
Brilliant is a learning app that gets
you hands-on with problems just like
this. Whether it's math, physics,
programming, or even AI, Brilliant's
interactive lessons and challenges let
you play your way to a sharper mind. You
can discover how large language models
actually work. From basic Markov chains
to complex neural networks, or dig into
the math behind this shuffling question.
It's a fun way to build knowledge and
skills that help you solve all kinds of
problems. Which brings us back to our
shuffle. So, Casper, what actually is
the answer?
>> It's actually over 2,000. Crazy, right?
>> Yeah.
>> So, the next time someone offers to
shuffle before a game, make sure they're
doing it right. Seven riffles or it
doesn't count. But the interesting part
isn't just knowing that. It's
understanding why and seeing how a
simple question can lead you to some
surprisingly complex mathematics. And
that's what Brilliant is all about. So,
to try everything Brilliant has to offer
for free for a full 30 days, visit
brilliant.org/veritasium.
Click that link in the description or
scan this handy QR code. And if you sign
up, you'll also get 20% off their annual
premium subscription. So, I want to
thank Brilliant for sponsoring this
video. And I want to thank you for
watching.
Easy.