Jim Keller: Moore's Law, Microprocessors, and First Principles

Jim Keller: Moore's Law, Microprocessors, and First Principles | Lex Fridman Podcast #70

Nb2tebYAaOA • 2020-02-05

Kind: captions
Language: en
the following is a conversation with Jim
Keller legendary microprocessor engineer
who has worked at AMD Apple Tesla and
now Intel he's known for his work on AMD
K7 K8 K12 and Zen microarchitectures
Apple A4 and A5 processors and co-author
of the specification for the x86-64
instruction set and HyperTransport
interconnect he's a brilliant first
principles engineer and out-of-the-box
thinker and just an interesting and fun
human being to talk to this is the
artificial intelligence podcast if you
enjoy it subscribe on YouTube give it
five stars on Apple Podcasts follow on
Spotify support it on Patreon or simply
connect with me on Twitter at lexfridman
spelled F-R-I-D-M-A-N I recently started
doing ads at the end of the introduction
I'll do one or two minutes after
introducing the episode and never any
ads in the middle that can break the
flow of the conversation I hope that
works for you and doesn't hurt the
listening experience this show is
presented by cash app the number one
finance app in the App Store
I personally use cash app to send money
to friends but you can also use it to
buy sell and deposit Bitcoin in just
seconds cash app also has a new
investing feature you can buy fractions
of a stock say $1 worth no matter what
the stock price is brokerage services are
provided by cash app investing a
subsidiary of Square and member SIPC
I'm excited to be working with cash app
to support one of my favorite
organizations called FIRST best known
for their FIRST Robotics and Lego
competitions they educate and inspire
hundreds of thousands of students in
over 110 countries and have a perfect
rating on Charity Navigator which means
that donated money is used to maximum
effectiveness when you get cash app from
the App Store or Google Play and use code
LexPodcast you'll get ten dollars and
cash app will also donate ten dollars to
FIRST which again is an organization
that I've personally seen inspire girls
and boys to dream of engineering a
better world and now here's my
conversation with Jim Keller what are
the differences and similarities between
the human brain and a computer with the
microprocessor at its core let's start
with a philosophical
question perhaps well since people don't
actually understand how human brains
work I think that's true I think that's
true so it's hard to compare them
computers are you know there's really
two things there's memory and there's
computation right and to date almost all
computer architectures are global memory
which is a thing right and then
computation where you pull data and you
do relatively simple operations on it
and write data back so it's decoupled in
modern in modern computers and you think
in the human brain everything's a mesh a
mess that's combined together what
people observe is there's you know some
number of layers of neurons which have
local and global connections and
information is stored in some
distributed fashion and people build
things called neural networks in
computers where the information is
distributed in some kind of fashion you
know there's a mathematics behind it I
don't know that the understanding of that
is super deep the computations we run on
those are straightforward computations I
don't believe anybody has said a neuron
does this computation so to date it's
hard to compare them I would say so
let's get into the basics before we zoom
back out how do you build a computer
from scratch what is a microprocessor
what is a microarchitecture what's an
instruction set architecture maybe even
as far back as what is a transistor so
the special charm of computer
engineering is there's a relatively good
understanding of abstraction layers so
down at the bottom you have atoms and atoms
get put together in materials like
silicon or doped silicon or metal and we
build transistors on top of that we
build logic gates
right and then functional units like an
adder or subtractor or an instruction
parsing unit and we assemble those into
you know processing elements modern
computers are built out of you know
probably 10 to 20 locally you know
organic processing elements or coherent
processing elements and then that runs
computer programs right so there's
abstraction layers and then software you
know there's an instruction set you run
and then there's assembly language C C++
Java JavaScript you know there's
abstraction layers you know essentially
from the atom to the data center right
so when you when you build a computer
you know first there's a target like
what's it for like how fast does it have
to be which you know today there's a
whole bunch of metrics about what that
is and then in an organization of you
know a thousand people who build a
computer there's lots of different
disciplines that you have to operate on
does that make sense and so so there's a
bunch of levels of abstraction in an
organization like Intel and in your
own vision there's a lot of brilliance
that comes in at every one of those
layers some of it is science some of it is
engineering some of it is art what's the
most
if you could pick favorites what's the
most important your favorite layer on
these layers of abstractions where does
the magic enter this hierarchy I don't
really care that's the fun you know I'm
somewhat agnostic to that so I would say
for relatively long periods of time
instruction sets are stable so the x86
instruction set the ARM instruction set
what's an instruction set so it says how
do you encode the basic operations load
store multiply add subtract conditional
branch you know there aren't that many
interesting instructions look if you
look at a program and it runs you know
90% of the execution is on 25 opcodes
you know 25 instructions and those are
stable right what does it mean stable
Intel architecture has been around for
twenty-five years it works it works and
that's because the basics you know were
defined a long time ago right now the
way an old computer ran is you fetched
instructions and you executed them in
order do the load do the add do the
compare the way a modern computer works
is you fetch large numbers of
instructions say 500 and then you find
the dependency graph between the
instructions and then you you execute in
independent units those little micro
graphs so a modern computer like people
like to say computers should be simple
and clean but it turns out the market
for simple clean slow
computers is zero right we don't sell
any simple clean computers now you can
there's how you build it can be clean
but the computer people want to buy
that's in say you know a phone or data center
fetches a large number of instructions
computes the dependency graph and then
executes it in a way that gets the right
answers and optimizes that graph somehow
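
The "fetch a window, find the dependency graph, execute the independent pieces" idea can be sketched in a few lines. This is a toy model under stated assumptions, not how any real core is built: the `Instr` type, the register names, and the six-instruction window are hypothetical, only read-after-write dependencies are tracked (as if register renaming had removed the others), every operation takes one cycle, and there are unlimited execution units.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    text: str     # printable form only; not parsed
    dest: str     # register this instruction writes
    srcs: tuple   # registers this instruction reads

def dependencies(window):
    """Read-after-write dependencies: for each instruction, the earlier ones it must wait for."""
    last_writer = {}
    deps = []
    for index, instr in enumerate(window):
        deps.append({last_writer[r] for r in instr.srcs if r in last_writer})
        last_writer[instr.dest] = index
    return deps

def earliest_cycles(window):
    """Earliest issue cycle per instruction, assuming unit latency and unlimited units."""
    deps = dependencies(window)
    cycles = []
    for index in range(len(window)):
        # An instruction can issue one cycle after the latest of its producers.
        cycles.append(1 + max((cycles[d] for d in deps[index]), default=0))
    return cycles

# Hypothetical six-instruction window computing (a + b) - (c * c).
window = [
    Instr("load r1, [a]", "r1", ()),
    Instr("load r2, [b]", "r2", ()),
    Instr("add  r3, r1, r2", "r3", ("r1", "r2")),
    Instr("load r4, [c]", "r4", ()),
    Instr("mul  r5, r4, r4", "r5", ("r4",)),
    Instr("sub  r6, r3, r5", "r6", ("r3", "r5")),
]

cycles = earliest_cycles(window)
for instr, cycle in zip(window, cycles):
    print(f"cycle {cycle}: {instr.text}")
print("one-at-a-time cycles:", len(window), "dataflow cycles:", max(cycles))
```

In this toy window the three loads are independent of each other, so the six instructions finish in three dataflow steps instead of six sequential ones. A real window is far larger and is limited by latencies, unit counts, and memory ordering, none of which are modeled here.
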
yeah they run deeply out of order and
then there's semantics around how memory
ordering works and other things work so
the the computer sort of has a bunch of
bookkeeping tables that say what order
should these operations finish in or appear to
finish in but to go fast you have to
fetch a lot of instructions and find all
the parallelism now there's a second
kind of computer which we call GPUs
today and I call the difference
there's found parallelism like you have
a program with a lot of dependent
instructions you fetch a bunch and then
you go figure out the dependency graph
and you issues instructions out order
that's because you have one serial
narrative to execute which in fact is
and can be done out of order you call a
narrative yeah well so yeah so humans
think of serial narrative so read read a
book right there's a you know there's
the sends after sentence after sentence
and there's paragraphs
now you could diagram that
imagine you diagrammed it properly and
you said which sentences could be read
in any order without changing
the meaning right but that's a
fascinating question to ask of a book
yeah yeah you could do that right so
some paragraphs could be reordered some
sentences can be reordered you could say
he is tall and smart and X right and it
doesn't matter the order of tall and
smart but if you say the tall man
who's wearing a red shirt what colors
you know like you can create
dependencies right right and so GPUs on
the other hand run simple programs on
pixels but you're given a million of
them and to first order the screen
you're looking at it doesn't care which
order you do it in so I call that given
parallelism simple narratives around the
large numbers of things where you can
just say it's parallel because you told
me it was so found parallelism where the
narrative is it's sequential but you
discover like little pockets of
parallelism versus it turns out large
pockets of parallelism large so how hard
is it to discover well how hard is it
that's just transistor count right so
once you crack the problem you say
here's how you fetch ten instructions at
a time here's how you calculate the
dependencies between them here's how you
describe the dependencies here's you
know these are pieces right so once you
describe the dependencies then it's just
a graph sort of it's an algorithm that
finds what is that I'm sure there's a
graph theoretical answer
here that's solved well in general
programs modern programs that human
beings write
how much found parallelism is there in
them about 10x what does 10x mean oh
well if you execute it in order yeah
you would get what's called cycles per
instruction and it would be about you
know three instructions three cycles per
instruction because of the latency of
the operations and stuff
and a modern computer executes it at
like 0.2 to 0.25 cycles per
instruction so it's about we today
find 10x and there's two
things one is the found parallelism in
the narrative right and the other is the
predictability of the narrative right so
certain operations they do a bunch of
calculations and if greater than one do
this else do that that that decision is
predicted in modern computers to high
90% accuracy so branches happen a lot so
imagine you have you have a decision to
make every six instructions which is
about the average right but you want to
fetch five hundred instructions figure out
the graph and execute them all in
parallel that means you have let's say
if you fetch 600 instructions and it's
every six you have to fetch you have to
predict ninety-nine out of a hundred
branches correctly for that window to be
effective okay so parallelism you can't
parallelize branches or you can you can
predict what does predict a branch
mean what does it take so imagine you do
a computation over and over you're in a
loop so while
n is greater than one do and you go
through that loop a million times so
every time you look at the branch you
say it's probably still greater than one
and you're saying you could do that accurately
very accurately modern computers my mind
is blown how the heck do you do that wait
a minute
well you want to know this is really sad
20 years ago
yes you simply recorded which way the
branch went last time and predicted the
same thing right okay what's the
accuracy of that 85% so then somebody
said hey let's keep a couple of bits and
have a little counter so when it predicts
one way we count up and then it pins so say
you have a three bit counter so you
count up and then count down and
you know you can use the top bit as the
sign bit so you have a signed two bit
number so if it's greater than one you
predict taken and less than one you predict
not-taken
right or less than zero or whatever the
thing is
and that got us to 92% oh okay no it
gets better
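
The two early schemes described here, "predict whatever the branch did last time" and "keep a small counter that counts up and down and pins at its limits", can be compared on a toy branch. This is a minimal sketch under assumptions: the counter is taken to be two bits wide with its top bit used as the prediction, the starting states are arbitrary, and the branch pattern (a loop branch taken nine times and then falling through, repeated) is invented for illustration. The accuracies it prints come from this toy pattern only and are not the 85% and 92% figures quoted in the conversation.

```python
def last_outcome_accuracy(outcomes, initial=True):
    """Predict that the branch repeats whatever it did the previous time."""
    last = initial
    correct = 0
    for taken in outcomes:
        correct += (last == taken)
        last = taken
    return correct / len(outcomes)

def saturating_counter_accuracy(outcomes, bits=2):
    """Count up on taken, down on not-taken, pinning at both ends; the top bit is the prediction."""
    top = (1 << bits) - 1          # highest counter value (3 for two bits)
    threshold = 1 << (bits - 1)    # top bit set means "predict taken"
    counter = threshold            # assumed starting state: weakly taken
    correct = 0
    for taken in outcomes:
        prediction = counter >= threshold
        correct += (prediction == taken)
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return correct / len(outcomes)

# Hypothetical loop branch: taken 9 times, then not taken once at loop exit, 100 times over.
outcomes = ([True] * 9 + [False]) * 100

print("last outcome:      ", last_outcome_accuracy(outcomes))
print("saturating counter:", saturating_counter_accuracy(outcomes))
```

On this pattern the last-outcome scheme is wrong twice per loop (at the exit and again on re-entry), while the counter absorbs the single exit and is wrong only once per loop, which is the intuition behind the counter's improvement.
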
this branch depends on how you got there
so if you came down the code one way
you're talking about Bob and Jane right
and then said does Bob like Jane
and it went one way but if you're talking
about Bob and Jill does Bob like Jane
you go a different way
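
The point that a branch's direction can depend on how the program got there can be sketched by letting the recent sequence of outcomes select which counter to use. This is an illustrative model under assumptions, not a description of any shipping predictor: "how you got there" is approximated by a short global history of recent taken/not-taken bits, each history value gets its own two-bit saturating counter, and the alternating branch pattern is invented for the demonstration. Setting `history_bits=0` collapses the table to a single counter for comparison.

```python
def history_predictor_accuracy(outcomes, history_bits=2):
    """Two-bit counters selected by the last `history_bits` branch outcomes."""
    mask = (1 << history_bits) - 1
    table = [2] * (1 << history_bits)   # one counter per history value, starting weakly taken
    history = 0
    correct = 0
    for taken in outcomes:
        counter = table[history]
        prediction = counter >= 2
        correct += (prediction == taken)
        # Train the counter that was consulted, pinning at 0 and 3.
        table[history] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift the newest outcome into the history register.
        history = ((history << 1) | int(taken)) & mask
    return correct / len(outcomes)

# Hypothetical branch that strictly alternates: taken, not taken, taken, ...
outcomes = [True, False] * 500

print("single counter:   ", history_predictor_accuracy(outcomes, history_bits=0))
print("2 bits of history:", history_predictor_accuracy(outcomes, history_bits=2))
```

A lone counter keeps flipping between its two "taken" states and is right only half the time on this pattern, whereas with history each context ("the previous outcome was taken" versus "was not taken") settles on its own answer after a single early miss.
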
right so that's called history so you
take the history and a counter that's
cool but that's not how anything works
today they use something that looks a
little like a neural network so modern
you take all the execution flows and
then you do basically deep pattern
recognition of how the program is
executing and you do that multiple
different ways and you have something
that chooses what the best result is
there's a little supercomputer inside
the computer that's trying to project
that calculates which way branches go so
the effective window that it's worth
finding graphs in gets bigger why was
that gonna make me sad that's amazing
it's amazingly complicated oh well
here's the funny thing so to get to 85%
took a thousand bits to get to 99% takes
tens of megabits
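
The link between prediction accuracy and how far ahead it is worth fetching can be checked with rough arithmetic. The sketch below uses the figures quoted in the conversation (a branch about every six instructions, and accuracies of 85%, 92%, and 99%) and adds one assumption that is not from the source: each branch is mispredicted independently with the same probability. The function names are hypothetical.

```python
INSTRUCTIONS_PER_BRANCH = 6   # "a decision to make every six instructions"

def expected_run_length(accuracy):
    """Average instructions fetched up to and including the first mispredicted branch.

    With independent mispredictions, the number of branches until the first miss
    is geometric with mean 1 / (1 - accuracy).
    """
    return INSTRUCTIONS_PER_BRANCH / (1.0 - accuracy)

def chance_window_is_clean(accuracy, window):
    """Probability that every branch in a window of `window` instructions is predicted correctly."""
    branches = window // INSTRUCTIONS_PER_BRANCH
    return accuracy ** branches

for accuracy in (0.85, 0.92, 0.99):
    run = expected_run_length(accuracy)
    clean = chance_window_is_clean(accuracy, 600)
    print(f"{accuracy:.0%}: about {run:.0f} instructions per mispredict, "
          f"600-instruction window fully correct with probability {clean:.2g}")
```

Under these assumptions, 85% accuracy supports a useful run of only about 40 instructions, while 99% supports about 600, which is the same order-of-magnitude jump in window size that the extra predictor storage is buying.
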
so this is one of those to get the
result you want you know to get from a
window of say 50 instructions to 500 it
took three orders of magnitude or four
orders of magnitude more bits now if
you get the prediction of a branch wrong
what happens then you
flush the pipe so it's just a performance
cost but it gets even better yeah so
we're starting to look at stuff that
says so they executed down this path and then
you had two ways to go but far far away
there's something that doesn't matter
which path you went so you took
the wrong path you executed a bunch of
stuff then you had the mispredict
you backed it up but you remembered all
the results you already calculated some
of those are just fine
look if you read a book and you
misunderstand a paragraph your
understanding of the next paragraph
sometimes is invariant to that
understanding sometimes it depends on
it and you can kind of anticipate that
invariance yeah well you can keep track
of whether that data changed and so when
you come back to a piece of code should
you calculate it again or do the same
thing okay how much of this is art and
how much of it is science because it
sounds pretty complicated so well how do
you describe a situation so imagine you
come to a point in the road where you have to
make a decision right and you have a
bunch of knowledge about which way to go
maybe you have a map so do you want to go
the shortest way or do you want to go
the fastest way or you want to take the
nicest Road so it's just some set of
data so imagine you're doing something
complicated like building a computer
and there's hundreds of decision points
all with hundreds of possible ways to go
and the ways you pick interact in a
complicated way right and then you have
to pick the right spot right so those
are there so I don't know you avoided
the question you just described the
Robert Frost poem the road less taken I
described the Robert Frost problem which
is what we do as computer designers it's all
poetry ok great
yeah I don't know how to describe that
because some people are very good at
making those intuitive leaps it seems
like the combinations of things some
people are less good at it but they're
really good at evaluating your
alternatives right and everybody has a
different way to do it and some people
can't make those leaps but they're
really good at analyzing it so when you
see computers are designed by teams of
people who have very different skill
sets and a good team has lots of
different kinds of people and I suspect
you would describe some of them as
artistic right but not very many
unfortunately or fortunately fortunately
well you know computer science is hard it's
99% perspiration
and the 1% inspiration is really
important but you need the 99 yeah you got
to do a lot of work and then there's
there are interesting things to do at
every level of that stack so at the end of
the day if you run the same program
multiple times does it always produce
the same result is is there some room
for fuzziness there that's a math
problem so if you run a correct C
program the definition is every time you
run it you get the same answer yeah that
well that's a math statement but that's
a that's a language definitional
statement so yes for years when people
did when we first did 3d acceleration of
graphics you could run the same scene
multiple times and get different answers
right right and then some people thought
that was okay and some people thought it
was a bad idea and then when the HPC
world used GPUs for calculations they
thought it's a really bad idea okay now
in modern AI stuff people are looking at
networks where the precision of the data
is low enough that the data is somewhat
noisy and the observation is the input
data is unbelievably noisy so why should
the calculation be not noisy and people
have experimented with algorithms that
say you can get faster answers by being
noisy like as the network starts to
converge if you look at the computation
graph it starts out really wide and it
gets narrower and you can say is that
last little bit that important or should
I start the graph on the next rev
before we whittle all the way down to
the answer right so you can create
algorithms that are noisy now if you're
developing something and every time you
run it you get a different answer it's
really annoying and so most people think
even today every time you run the
program you get the same answer now I
know but the question is that's the
formal definition of a programming
language there is a definition of
languages that don't get the same answer
but people who use those always want
something because you get a bad answer
and then you're wondering is it because
of something in the algorithm or because
of this and so everybody wants a little
switch that says no matter what do it
deterministically and it's really weird
because almost everything going into
modern calculations is noisy
so why do the answers have to be so clear
right so where do you stand I
design computers for people who run
programs so if somebody says I want a
deterministic answer like most people
want that can you deliver a
deterministic answer I guess is the
question like when you hopefully sure
what people don't realize is you
get a deterministic answer even though
the execution flow is very
nondeterministic so if you run this program
a hundred times it never runs the same
way twice ever and the answer it arrives at
is the same but it gets the same answer
every time it's just
amazing okay you've achieved in the eyes of
many people legend status as a chip
architect what design creation are you
most proud of
perhaps because it was challenging
because of its impact or because of the
set of brilliant ideas that that were
involved in well I find that description
odd and I have two small children and I
promise you they think it's hilarious
this question yeah so I do so I'm
really interested in building
computers and I've worked with really
really smart people
I'm not unbelievably smart I'm
fascinated by how they go together both
as a thing to do and as an endeavor
that people do how people and computers
go together yeah like how people think
and build a computer and I find
sometimes that the best computer
architects aren't that interested in
people or the best people managers
aren't that good at designing computers
so the whole stack of human beings is
fascinating so the managers individual
engineers yeah I just I said I realized
after a lot of years of building
computers where you sort of build them
out of the transistors logic gates
functional units
computational elements that you could
think of people the same way so people
are functional units yes and then you
can think of organizational design as
a computer architecture problem and
then it's like oh that's super cool
because the people are all different
just like the computational elements are
all different and they like to do
different things and and so I had a lot
of fun like reframing how I think about
organizations just like with with
computers we were saying execution paths
you can have a lot of different paths
that end up at the same good
destination so what have you learned
about the human abstractions from
individual functional human units to the
broader organization what does it take
to create something special well most
people don't think simple enough all
right so do you know the difference
between a recipe and understanding
there's probably a philosophical
description of this so imagine you can
make a loaf of bread yeah the recipe
says get some flour add some water add
some yeast mix it up let it rise put it
in a pan put it in the oven it's a
recipe right understanding bread you can
understand biology supply chains
you know grain grinders yeast physics
you know thermodynamics like there's so
many levels of understanding there and
then when people build and design things
they frequently are executing some stack
of recipes right and the problem with
that is the recipes all have a limited
scope look if you have a really good
recipe book for making bread it won't
tell you anything about how to make an
omelet right right but if you have a
deep understanding of cooking right then
bread omelets you know sandwich you know
there's there's a different you know way
of viewing everything and most people
when you get to be an expert at
something you know you're you're hoping
to achieve deeper understanding not just
a large set of recipes to go execute
and it's interesting to walk groups of
people because executing recipes is
unbelievably efficient if it's what you
want to do if it's not what you want to
do you're really stuck and and that
difference is crucial and ever and
everybody has a balance of let's say
deeper understanding and recipes and some
people are really good at recognizing
when the problem is to understand
something deeply does that make sense
it totally makes sense at every
stage of development is deep
understanding on the team needed oh this
goes back to the art versus science
question sure if you constantly unpacked
everything for deeper understanding you
never get anything done right and if you
don't unpack understanding when you need
to you'll do the wrong thing and then at
every juncture like human beings are
these really weird things because
everything you tell them has a million
possible outputs all right and then they
all interact in a hilarious way and then
having some intuition about what you
tell them what you do when do you
intervene when do you not it's it's
complicated all right so it's you know
essentially computationally unsolvable
yeah it's an intractable problem sure
humans are a mess but with deep
understanding do you mean also sort of
fundamental questions of things like
what is a computer or why like think the
why question is why are we even building
this like of purpose or do you mean more
like going towards the fundamental
limits of physics sort of really getting
into the core of the science well in terms
of building a computer think simple
think a little simpler so common
practice is you build a computer and
then when somebody says I want to make
it 10% faster you'll go in and say
alright I need to make this buffer
bigger and maybe I'll add an add unit or
you know I have this thing that's three
instructions wide I'm going to make it
four instructions wide and what you see
is each piece gets incrementally more
complicated right
and then at some point you hit this
limit like adding another feature or a
buffer doesn't seem to make it any
faster and then people say well that's
because it's a fundamental limit and
then somebody else will look at it and say
well actually the way you divided the
problem up and the way that different
features are interacting is limiting you
and it has to be rethought rewritten
right so then you refactor it and
rewrite it and what people commonly find
is the rewrite is not only faster but
half as complicated from scratch yes so
how often in your career have
you seen it is needed maybe more generally
to just throw the whole thing out
this is where I'm on one end of it every
three to five years which end are you on
like rewrite more often right and three
to five years is if you want to really
make a lot of progress on computer
architecture every five years you should
do one from scratch so where does the
x86-64 standard come in or what how
often do you I was the
co-author of that spec in 98 that's 20
years ago yeah so that's still around
the instruction set itself has been
extended quite a few times yes and
instruction sets are less interesting
than the implementation underneath there's
been on x86 architecture Intel's
designed a few AMD's designed a few
very different architectures and I don't
want to go into too much of the detail
about how often but it's there's a
tendency to rewrite it every you know 10
years and it really should be every five
so you're saying you're an outlier in
that sense you rewrite more often
rewrite more often well and isn't that scary
yeah of course
well scary to who to everybody involved
because like you said repeating the
recipe is efficient companies want to
make money well no the individual
engineers want to succeed so you want to
incrementally improve increase the
buffer from three to four well we get
into the diminishing return curves I
think Steve Jobs said this right so
every
you have a project and you start here
and it goes up and you have diminishing
return and to get to the next level you
have to do a new one in the initial
starting point will be lower than the
old optimization point but it'll get
higher so now you have two kinds of fear
short-term disaster and long-term
disaster and you're you're wrong right
like you know people with a quarter by
quarter business objective are terrified
about changing everything yeah and
people who are trying to run a business
or build a computer for a long term
objective know that the short-term
limitations block them from the long
term success so if you look at leaders
of companies that had really good
long-term success every time they saw
that they had to redo something they did
and so somebody has to speak up or you
do multiple projects in parallel like
you optimize the old one while you build
a new one and but the marketing guys
they're always like promise me that
the new computer is faster on every
single thing and the computer architect
says well the new computer will be
faster on the average but there's a
distribution of results in performance
and you'll have some outliers that are
slower and that's very hard because they
have one customer who cares about that one
so speaking of the long-term for over 50
years now
Moore's law has served for me and
millions of others as an inspiring
beacon of what kind of amazing future
brilliant engineers can build no I'm
just making your kids laugh all of today
that was great so first in your eyes what
is Moore's law if you could define for
people who don't know well the simple
statement was from Gordon Moore was
double the number of transistors every
two years something like that and then
my operational model is we increased the
performance of computers by 2x every 2
or 3 years and it's wiggled around
substantially over time and also in how
we deliver performance has changed
but the foundational idea was 2x the
transistors every two years the current
cadence is something like they call it a
shrink factor like point six every two
years which is not 0.5 but that that's
referring strictly again to the original
definition of transistor count a shrink
factor is just getting them smaller and
smaller well it's used for a constant chip
area if you make the transistor smaller
by 0.6 then you get 1 over 0.6 more
transistors so can you linger a little
longer what's what's a broader what do
you think should be the broader
definition of Moore's law we mentioned
before how you think of performance just
broadly what's a good way to think about
Moore's law well first of all so I I've
been aware of Moore's law for 30 years
in what sense well I've been designing
computers for 40 just watching it before
your eyes kind of slow and somewhere
where I became aware of it I was also
informed that Moore's law was gonna die
in 10 to 15 years and I thought that was
true at first but then after 10 years it
was gonna die in 10 to 15 years and then
at one point it was gonna die in 5 years
and then it went back up to ten years
and at some point I decided not to worry
about that particular
prognostication for the rest of my life
which is which is fun and then I joined
Intel and everybody said Moore's law is
dead and I thought that's sad because
it's the Moore's law company and it's
not dead and it's always been gonna die
and you know humans like these
apocryphal kind of statements like we'll
run out of food or run out of air or you
know something right but it's still
incredible that it's lived for as long as it
has and yes
there's many people who believe now that
Moore's Law is dead you know they can
join the last 50 years of people who had
the same idea yeah there's a long tradition but
why do you think if you can
try to understand it why do you think it's
not dead well first let's just
think people think Moore's law is one
thing transistors get smaller but
actually under the sheet there's literally
thousands of innovations and almost all
those innovations have their own
diminishing return curves so if you
graph it it looks like a cascade of
diminishing return curves I don't know
what to call that but the result is an
exponential curve at least it has been
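
One way to see how a cascade of diminishing-return curves can add up to an exponential is a toy model in which innovations arrive at a steady pace and each one contributes a bounded multiplicative gain that saturates. Everything in this sketch is an assumption chosen for illustration, not data from the conversation: the function name, a new innovation every two years, each eventually worth a factor of two, and a saturation time constant of 1.5 years.

```python
import math

def cascade(t, gain=1.0, spacing=2.0, tau=1.5, count=40):
    """Relative capability at time t (years) from `count` staggered innovations.

    Innovation k starts at time k * spacing and multiplies capability by a factor
    that rises from 1 toward (1 + gain) with diminishing returns.
    """
    total = 1.0
    for k in range(count):
        age = t - k * spacing
        if age > 0:
            total *= 1.0 + gain * (1.0 - math.exp(-age / tau))
    return total

for year in range(0, 21, 4):
    print(f"year {year:2d}: {cascade(year):10.1f}x")

# Growth over one spacing interval settles at (1 + gain), i.e. a steady doubling here.
print("ratio over two years at year 20:", cascade(22) / cascade(20))
```

Each individual curve flattens out, yet the product keeps doubling at a steady rate because a fresh curve is always starting as an older one plateaus. In this model the exponential survives only as long as new innovations keep arriving on schedule.
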
so and we keep inventing new things so
if you're an expert in one of the things
on a diminishing return curve right and
you can see its plateau you will
probably tell people well this is this
is done meanwhile some other pile of
people are doing something different so
that's that's just normal so then
there's the observation of how small
could a switching device be so a modern
transistor is something like a thousand
by a thousand by thousand atoms right
and you get quantum effects down around
two to ten atoms so you can
imagine the transistor as small as 10 by
10 by 10 so that's a million times
smaller and then the quantum
computational people are working away at
how to use quantum effects so a thousand
by thousand by thousand atoms it's a
really clean way of putting it well a fin
like a modern transistor if you look at
the fin it's like a hundred and twenty
atoms wide but we can make that thinner
and then there's there's a gate wrapped
around it and then there's spacing there's a
whole bunch of geometry and you know a
competent transistor designer could
count all the atoms in every single
direction like there's techniques now to
already put down atoms in a single
atomic layer and you can place atoms if
you want to it's just you know from a
manufacturing process if placing an atom
takes ten minutes and you need to put
you know 10 to the 23rd atoms together
to make a computer it would take a long
time so the the methods are you know
both shrinking things and then coming up
with effective ways to control what's
happening manufacture stably and cheaply
yeah so the innovation stack's pretty
broad you know there
there's equipment there's optics there's
chemistry there's physics there's
material science there's metallurgy
there's lots of ideas about when you put
different materials together how they
interact are they stable is it stable over
temperature you know like are they
repeatable you know there's look there's
like literally thousands of technologies
involved but just for the shrinking you
don't think we're quite yet close to the
fundamental limit in physics I did a
talk on Moore's Law and I asked for a
road map to a path of 100 and after two
weeks they said we only got to fifty a
hundred what's a 100 a 100x
shrink we only got to fifty I said why don't
you give it another two weeks well here's the
thing about Moore's law right so I
believe that the next 10 or 20 years of
shrinking is going to happen right now
as a computer designer there's you have
two stances you think it's going to
shrink in which case you're designing
and thinking about architecture in a way
that you'll use more transistors or
conversely not be swamped by the
complexity of all the transistors you
get right you have to have a strategy
you know so you're open to the
possibility and waiting for the
possibility of a whole new army of
transistors ready to work I'm
expecting more transistors every two or
three years by a number large enough
that how you think about design how you
think about architecture has to change
like imagine you build
buildings out of bricks and every
year the bricks are half the size or
every two years well if you kept
building with bricks the same way you know so
many bricks per person per day the
amount of time to build a building would
go up exponentially right right but if
you said I know that's coming so now I'm
going to design equipment that moves
bricks faster uses them better because
maybe you're getting something out of
the smaller bricks more strength thinner
walls you know less material efficiency
out of that so once you have a roadmap
with what's going to happen transistors
we're gonna get more
of them then you design
all this collateral around it to take advantage
of it and also to cope with it like
that's the thing people don't understand
it's like if I didn't believe in Moore's
law and Moore's law transistors showed
up my design teams would all drown so
what's the hardest part of
this flood of new transistors I mean
even if you just look historically
throughout your career what's the
thing what fundamentally changes
when you add more transistors in the
task of designing an architecture no
there's there's two constants right one
is people don't get smarter I think by
the way there's some science showing that we
do get smarter because of nutrition or
whatever
sorry to bring that up the Flynn effect yes nobody
understands it nobody knows if it's
still going on so that's all or whether
it's real or not but yeah that's a I
sort of Amen but not if I believe for
the most part people aren't getting much
smarter the evidence doesn't support it
that's right and then teams can't grow
that much right all right so human
beings understand you know we're really
good in teams of ten you know up to
teams of a hundred they can know each
other beyond that you have to have
organizational boundaries so
those are pretty hard
constraints all right so then you have
to divide and conquer like as the
designs get bigger you have to divide it
into pieces you know the power of
abstraction layers is really high we
used to build computers out of
transistors now we have a team that
turns transistors into logic cells and
another team that turns them into functional
units and another one that turns them into
computers right so we have abstraction
layers in there and you have to think
about when do you shift gears on that we
also use faster computers to build
faster computers so some algorithms run
twice as fast on new computers but a lot
of algorithms are N squared so you know
a computer with twice as many
transistors in it might take four
times as long to run so you have to
refactor the software like simply
using faster computers to build bigger
computers doesn't work so so you have to
think about all these things so in terms
of computing performance and the
exciting possibility that more powerful
computers bring
is shrinking the thing we've been
talking about for you one of
the biggest exciting possibilities of
advancement in performance or are there
other directions that you're
interested in like in the direction
of sort of enforcing given parallelism
or like doing massive parallelism in
terms of many many CPUs you know
stacking CPUs on top of each other that
kind of that kind of parallelism or you
kind of well think about it a different
way so old computers you know slow
computers you said A equal B plus C
times D pretty simple right and then we
made faster computers with vector units
and you can do proper equations and
matrices right and then modern like AI
computations or like convolutional
neural networks where you convolve one
large data set against another and so
there's sort of this hierarchy of
mathematics
you know from simple equation to linear
equations to matrix equations to it's a
deeper kind of computation and the data
sets are getting so big that people are
thinking of data as a topology problem
you know data is organized in some
immense shape and then the computation
which sort of wants to get data from that
immense shape and do some computation on
it so what computers have allowed
people to do is have algorithms go
much much further so that that paper you
you reference the Sutton paper they
talked about you know like when AI
started it was apply rule sets to
something that's a very simple
computational situation and then when
they did the first chess thing they solved
deep searches so have a huge database of
moves and results deep search but it's
still just a search right now we we take
large numbers of images and we use it to
train these weight sets that we convolve
across it's a completely different kind
of phenomena we call that AI now they're
doing the next generation and if you
look at it they're going up this mathematical
graph right and then
both computation and data sets support
going up that graph yeah the kind of
computation I mean I would argue
that all of it is still a search right
just like you said a topology problem
with data sets you're searching the data sets
for valuable data and also the actual
optimization of neural networks is a kind
of search for the I don't know if you've
looked at the inner layers of finding a
cat it's not a search it's a set of
endless projections so you know a
projection here's a shadow of this
phone yeah right then you can have a
shadow of that onto something and a shadow
of that onto something if you look in the
layers you'll see this layer actually
describes pointy ears and round eyeness
and fuzziness and but the computation to
tease out the attributes is not search
right I mean like the inference part
might be search but the training's not
search okay well and then in deep
networks they look at layers and they
don't even know what's represented and yet
if you take the layers out it doesn't
work okay so I don't think it's search
all right well but you have to talk to
a mathematician about what that
actually is oh you disagree but the the
it's just semantics I think it's not but
it's certainly not I would say it's
absolutely not semantics but okay all
right well if you want to go there so
optimization to me is search and we're
trying to optimize the ability of a
neural network to detect cat ears and
this difference between chess and the
space the incredibly multi-dimensional
hundred thousand dimensional space that
you know neural networks are trying to optimize
over is nothing like the chessboard
database so it's a totally different
kind of thing and okay in that sense you
can say yeah yeah you know I could see
how you you might say if if you the
funny thing is it's the difference
between given search space and found
search space exactly yeah maybe that's a
different way
beautiful but okay but you're saying
what's your sense in terms of the basic
mathematical operations and the
architectures the computer hardware that
enables those operations do you see the
CPUs of today still being a really core
part of executing those mathematical
operations yes
well the operations you know continue to
be add subtract load store compare and
branch it's it's remarkable so it's it's
interesting that the building blocks of
you know computers are transistors and
you know under that atoms so you got
atoms transistors logic gates computers
right you know functional units and
computers the building blocks of
mathematics at some level are things
like adds and subtracts and multiplies
but the space mathematics can
describe is I think essentially infinite
but the computers that run the
algorithms are still doing the same
things now a given algorithm may say I
need sparse data or I need 32-bit data
or I need you know like a convolution
operation that naturally takes 8-bit
data multiplies it and sums it up a
certain way so the like the data types
in TensorFlow imply an optimization set
but when you go right down and look at the
computers it's AND and OR gates doing adds and multiplies
like that hasn't changed much
now the quantum researchers think
they're going to change that radically
and then there's people who think about
analog computing because you look in the
brain and it seems to be more analog ish
you know that maybe there's a way to do
that more efficiently but we have a
million x on computation and I don't
know the relationship
between computational let's say
intensity and ability to hit
mathematical abstractions I don't know
any way to describe that but just like
you saw in AI you went from rule sets
to simple search to complex search to
say found search like those are you know
orders of magnitude more computation to
do
and as we get the next two orders of
magnitude your friend Raja Koduri said
like every order of magnitude change in the
computation fundamentally changes what
the computation is doing here oh you
know the expression the difference in
quantity is the difference in kind you
know the difference between ant and ant
hill right or neuron and brain you know
there's just this indefinable
place where the quantity changes the
quality right now we've seen that happen
in mathematics multiple times and you
know my my guess is it's gonna keep
happening so your sense is yeah if you
focus head down on shrinking the
transistor well it's not just head down
we're aware of the software stacks
that are running and the computational
loads and we're kind of pondering what
do you do with a petabyte of memory that
wants to be accessed in a sparse way and
have you know the kind of calculations
ai programmers want so there's that
there's a dialog interaction but when
you go in the computer chip you know you
find adders and subtractors and
multipliers and so if you zoom out then
with as you mentioned Rich Sutton the
idea that most of the development in the
last many decades in the AI research
came from just leveraging computation
and just the simple algorithms waiting
for the computation to improve well
software guys have a thing that they
call the problem of early
optimization right so if you write a big
software stack and if you start
optimizing like the first thing you
write the odds of that being the
performance limiter is low but when you
get the whole thing working can you make
it 2x faster by optimizing the right
things sure while you're optimizing that
could you have written a new software
stack which would have been a better
choice maybe now you have creative
tension so but the whole time as you're
doing the writing of the
software we're talking about the
hardware underneath gets faster which
goes back to Moore's law if Moore's
Law is going to continue then your AI
research
should expect that to show up and then
you make a slightly different set of
choices than if we've hit the wall
nothing's gonna happen and from here
it's just us rewriting algorithms like
that seems like a failed strategy for
the last 30 years of Moore's law's death
so so can you just linger on it I think
you've answered it but I'll just ask the
same dumb question over and over so
why do you think Moore's law is not
going to die which is the most promising
exciting possibility of why it won't
die in the next five 10 years so is it the
continued shrinking of the transistor or is
it another s-curve that steps in and
totally sort of well shrinking the transistor
is literally thousands of innovations
right so there's so this they're all
answers and it's there's a whole bunch
of s-curves just kind of running their
course and being reinvented and new
things you know the the semiconductor
fabricators and technologists have all
announced what's called nano wires so
they took a fin which had a gate
around it and turned that into a little
wire so you have better control of that and
they're smaller and then from there
there's some obvious steps about how to
shrink that so the metallurgy around
wire stacks and stuff has very obvious
abilities to shrink and you know there's
a whole combination of things there to
do your sense is that we're gonna get a
lot of innovation from just that
shrinking yeah like a factor of a
hundred is a lot yeah I would say that's
incredible and it's totally it's only 10
or 15 years now you're smarter you might
know but to me it's totally
unpredictable of what that hundred x
would bring in terms of the nature of
the computation that people do yeah are you
familiar with Bell's law so for a long
time it was mainframes
minis workstation PC mobile Moore's Law
drove faster smaller computers right and
then we were thinking about Moore's law
Raja Koduri said every 10x generates a
new computation so scalar vector
matrix topological computation right and
if you go look at the industry trends
there was you know mainframes and
mini-computers and PCs and then the
internet took off and then we got mobile
devices and now we're building 5g
wireless with one millisecond latency
and people are starting to think about
the smart world where everything knows
you recognizes you like like like the
transformations are going to be like
unpredictable how does it make you feel
that you're one of the key architects of
this kind of futures you're not we're
not talking about the architects of the
high-level people who build the Angry
Bird apps and flapping
Angry Bird of who knows we're gonna be
that's the whole point of the universe
let's take a stand at that and the
attention distracting nature of mobile
phones I'll take a stand but anyway in
terms of that matters much the the side
effects of smartphones or the attention
distraction which part well who knows
you know where this is all leading it's
changing so fast my parents used to yell at
my sisters for hiding in the closet
with a wired phone with a dial on it
stop talking to your friends all day right
now my wife yells at my kids for
talking to their friends all day on text
looks the same to me it's always it's
echoes of the same thing okay but you
are the one of the key people
architecting the hardware of this future
how does that make you feel do you feel
responsible do you feel excited so we're
we're in a social context so there's
billions of people on this planet there
are literally millions of people working
on technology I feel lucky to be you
know doing what I do and getting
paid for it and there's an interest in
it but there's so many things going on
in parallel it's like the actions are so
unpredictable if I wasn't here somebody
else would be doing it the vectors of all
these different things are happening all
the time you know there's a
I'm sure some philosopher or meta
philosopher is you know wondering about
how we transform our world so you can't
deny the fact that these tools whether
that these tools are changing our world
that's right do you think it's changing
for the better so some of these I read
this thing recently it said the
two disciplines with the highest GRE
scores in college are physics and
philosophy right and they're both sort
of trying to answer the question why is
there anything right and the
philosophers you know are on the kind
of theological side and the physicists
are obviously on the you know the
material side and there's a hundred
billion galaxies with a hundred billion
stars it seems well repetitive at best
so you know we're on our way to ten
billion people I mean it's hard to say
what it's all for is that's what you're
asking
yeah I guess I am things do tend to
significantly increase in complexity
and I'm curious about how computation
like like our world our physical world
inherently generates mathematics it's
kind of obvious right so we have X Y Z
coordinates you take a sphere you make
it bigger you get a surface that
you know grows by r-squared like it
generally generates mathematics and the
mathematicians and the physicists have
been having a lot of fun talking to each
other for years and computation has been
let's say relatively pedestrian like
computation in terms of mathematics has
been doing binary algebra while
those guys have been gallivanting
through the other realms of possibility
right now recently the computation lets
you do mathematical
computations that are sophisticated
enough that nobody understands how the
answers came out right machine learning
machine learning yeah it used to be you get
data set you guess at a function the
function is considered physics if it's
predictive of new functions
data sets modern you can take a large
data set with no intuition about what it
is and use machine learning to find a
pattern that has no function right and
it can arrive at results that I don't
know if they're completely
mathematically describable so
computation has kind of done something
interesting compared to A equal B plus C
there's something reminiscent of that
step from the basic operations of
addition to taking a step towards neural
networks that's reminiscent of what
life on Earth and its origins was doing
do you think we're creating sort of the
next step in our evolution in creating
artificial intelligence systems that I
don't know I mean you know if there's so
much in the universe already it's hard
to say where we're standing in this whole thing are
human beings working on additional
abstraction layers and possibilities yeah it
appears so does that mean that human
beings don't need dogs you know no like
like there's so many things that are all
simultaneously interesting and useful
but you've seen I agree you've
seen greater and greater levels of
abstraction built in artificial
machines right do you think when you
look at humans you think that the look
of all life on earth as a single
organism building this thing this
machine with greater and greater levels
of abstraction do you think humans are
the peak the top of the food chain in
this long arc of history on earth or do
you think we're jus

Summary

# Exclusive Interview with Jim Keller: Chip Design Philosophy, the Future of AI, and the Evolution of Computing

### Executive Summary
This video covers an in-depth interview with Jim Keller, the legendary figure behind the AMD K7, K8, and Zen and the Apple A4/A5 chip architectures, on the fundamental differences between the human brain and modern computers. Keller explains the complexity of microprocessor design, the relevance of Moore's Law today, and his views on the evolution of artificial intelligence (AI) and autonomous cars. The discussion also touches on the philosophy of *first principles thinking*, the importance of understanding fundamentals rather than merely following a "recipe", and how technology will keep transforming exponentially.

---

### Key Takeaways
*   **Brain vs. Computer:** The human brain operates as a *mesh* with distributed information, whereas a computer uses global memory that is separate from its compute units.
*   **Parallelism:** Modern computers reach high speed through *found parallelism* (discovering opportunities for parallel execution) and *branch prediction* (predicting the next step with high accuracy).
*   **Design Philosophy:** Keller advocates a *first principles* approach and rewriting an architecture every 3-5 years to avoid the trap of unnecessary complexity, rather than making incremental fixes for 10 years.
*   **Moore's Law:** Moore's Law is not dead; it is the result of thousands of small innovations across many fields (physics, chemistry, materials) that together produce exponential growth.
*   **AI and Computation:** Modern AI is moving from simple data *search* toward complex pattern *projection* using artificial neural networks.
*   **Autonomous Cars:** The biggest challenge in autonomous cars is not compute speed but cost, energy efficiency, and the ability to understand the intent (mental model) of other drivers.
*   **Life Advice:** It is important to know yourself, discard wrong assumptions, and find meaning in uncertainty.

---

### Detailed Breakdown

#### 1. Differences Between the Human Brain and Computer Architecture
Jim Keller opens the discussion by comparing how the brain and the computer work. The mechanisms of the human brain are not fully understood, but the architecture is clearly different:
*   **Memory Structure:** A computer has memory that is *decoupled* from its processing units (global memory), while the brain is a *mesh* in which information is distributed.
*   **Abstraction Layers:** Computers are built from layers of abstraction: atoms -> materials -> transistors -> logic gates -> functional units -> processing elements -> software.
*   **Instruction Set:** The basic language of a computer (x86, ARM) consists of simple operations (load, store, math, branch). About 90% of program execution uses only around 25 of these basic instructions.
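
To make the "simple operations" point concrete, here is a minimal sketch of a toy register machine, written in Python purely for illustration. The opcode names (`load`, `store`, `add`, `mul`, `branch_if_zero`) and the `run` function are hypothetical and do not correspond to any real instruction set such as x86 or ARM; the program evaluates the `a = b + c * d` example Keller mentions in the conversation.

```python
def run(program, memory):
    """Execute a list of (opcode, *operands) tuples on a toy register machine."""
    regs = {}
    pc = 0  # program counter: index of the next instruction
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "load":              # load reg, addr
            regs[args[0]] = memory[args[1]]
        elif op == "store":           # store reg, addr
            memory[args[1]] = regs[args[0]]
        elif op == "add":             # add dst, src1, src2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "mul":             # mul dst, src1, src2
            regs[args[0]] = regs[args[1]] * regs[args[2]]
        elif op == "branch_if_zero":  # branch_if_zero reg, target_index
            if regs[args[0]] == 0:
                pc = args[1]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return memory


# a = b + c * d, expressed with loads, one multiply, one add, and a store.
memory = {"a": 0, "b": 2, "c": 3, "d": 4}
program = [
    ("load", "r1", "b"),
    ("load", "r2", "c"),
    ("load", "r3", "d"),
    ("mul", "r2", "r2", "r3"),
    ("add", "r1", "r1", "r2"),
    ("store", "r1", "a"),
]
result = run(program, memory)
```

Even this handful of opcodes is enough to express arithmetic and control flow, which is the sense in which a small set of basic instructions can dominate execution.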

#### 2. Performance, Branch Prediction, and Determinism
To reach high speed, modern computers use sophisticated techniques for handling the instruction stream:
*   **Found Parallelism:** Modern programs run sequentially, like a narrative, but the computer looks for "pockets" of parallelism so it can execute many instructions at once (*out of order*).
*   **Branch Prediction:** Because branches occur often (on average every 6 instructions), the computer has to predict which path will be taken. Accuracy has risen from 85% (20 years ago) to 99% (today) using techniques that resemble small neural networks. A misprediction wastes time (pipeline flush), but a correct prediction speeds the system up considerably.
*   **Determinism:** There is a tension between the need for *deterministic* answers for debugging and HPC and the *noisy* computation that modern AI tolerates. Developers prefer determinism because it makes errors easier to trace.
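
As a rough illustration of branch prediction, the sketch below implements a classic 2-bit saturating counter. This is a deliberately simple textbook technique, not the neural-network-like predictors the interview attributes to modern processors; the function name `predict_accuracy` and the loop-shaped branch history are assumptions made for the example.

```python
def predict_accuracy(outcomes):
    """Return the fraction of branch outcomes a 2-bit saturating counter gets right.

    outcomes is a non-empty sequence of booleans (True means the branch was taken).
    """
    counter = 2  # states 0-1 predict "not taken", 2-3 predict "taken"
    correct = 0
    for taken in outcomes:
        prediction = counter >= 2
        if prediction == taken:
            correct += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


# A loop branch that is taken 9 times and then falls through, repeated 10 times.
loop_pattern = ([True] * 9 + [False]) * 10
accuracy = predict_accuracy(loop_pattern)
```

On this pattern the counter only mispredicts the loop exit, so it is right nine times out of ten; closing the remaining gap is where more elaborate predictors earn their keep.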

#### 3. Recipe vs. Understanding, and Organizational Management
Keller applies computer engineering principles to management and learning:
*   **Recipe vs. Understanding:** A "recipe" is a step-by-step procedure followed without deep understanding (for example, baking bread). "Understanding" means knowing the underlying principles (biology, physics), which gives flexibility (you can make bread, eggs, or a sandwich).
*   **Organizational Design:** He views people as "functional units" with different skills. Humans are an *intractable* computational problem, so management requires intuition.
*   **The Importance of Rewrites:** Many engineers only make *incremental* fixes that add complexity. Keller recommends rewriting an architecture from scratch every 3-5 years to get a faster and simpler result, setting aside short-term fear for long-term gain.

#### 4. Moore's Law and the Limits of Physics
On the future of the transistor and Moore's Law:
*   **Thousands of Innovations:** Moore's Law is not one thing but the aggregate of thousands of small innovations (optics, chemistry, metallurgy). When one innovation slows down, another takes over.
*   **Room to Shrink:** A modern transistor is roughly 1,000 atoms on a side (1000 x 1000 x 1000). In theory there is still room to shrink to 10 x 10 x 10 atoms (a million times smaller) before quantum effects become the main problem.
*   **Design Challenges:** With more transistors, the challenge shifts to software. Design algorithms are often quadratic (*N squared*), so a chip with twice as many transistors can take four times as long to process unless the software is refactored.
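
The two numbers quoted above follow from simple arithmetic, sketched here in Python. The function names `volume_shrink_factor` and `relative_runtime` are hypothetical helpers introduced for this example; the inputs (1,000 and 10 atoms per side, a doubling of transistor count, an N-squared tool) come from the conversation.

```python
def volume_shrink_factor(side_now, side_target):
    """Ratio of transistor volumes, with each side measured in atoms."""
    return side_now ** 3 / side_target ** 3


def relative_runtime(size_ratio, exponent=2):
    """Runtime multiplier when a design grows by size_ratio and the tool is O(N**exponent)."""
    return size_ratio ** exponent


shrink = volume_shrink_factor(1000, 10)  # 1000^3 atoms down to 10^3 atoms
slowdown = relative_runtime(2)           # twice the transistors, N-squared algorithm
```

Here `shrink` comes out to one million and `slowdown` to four, matching the "million times smaller" and "four times as long" figures in the discussion.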

#### 5. The Evolution of AI and Mathematics
How computation changes the way we approach data:
*   **From Rules to Patterns:** AI evolved from simple rule sets, to deep search (as in chess), to convolution over large image datasets.
*   **Search vs. Projection:** There is a debate over whether AI is *search* or *projection*. Keller tends to see it as a projection of attributes in a complex, high-dimensional space.
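
Keller describes a convolution as an operation that takes small data values, multiplies them, and sums them up. A minimal one-dimensional sketch of that multiply-accumulate pattern follows; `convolve1d` is a hypothetical helper, it uses the unflipped "valid" form common in neural-network libraries, and the tiny signal and kernel are made-up values rather than anything from the interview.

```python
def convolve1d(signal, kernel):
    """Slide kernel across signal; each output is one multiply-accumulate window."""
    n_outputs = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n_outputs)
    ]


# A difference kernel responds where the signal steps up or down.
edges = convolve1d([0, 0, 1, 1, 1, 0], [-1, 1])
```

The result is nonzero only at the two positions where the signal changes, a small-scale version of how a learned kernel can pick out an attribute such as an edge.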

---

## Conclusion & Closing Message
The interview with Jim Keller underlines that sustained technological progress, in both chip design and AI development, depends on understanding *first principles* rather than merely following existing recipes. He stresses that innovation is the accumulation of thousands of small improvements together with the courage to redesign systems in order to avoid unnecessary complexity. His main message is to keep learning, discard wrong assumptions, and understand the essence behind each problem so as to adapt to exponential change in the world of computing.


file updated 2026-02-13 13:22:36 UTC