Jim Keller: Abstraction Layers from the Atom to the Data Center | AI Podcast Clips
7bLeQFhPwzk • 2020-02-16
So let's get into the basics before we zoom back out. How do you build a computer from scratch? What is a microprocessor? What is a microarchitecture? What's an instruction set architecture? Maybe even as far back as: what is a transistor?

So the special charm of computer engineering is that there's a relatively good understanding of abstraction layers. Down at the bottom you have atoms, and atoms get put together into materials like silicon, or doped silicon, or metal, and we build transistors on top of that. We build logic gates, and then functional units, like an adder or a subtractor or an instruction-parsing unit, and we assemble those into processing elements. Modern computers are built out of probably 10 to 20 locally organized, coherent processing elements.
And then that runs computer programs. So there are abstraction layers, and then in software there's an instruction set you run, and then there's assembly language, C, C++, Java, JavaScript. There are abstraction layers essentially from the atom to the data center. So when you build a computer, first there's a target: what's it for? How fast does it have to be? Today there's a whole bunch of metrics about what that is. And then, in an organization of a thousand people who build a computer, there are lots of different disciplines that you have to operate on. Does that make sense?
So there's a bunch of levels of abstraction, organizational, I can tell, and in your own vision, and there's a lot of brilliance that comes in at every one of those layers. Some of it is science, some of it is engineering, some of it is art. If you could pick favorites, what's the most important, your favorite layer in these layers of abstraction? Where does the magic enter this hierarchy?

I don't really care. That's the fun, you know; I'm somewhat agnostic to that. So I would say, for relatively long periods of time,
instruction sets are stable. The x86 instruction set, the ARM instruction set. What's an instruction set? It says how you encode the basic operations: load, store, multiply, add, subtract, conditional branch. There aren't that many interesting instructions. If you look at a program, 90 percent of its execution is on 25 opcodes, 25 instructions, and those are stable. What does it mean, stable? The Intel architecture has been around for 25 years. It works, and that's because the basics were defined a long time ago. Now,
the way an old computer ran is you fetched instructions and you executed them in order: do the load, do the add, do the compare. The way a modern computer works is you fetch large numbers of instructions, say 500, and then you find the dependency graph between the instructions, and then you execute those little micro-graphs in independent units. So a modern computer... people like to say computers should be simple and clean, but it turns out the market for simple, clean, slow computers is zero. We don't sell any simple, clean computers. Now, how you build it can be clean, but the computer people want to buy, say in a phone or a data center, fetches a large number of instructions, computes the dependency graph, and then executes it in a way that gets the right answers and optimizes that graph somehow.
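The fetch-and-find-dependencies step can be sketched in software: treat each instruction as a set of registers it reads and one it writes, and add an edge wherever an instruction reads a value an earlier one produced. This is a toy illustration; the instruction encoding is invented for the example, and real hardware does this with renaming tables, not Python.

```python
# Toy sketch: find the dependency graph in a window of fetched "instructions".
# Each instruction names the register it writes and the registers it reads.
# Instructions with no edge between them may execute in parallel.

def dependency_graph(instrs):
    """Return edges (i, j): instruction j needs the result of instruction i."""
    edges = []
    last_writer = {}                # register -> index of last instruction writing it
    for j, (dst, srcs) in enumerate(instrs):
        for s in srcs:              # read-after-write: a true data dependency
            if s in last_writer:
                edges.append((last_writer[s], j))
        last_writer[dst] = j
    return edges

# r1 = load, r2 = load, r3 = r1 + r2, r4 = r1 * 2: the adds wait on the loads,
# but instructions 2 and 3 are independent of each other.
window = [("r1", []), ("r2", []), ("r3", ["r1", "r2"]), ("r4", ["r1"])]
print(dependency_graph(window))     # [(0, 2), (1, 2), (0, 3)]
```

Only true (read-after-write) dependencies are modeled here; real out-of-order machines also rename registers to remove write-after-write and write-after-read hazards.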
Yeah, they run deeply out of order, and then there are semantics around how memory ordering works and how other things work. So the computer has a bunch of bookkeeping tables that say what order these operations finish in, or appear to finish in. But to go fast, you have to fetch a lot of instructions and find all the parallelism. Now,
there's a second kind of computer, which we call GPUs today, and I call out the difference. There's found parallelism: you have a program with a lot of dependent instructions, you fetch a bunch, and then you go figure out the dependency graph, and you issue instructions out of order. That's because you have one serial narrative to execute, which in fact can be done out of order.

You call it a narrative?

Yeah. So humans think in serial narratives. So: read a book. There's a sentence, after sentence, after sentence, and there are paragraphs. Now, you could diagram that. Imagine you diagrammed it properly and you asked, which sentences could be read in any order without changing the meaning?

That's a fascinating question to ask of a book.

Yeah, you could do that.
So some paragraphs could be reordered, and some sentences can be reordered. You could say, "He is tall and smart and X," and it doesn't matter the order of tall and smart. But if you say, "The tall man is wearing the red shirt," what color is... you know, you can create dependencies. And so
GPUs, on the other hand, run simple programs on pixels, but you're given a million of them, and to first order, the screen you're looking at doesn't care which order you do them in. So I call that given parallelism: simple narratives over large numbers of things, where you can just say it's parallel because you told me it was. Versus found parallelism, where the narrative is sequential but you discover little pockets of parallelism, which turn out to be large pockets of parallelism.

So how hard is it to discover?

Well, how hard is it?
That's just transistor count. So once you crack the problem, you say: here's how you fetch ten instructions at a time, here's how you calculate the dependencies between them, here's how you describe the dependencies. These are the pieces. So once you describe the dependencies, then it's just a graph; sort of, it's an algorithm that finds... I'm sure there's a graph-theoretical answer here that's solved. But in general,
in modern programs that human beings write, how much found parallelism is there?

About 10x.

What does 10x mean?

Well, if you execute it in order, you would get what's called cycles per instruction, and it would be about three cycles per instruction, because of the latency of the operations and stuff. A modern computer executes it at more like 0.2 to 0.25 cycles per instruction. So it's about 10x with today's machines.
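As arithmetic, the 10x falls straight out of the ratio of cycles per instruction; a quick sketch using the figures just quoted:

```python
# Cycles per instruction (CPI): roughly 3 for in-order execution versus
# roughly 0.25 for a modern out-of-order machine, per the figures above.
in_order_cpi = 3.0
out_of_order_cpi = 0.25
print(in_order_cpi / out_of_order_cpi)   # 12.0, i.e. on the order of 10x
```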
One is the found parallelism in the narrative, and the other is the predictability of the narrative. So certain operations do a bunch of calculations, and if it's greater than one, do this; else, do that. That decision is predicted in modern computers to high-90-percent accuracy, and branches happen a lot. So imagine you have a decision to make every six instructions, which is about the average, but you want to fetch five hundred instructions, figure out the graph, and execute them all in parallel. That means, if you fetch six hundred instructions and there's a branch every six, you have to predict ninety-nine out of a hundred branches correctly for that window to be effective.
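The rough arithmetic behind that claim: a branch about every six instructions means a 600-instruction window spans about 100 branches, and the whole window only holds if every one of them is predicted correctly. The numbers below are illustrative.

```python
# Probability an N-branch window has no mispredict, at a given per-branch
# accuracy. A 600-instruction window with a branch every ~6 instructions
# spans about 100 branches.
branches_in_window = 600 // 6            # ~100 branches per window

for accuracy in (0.85, 0.99, 0.999):
    p_window_ok = accuracy ** branches_in_window
    print(f"{accuracy} per branch -> {p_window_ok:.3f} chance the window holds")
```

At 85 percent per branch the window essentially never survives; at 99 percent it survives about a third of the time, which is why predictor accuracy matters so much to window size.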
Okay, so you can't parallelize branches? Or you can?

You can predict...

What does it mean to predict a branch?

So imagine you do a computation over and over; you're in a loop. While it's greater than one, do this; and you go through that loop a million times. So every time you look at the branch, you say, it's probably still greater than one.

And you're saying you could do that accurately?
Very accurately.

My mind is blown. How the heck do you do that? Wait a minute...

Well, you want to know? This is really sad. Twenty years ago, you simply recorded which way the branch went last time, and predicted the same thing.
Okay. What's the accuracy of that?

Eighty-five percent. So then somebody said, hey, let's keep a couple of bits and have a little counter: when it predicts one way, we count up, and then it pins. So say you have a three-bit counter: you count up and then count down, and you can use the top bit as the sign bit, so you have a signed two-bit number. If it's greater than one you predict taken, and less than one you predict not taken, or less than zero, or whatever the thing is. And that got us to 92 percent.

Oh, okay.
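The counter scheme described maps onto the textbook two-bit saturating counter. A minimal sketch, with states 0 to 3 and the top bit deciding the prediction (the class and test loop are invented for illustration):

```python
# Two-bit saturating counter branch predictor (states 0..3).
# The top bit (state >= 2) is the prediction; a single mispredict in a
# stable pattern nudges the counter one step but doesn't flip it.

class TwoBitPredictor:
    def __init__(self):
        self.state = 2                   # start weakly taken

    def predict(self):
        return self.state >= 2           # top bit as the "sign"

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A loop branch: taken nine times, then not-taken once at loop exit.
p = TwoBitPredictor()
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    correct += p.predict() == taken
    p.update(taken)
print(correct, "of", len(outcomes))      # 9 of 10
```

Because the one exit mispredict only moves the counter from 3 to 2, the predictor still says "taken" when the loop restarts, which is what lifts accuracy over the remember-last-outcome scheme.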
Then: this branch depends on how you got there. If you came down the code one way, you were talking about Bob and Jane, and whether Bob likes Jane, and it went one way; but if you came down another way, talking about something else, you go a different way. So that's called history: you take the history and a counter.
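A history-plus-counter predictor of the kind just described can be sketched by indexing a table of two-bit counters with recent branch history; XOR-ing the history with the branch address is the standard gshare-style textbook construction, not a claim about any specific chip, and all names here are invented.

```python
# History-indexed branch predictor sketch: a small table of 2-bit counters
# indexed by (branch address XOR recent outcome history), gshare-style.
# The same branch uses different counters depending on how you got there.

HISTORY_BITS = 4
TABLE_SIZE = 1 << HISTORY_BITS

class HistoryPredictor:
    def __init__(self):
        self.table = [2] * TABLE_SIZE    # 2-bit saturating counters, weakly taken
        self.history = 0                 # last HISTORY_BITS branch outcomes

    def _index(self, pc):
        return (pc ^ self.history) % TABLE_SIZE

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) % TABLE_SIZE

# An alternating branch (taken, not-taken, taken, ...) defeats a lone counter,
# but once the history distinguishes the two contexts it predicts perfectly.
p = HistoryPredictor()
outcomes = [i % 2 == 0 for i in range(200)]
correct = 0
for taken in outcomes:
    correct += p.predict(7) == taken
    p.update(7, taken)
print(correct, "of", len(outcomes))      # 198 of 200
```

After a two-mispredict warm-up, the alternating pattern lands in two different table entries and every prediction is right, which a single counter could never achieve on this branch.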
That's cool, but that's not how anything works today. They use something that looks a little like a neural network. So, modern: you take all the execution flows, and then you do basically deep pattern recognition of how the program is executing, and you do that multiple different ways, and you have something that chooses what the best result is. There's a little supercomputer inside the computer that calculates which way branches go, so the effective window that it's worth finding graphs in gets bigger.

Why was that going to make me sad?
That's amazing.

It's amazingly complicated. Oh, well, here's the funny thing: to get to 85 percent took a thousand bits; to get to 99 percent takes tens of megabits. So this is one of those things where, to get from a window of say 50 instructions to 500, it took three or four orders of magnitude more bits.

Now, if you get the prediction of a branch wrong, what happens then?

You flush the pipe.

So it's just a performance cost?

It gets even better.
Yeah. So we're starting to look at stuff that says: you executed down this path, and then you had two ways to go, but far, far away there's something that doesn't care which path you went. So you took the wrong path, you executed a bunch of stuff, then you hit the mispredict and you backed it up, but you remembered all the results you already calculated, and some of those are just fine. Like if you read a book and you misunderstand a paragraph: your understanding of the next paragraph is sometimes invariant to that misunderstanding, and sometimes it depends on it.

And you can kind of anticipate that invariance?

Yeah. Well, you can keep track of whether the data changed, and so when you come back to a piece of code, should you calculate it again, or just reuse the result?
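In software terms, that remember-and-reuse idea is memoization with invalidation: keep a result, and recompute only when the inputs it depended on have changed. This is an analogy for the hardware trick being described, not its actual implementation, and all names here are invented.

```python
# Reuse a computed result when its inputs haven't changed; recompute otherwise.
# A software analogy for keeping work done down a mispredicted path.

class ReusableResult:
    def __init__(self, fn):
        self.fn = fn
        self.cached_inputs = None
        self.cached_value = None
        self.recomputes = 0

    def __call__(self, *inputs):
        if inputs != self.cached_inputs:   # inputs changed: must recompute
            self.cached_inputs = inputs
            self.cached_value = self.fn(*inputs)
            self.recomputes += 1
        return self.cached_value           # otherwise the old result is fine

expensive = ReusableResult(lambda a, b: a * b + 1)
print(expensive(3, 4), expensive(3, 4), expensive(5, 4))   # 13 13 21
print(expensive.recomputes)                                # 2 (one call was reused)
```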
Okay. How much of this is art, and how much of it is science? Because it sounds pretty complicated.

Well, how do you describe a situation? So imagine you come to a point in the road where you have to make a decision, and you have a bunch of knowledge about which way to go. Maybe you have a map. Do you want to go the shortest way, or the fastest way, or do you want to take the nicest road? It's just some set of data. So imagine you're doing something complicated, like building a computer, and there are hundreds of decision points, all with hundreds of possible ways to go, and the ways you pick interact in a complicated way. And then you have to pick the right spot.

Right. So is that art or science? I don't know.
You avoided the question. You just described the Robert Frost problem of the road less taken.

I described the Robert Frost problem, which is what we do as computer designers. It's all poetry.

Okay, great. Yeah, I don't know how to describe that, because some people are very good at making those intuitive leaps about combinations of things, and some people are less good at that, but they're really good at evaluating the alternatives. Everybody has a different way to do it. Some people can't make those leaps, but they're really good at analyzing them. So when you see computers, they're designed by teams of people with very different skill sets, and a good team has lots of different kinds of people.
I suspect you would describe some of them as artistic.

But not very many, unfortunately. Or fortunately.

Well, you know, computer design is hard. It's 99 percent perspiration, and the 1 percent inspiration is really important.

But you still need the 99?

Yeah, you've got to do a lot of work. And then there are interesting things to do at every level of that stack.

So at the end of the day, if you run the same program multiple times, does it always produce the same result?
Is there some room for fuzziness there?

That's a math problem. If you run a correct C program, the definition is that every time you run it, you get the same answer.

Yeah, that's a math statement.

It's a language definitional statement. So, for years, when we first did 3D acceleration of graphics, you could run the same scene multiple times and get different answers.
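One concrete way the same scene can produce different answers: floating-point addition is not associative, so if parallel hardware sums the same values in a different order on different runs, the result differs. A minimal illustration:

```python
# Floating-point addition is not associative: summing the same numbers in a
# different order can give a different answer.
vals = [1e16, 1.0, -1e16, 1.0]
left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]
reordered = (vals[0] + vals[2]) + (vals[1] + vals[3])
print(left_to_right, reordered)   # 1.0 2.0
```

The 1.0 is absorbed by rounding when it's added to 1e16 first, but survives when the two big values cancel first, which is exactly the kind of order dependence a parallel reduction exposes.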
And then some people thought that was okay, and some people thought it was a bad idea. And then when the HPC world used GPUs for calculations, they thought it was a really bad idea. Now, in modern AI stuff, people are looking at networks where the precision of the data is low enough that the data is somewhat noisy. And the observation is, the input data is unbelievably noisy, so why shouldn't the calculation be noisy? And people have experimented with algorithms that can get faster answers by being noisy. Like, as a network starts to converge, if you look at the computation graph, it starts out really wide and it gets narrower, and you can say: is that last little bit that important, or should I start the graph on the next rev before we whittle all the way down to the answer? So you can create algorithms that are noisy. Now, if you're developing something, and every time you run it you get a different answer, it's really annoying.
And so most people expect, even today, that every time you run the program, you get the same answer; that's the formal definition of a programming language. There are definitions of languages that don't guarantee the same answer, but people who use those always want it anyway, because you get a bad answer and then you're wondering: is it because of something in your program, or because of this? And so everybody wants a little switch that says, no matter what, do it deterministically. And it's really weird, because almost everything going into modern calculations is noisy. So why do the answers have to be so clear?
Right. Well, we design computers for people who run programs.

So if somebody says, I want a deterministic answer...

Yeah, most people want that.

Can you deliver a deterministic answer, I guess, is the question. Like, when you...

Hopefully, sure. What people don't realize is, you get a deterministic answer even though the execution flow is very undeterministic. So if you run this program a hundred times, it never runs the same way twice. Ever.

And the answer...

It arrives at the same answer. Given the same input, it gets the same answer every time.

It's just amazing.