Jim Keller: Moore's Law, Microprocessors, and First Principles | Lex Fridman Podcast #70
Nb2tebYAaOA • 2020-02-05
Transcript preview
Open
Kind: captions Language: en the following is a conversation with Jim Keller legendary microprocessor engineer who has worked at AMD Apple Tesla and now Intel he's known for his work on AMD K 7 K 8 K 12 and Xen microarchitectures Apple a4 and a5 processors and co-author of the specification for the x86 64 instruction set and hyper transport interconnect he's a brilliant first principles engineer and out-of-the-box thinker and just an interesting and fun human being to talk to this is the artificial intelligence podcast if you enjoy it subscribe on YouTube give it five stars an apple podcast follow on Spotify supported on patreon or simply connect with me on Twitter Alex Friedman spelled Fri D ma a.m. I recently started doing ads at the end of the introduction I'll do one or two minutes after introducing the episode and never any ads in the middle that can break the flow of the conversation I hope that works for you and doesn't hurt the listening experience this show is presented by cash app the number one finance I up in the App Store I personally use cash app to send money to friends but you can also use it to buy sell and deposit Bitcoin in just seconds cash app also has a new investing feature you can buy fractions of a stock say $1 worth no matter what the stock price is brokers services are provided by cash app investing a subsidiary of square and member si PC I'm excited to be working with cash app to support one of my favorite organizations called first best known for their first robotics and Lego competitions they educate and inspire hundreds of thousands of students in over 110 countries and have a perfect rating a charity navigator which means that donated money is used to maximum effectiveness when you get cash app from the App Store Google Play and use code Lex podcast you'll get ten dollars and cash app will also donate ten dollars to the first which again is an organization that I've personally seen inspire girls and boys the dream of engineering a better world and now here's my with Jim Keller what are the differences in similarities between the human brain and a computer with the microprocessors core let's start with a philosophical question perhaps well since people don't actually understand how human brains work I think that's true I think that's true so it's hard to compare them computers are you know there's really two things there's memory and there's computation right and to date almost all computer architectures are global memory which is a thing right and then computation where you pull data and you do relatively simple operations on it and write data back so it's decoupled in modern in modern computers and you think in the human brain everything's a mesh a mess that's combined together what people observe is there's you know some number of layers of neurons which have local and global connections and information is stored in some distributed fashion and people build things called neural networks in computers where the information is distributed in some kind of fashion you know there's a mathematics behind it I don't know that the understandings that is super deep the computations we run on those are straightforward computations I don't believe anybody has said a neuron does this computation so to date it's hard to compare them I would say so let's get into the basics before we zoom back out how do you build a computer from scratch what is a microprocessor what is it microarchitecture what's an instruction set architecture maybe even as far back as what is a transistor so the special charm of computer engineering is there's a relatively good understanding of abstraction layers so down to bottom you have atoms and atoms get put together in materials like silicon or dope silicon or metal and we build transistors on top of that we build logic gates right and in functional units like an adder or subtractor or an instruction parsing unit and we assemble those into you know processing elements modern computers are built out of you know probably 10 to 20 locally you know organic processing elements or coherent processing elements and then that runs computer programs right so there's abstraction layers and then software you know there's an instruction set you run and then there's assembly language C C++ Java JavaScript you know there's abstraction layers you know essentially from the atom to the data center right so when you when you build a computer you know first there's a target like what's it for look how fast does it have to be which you know today there's a whole bunch of metrics about what that is and then in an organization of you know a thousand people who build a computer there's lots of different disciplines that you have to operate on does that make sense and so so there's a bunch of levels abstraction of in in organizational I can tell and in your own vision there's a lot of brilliance that comes in it every one of those layers some of it is science some was engineering some of his art what's the most if you could pick favorites what's the most important your favorite layer on these layers of abstractions where does the magic enter this hierarchy I don't really care that's the fun you know I'm somewhat agnostic to that so I would say for relatively long periods of time instruction sets are stable so the x86 instruction said the arm instruction set what's an instruction set so it says how do you encode the basic operations load store multiply add subtract conditional branch you know there aren't that many interesting instructions look if you look at a program and it runs you know 90% of the execution is on 25 opcodes you know 25 instructions on those are stable right what does it mean stable until architecture has been around for twenty-five years it works it works and that's because the basics you know or defined a long time ago right now the way an old computer ran is you fetched instructions and you executed them in order to the load do the ad do the compare the way a modern computer works is you fetch large numbers of instructions say 500 and then you find the dependency graph between the instructions and then you you execute in independent units those little micro graphs so a modern computer like people like to say computers should be simple and clean but it turns out the market for a simple complete clean slow computers is zero right we don't sell any simple clean computers now you can there's how you build it can be clean but the computer people want to buy that's say you know phone or data center such as a large number of instructions computes the dependency graph and then executes it in a way that gets the right answers and optimizes that graph somehow yeah they run deeply out of order and then there's semantics around how memory ordering works and other things work so the the computer sort of has a bunch of bookkeeping tables it says what order CDs operations finishing or appear to finish him but to go fast you have to fetch a lot of instructions and find all the parallelism now there's a second kind of computer which we call GPUs today and I called the difference there's found parallelism like you have a program with a lot of dependent instructions you fetch a bunch and then you go figure out the dependency graph and you issues instructions out order that's because you have one serial narrative to execute which in fact is and can be done out of order you call a narrative yeah well so yeah so humans think of serial narrative so read read a book right there's a you know there's the sends after sentence after sentence and there's paragraphs now you could diagram that imagine you diagrammed it properly and you said which sentences could be read in anti order any order without changing the meaning right but that's a fascinating question to ask of a book yeah yeah you could do that right so some paragraphs could be reordered some sentences can be reordered you could say he is tall and smart and X right and it doesn't matter the order of tall and smart but if you say is that tall man who's wearing a red shirt what colors you know like you can create dependencies right right and so GPUs on the other hand run simple programs on pixels but you're given a million of them and the first order the screen you're looking at it doesn't care which order you do it in so I call that given parallelism simple narratives around the large numbers of things where you can just say it's parallel because you told me it was so found parallelism where the narrative is it's sequential but you discover like little pockets of parallelism of versus turns out large pockets of parallelism large so how hard is it to discuss well how hard is it that's just transistor count right so once you crack the problem you say here's how you fetch ten instructions at a time here's how you calculated the dependencies between them here's how you describe the dependencies here's you know these are pieces right so once you describe the dependencies then it's just a graph sort of it's an algorithm that finds what is that I'm sure there's a graph there is the theoretical answer here that's solved well in general programs modern programs like human beings right how much found parallelism is there and on that I max what is 10 next mean oh well you execute it in order vs. yeah you would get what's called cycles per instruction and it would be about you know three instructions three cycles per instruction because of the latency of the operations and stuff and in a modern computer excuse it but like point to 0.25 cycles per instruction so it's about with today fine 10x and there and there's two things one is the found parallelism in the narrative right and the other is to predictability of the narrative right so certain operations they do a bunch of calculations and if greater than one do this else do that that that decision is predicted in modern computers to high 90% accuracy so branches happen a lot so imagine you have you have a decision to make every six instructions which is about the average right but you want to fetch five under instructions figure out the graph and execute them all in parallel that means you have let's say if you effect 600 instructions it's every six you have to fetch you have to predict ninety-nine out of a hundred branches correctly for that window to be effective okay so parallelism you can't paralyze branches or you can looking pretty you can what is predict a branch mean or what open take so imagine you do a computation over and over you're in a loop so Wow and it's greater than one do and you go through that loop a million times so every time you look at the branch you say it's probably still greater than one he's saying you could do that accurately very accurately monitoring comes my mind is blown how the heck did you that wait a minute well you want to know this is really sad 20 years ago yes you simply recorded which way the branch went last time and predicted the same thing right okay what's the accuracy of that 85% so then somebody said hey let's keep a couple of bits and have a little counter so and it predicts one way we count up and then pins so say you have a three bit counter so you count up and then count down and if it's you know you can use the top bit as the sign bit so you have a sign to bit number so if it's greater than one you predict taken and lesson one you predict not-taken right or less than zero or whatever the thing is and that got us to 92% oh okay I know is this better this branch depends on how you got there so if you came down the code one way you're talking about Bob and Jane right and then said is just Bob like Jane Enoch went one way but if you're talking about Bob and Jill this Bob like changes you go a different way right so that's called history so you take the history and a counter that's cool but that's not how anything works today they use something that looks a little like a neural network so modern you take all the execution flows and then you do basically deep pattern recognition of how the program is executing and you do that multiple different ways and you have something that chooses what the best result is there's a little supercomputer inside the computer that's trying to project that calculates which way branches go so the effective window that it's worth finding grassing gets bigger why was that gonna make me sad that's amazing it's amazingly complicated oh well here's the funny thing so to get to 85% took a thousand bits to get to 99% takes tens of megabits so this is one of those to get the result you want you know to get from a window of say 50 instructions to 500 it took three orders of magnitudes or four orders of magnitude toward bits now if you get the prediction of a branch wrong what happens then what is the pipe you flush the pipes is just the performance cost but it gets even better yeah so we're starting to look at stuff that says so executed down this path and then you had two ways to go but far far away there's something that doesn't matter which path you went so you miss you took the wrong path you executed a bunch of stuff then you had to miss predicting too backed it up but you remembered all the results you already calculated some of those are just fine look if you read a book and you misunderstand the paragraph your understanding is the next paragraph sometimes is invariance I don't understand you sometimes it depends on it and you can kind of anticipate that invariance yeah well you can keep track of whether that data changed and so when you come back to a piece of code should you calculate it again or do the same thing okay how much does this is art and how much of it is science because it sounds pretty complicated so well how do you describe a situation so imagine you come to a point in the road we have to make a decision right and you have a bunch of knowledge about which way to go maybe you have a map so you want to go is the shortest way or do you want to go the fastest way or you want to take the nicest Road so it's just some set of data so imagine you're doing something complicated like a building a computer and there's hundreds of decision points all with hundreds of possible ways to go and the ways you pick interacts in a complicated way right and then you have to pick the right spot right so those are there so I don't know yeah avoided the question you just described do the Robert Frost poem road less taken I describe the Robin truss problem which we do as computer designers it's all poetry ok great yeah I don't know how to describe that because some people are very good at making those intuitive leaps it seems like the combinations of things some people are less good at it but they're really good at evaluating your alternatives right and everybody has a different way to do it and some people can't make those sleeps but they're really good at analyzing it so when you see computers are designed by teams of people who have very different skill sets and a good team has lots of different kinds of people and I suspect you would describe some of them as artistic right but not very many unfortunately or fortunately fortunately well you know computer science hard it's 99% perspiration and the 1% inspiration is really important but I need the 99 yeah you got to do a lot of work and then there's there are interesting things to do at every level that stack so at the end of the day if you're on the same program multiple times does it always produce the same result is is there some room for fuzziness there that's a math problem so if you run a correct C program the definition is every time you run it you get the same answer yeah that well that's a math statement but that's a that's a language definitional statement so yes for years when people did when we first did 3d acceleration of graphics you could run the same scene multiple times and get different answers right right and then some people thought that was okay and some people thought it was a bad idea and then when the HPC world used GPUs for calculations they thought it's a really bad idea okay now in modern AI stuff people are looking at networks where the precision of the data is low enough that the date has somewhat noisy and the observation as the input data is unbelievably noisy so why should the calculation be not noisy and people have experimented with algorithms that say can get faster answers by being noisy like as the network starts to converge if you look at the computation graph it starts out really wide and it gets narrower and you can say is that last little bit that important or should I start to graph on the next rap rev before we would live all the way down to the answer right so you can create algorithms that are noisy now if you're developing something and every time you run it you get a different answer it's really annoying and so most people think even today every time you run the program you get the same answer now I know but the question is that's the formal definition of a programming language there is a definition of languages that don't get the same answer but people who use those you always want something because you get a bad answer and then you're wondering is it because right something in your brother because of this and so everybody wants a little swish that says no matter what ya do it deterministically and it's really weird because almost everything going into monetary calculations is noisy so why the answers have to be so clear it's right so where do you stand by design computers for people who run programs so somebody says I want in deterministic answer like most people want that can you deliver a deterministic answer I guess is the question like when you hopefully sure that's what people don't realize is you get a deterministic answer even though the execution flow is very own deterministic so if you run this program a hundred times it never runs the same way twice ever and the answer it arises the same in but it gets the same answer every time it's just just them is just amazing okay you've achieved in eyes of many people legend status as a cheap art architect what design creation are you most proud of perhaps because it was challenging because of its impact or because of the set of brilliant ideas that that were involved in well I find that description odd and I has two small children and I promise you they think it's hilarious this question yeah so I dude so I I'm I'm really interested in building computers and I've worked with really really smart people I'm not unbelievably smart I'm fascinated by how they go together both as a as a thing to do and is endeavor that people do how people in computers go together yeah like how people think and build a computer and I find sometimes that the best computer architects aren't that interested in people or the best people managers aren't that good at designing computers so the whole stack of human beings is fascinating so the managers individual engineers yeah I just I said I realized after a lot of years of building computers where you sort of build them out of the transistors logic gates functional units come computational elements that you could think of people the same way so people are functional units yes and then you can think of organizational design it's a computer architectural problem and then it's like oh that's super cool because the people are all different just like the computation elephants are all different and they like to do different things and and so I had a lot of fun like reframing how I think about organizations just like with with computers we were saying execution paths you can have a lot of different paths that end up at a at at the same good destination so what have you learned about the human abstractions from individual functional human units to the broader organization what does it take to create something special well most people don't think simple enough all right so do you know the difference between a recipe and understanding there's probably a philosophical description of this so imagine you can make a loaf of bread yeah the recipe says get some flour add some water add some yeast mix it up let it rise put it in a pan put it in the oven it's a recipe right understanding bread you can understand biology supply chains you know grain grinders yeast physics you know thermodynamics like there's so many levels of understanding there and then when people build and design things they frequently are executing some stack of recipes right and the problem with that is the recipes all have a limited scope look if you have a really good recipe book for making bread it won't tell you anything about how to make an omelet right right but if you have a deep understanding of cooking right then bread omelets you know sandwich you know there's there's a different you know way of viewing everything and most people when you get to be an expert at something you know you're you're hoping to achieve deeper understanding not just a large set of recipes to go execute and it's interesting the walk groups of people because xqt reps apiece is unbelievably efficient if it's what you want to do if it's not what you want to do you're really stuck and and that difference is crucial and ever and everybody has a balance of let's say deeper understanding recipes and some people are really good at recognizing when the problem is to understand something DP deeply that make sense it totally makes sense does it every stage of development deep on understanding on the team needed oh this goes back to the art versus science question sure if you constantly unpacked everything for deeper understanding you never get anything done right and if you don't unpack understanding when you need to you'll do the wrong thing and then at every juncture like human beings are these really weird things because everything you tell them has a million possible outputs all right and then they all interact in a hilarious way and then having some intuition about what you tell them what you do when do you intervene when do you not it's it's complicated all right so it's you know essentially computationally unsolvable yeah it's an intractable problem sure humans are a mess but with deep understanding do you mean also sort of fundamental questions of things like what is a computer or why like think the why question is why are we even building this like of purpose or do you mean more like going towards the fundamental limits of physics sort of really getting into the core of the sighs well in terms of building the computer thinks simple think a little simpler so common practice is you build a computer and then when somebody says I want to make it 10% faster you'll go in and say alright I need to make this buffer bigger and maybe I'll add an ad unit or you know I have this thing that's three instructions wide I'm going to make it four instructions wide and what you see is each piece gets incrementally more complicated right and then at some point you hit this limit like adding another feature or a buffer doesn't seem to make it any faster and then people say well that's because it's a fundamental limit and then somebody else to look at it and say well actually the way you divided the problem up and the way that different features are interacting is limiting you and it has to be rethought rewritten right so then you refactor it and rewrite it and what people commonly find is the rewrite is not only faster but half is complicated from scratch yes so how often in your career but just have you seen as needed maybe more generally to just throw the whole out thing out this is where I'm on one end of it every three to five years which end are you on like rewrite more often right and three or five years is if you want to really make a lot of progress on computer architecture every five years you should do one from scratch so where does the x86 64 standard come in or what how often do you I wrote the I was the co-author that's back in 98 that's 20 years ago yeah so that's still around the instruction set it stuff has been extended quite a few times yes and instruction sets are less interesting and implementation underneath there's been on x86 architecture Intel's designed a few Eames is designed a few very different architectures and I don't want to go into too much of the detail about how often but it's there's a tendency to rewrite it every you know 10 years and it really should be every five so you're saying you're an outlier in that sense in really more often we write more often well in here isn't that scary yeah of course well scary - who - everybody involved because like you said repeating the recipe is efficient companies want to make money well no in the individual juniors want to succeed so you want to incrementally improve increase the buffer from three to four well we get into the diminishing return curves I think Steve Jobs said this right so every you have a project and you start here and it goes up and they have Domitian return and to get to the next level you have to do a new one in the initial starting point will be lower than the old optimization point but it'll get higher so now you have two kinds of fear short-term disaster and long-term disaster and you're you're wrong right like you know people with a quarter by quarter business objective are terrified about changing everything yeah and people who are trying to run a business or build a computer for a long term objective know that the short-term limitations block them from the long term success so if you look at leaders of companies that had really good long-term success every time they saw that they had to redo something they did and so somebody has to speak up or you do multiple projects in parallel like you optimize the old one while you build a new one and but the marketing guys they're always like make promise me that the new computer is faster on every single thing and the computer architect says well the new computer will be faster on the average but there's a distribution or results in performance and you'll have some outliers that are slower and that's very hard because they have one customer cares about that one so speaking of the long-term for over 50 years now Moore's law has served a for me and millions of others as an inspiring beacon what kind of amazing future brilliant engineers can build no I'm just making your kids laugh all of today it was great so first in your eyes what is Moore's law if you could define for people who don't know well the simple statement was from Gordon Moore was double the number of transistors every two years something like that and then my operational model is we increased the performance of computers by 2x every 2 or 3 years and it's wiggled around substantially over time and also in how we deliver performance has changed but the foundational idea was to X two transistors every two years the current cadence is something like they call it a shrink factor like point six every two years which is not 0.5 but that that's referring strictly again to the original definition of transistor count a shrink factors just getting them smaller small as well as you use for a constant chip area if you make the transistor smaller by 0.6 then you get 1 over 0.6 more transistors so can you linger a little longer what's what's a broader what do you think should be the broader definition of Moore's law we mentioned before how you think of performance just broadly what's a good way to think about Moore's law well first of all so I I've been aware of Moore's law for 30 years in what sense well I've been designing computers for 40 just watching it before your eyes kind of slow and somewhere where I became aware of it I was also informed that Moore's law was gonna die in 10 to 15 years and I thought that was true at first but then after 10 years it was gonna die in 10 to 15 years and then at one point it was gonna die in 5 years and then it went back up to ten years and at some point I decided not to worry about that particular product mastication for the rest of my life which is which is fun and then I joined Intel and everybody said Moore's law is dead and I thought that's sad because it's the Moore's law company and it's not dead and it's always been gonna die and you know humans you like these apocryphal kind of statements like we'll run out of food or run out of air or you know something right but it's still incredible this lived for as long as it has and yes there's many people who believe now that Moore's Law instead you know they can join the last 50 years of people had the thing yeah there's a long tradition but why do you think if you can in touch try to understand it why do you think it's not dead well for Hartley let's just think people think Moore's law is one thing transistors get smaller but actually under the sheets ours literally thousands of innovations and almost all those innovations have their own diminishing return curves so if you graph it it looks like a cascade of diminishing return curves I don't know what to call that but the result is an exponential curve at least it has been so and we keep inventing new things so if you're an expert in one of the things on a diminishing return curve right and you can see it's plateau you will probably tell people well this is this is done meanwhile some other pile of people are doing something different so that's that's just normal so then there's the observation of how small could a switching device be so a modern transistor is something like a thousand by a thousand by thousand atoms right and you get quantum effects down around two to two to ten atoms so you can imagine the transistor as small as 10 by 10 by 10 so that's a million times smaller and then the quantum computational people are working away at how to use quantum effects so a thousand by thousand five thousand atoms it's a really clean way of putting it well fin like a modern transistor if you look at the fan it's like a hundred and twenty atoms wide but we can make that thinner and then there's there's a gate wrapped around it and under spacing there's a whole bunch of geometry and you know a competent transistor designer could count both atoms in every single direction like there's techniques now to already put down atoms in a single atomic layer and you can place atoms if you want to it's just you know from a manufacturing process if placing an atom takes ten minutes and you need to put you know 10 to the 23rd atoms together to make a computer it would take a long time so the the methods are you know both shrinking things and then coming up with effective ways to control what's happening manufacture stabling cheaply yeah so the innovation stocks pretty broad you know there there's equipment there's optics there's chemistry there's physics there's material science there's metallurgy there's lots of ideas about when you put their four materials together how they interact are they stable is I stable or temperature you know like are they repeatable you know there's look there's like literally thousands of technologies involved but just for the shrinking you don't think we're quite yet close to the fundamental limit in physics I did a talk on Moore's Law and I asked for a road map to a path of 100 and after two weeks they said we only got to fifty a hundred what's a 100 extra hundred shrink we only got 15 I said once you go to another two weeks well here's the thing about Moore's law right so I believe that the next 10 or 20 years of shrinking is going to happen right now as a computer designer there's you have two stances you think it's going to shrink in which case you're designing and thinking about architecture in a way that you'll use more transistors or conversely not be swamped by the complexity of all the transistors you get right you have to have a strategy you know so you're open to the possibility and waiting for the possibility of a whole new army of transistors ready to work I'm expecting expecting more transistors every two or three years by a number large enough that how you think about design how you think about architecture has to change like imagine you're you build built brick buildings out of bricks and every year the bricks are half the size or every two years well if you kept building bricks the same way you know so many bricks per person per day the amount of time to build a building would go up exponentially right right but if you said I know that's coming so now I'm going to design equipment and moves bricks faster uses them better because maybe you're getting something out of the smaller bricks more strengths inner walls you know less material efficiency out of that so once you have a roadmap with what's going to happen transistors they're gonna get we're gonna get more of them then you design was collateral rounded to take advantage of it and also to cope with it like that's the thing people to understand it's like if I didn't believe in Moore's law and Moore's law transistors showed up my design teams were all drowned so what's the what's the hardest part of this in flood of new transistors I mean even if you just look historically throughout your career what's what's the thing you what fundamentally changes when you add more transistors in in the task of designing an architecture no there's there's two constants right one is people don't get smarter I think by the way there's some size shown that we do get smarter because nutrition whatever sorry bring that what effect yes nobody understands it nobody knows if it's still going on so that's all or whether it's real or not but yeah that's a I sort of Amen but not if I believe for the most part people aren't getting much smarter the evidence doesn't support it that's right and then teams can't grow that much right all right so human beings understand you know we're really good in teams of ten you know up two teams of a hundred they can know each other beyond that you have to have organizational boundaries so you're kind of you have those are pretty hard constraints all right so then you have to divide and conquer like as the designs get bigger you have to divide it into pieces you know that the power of abstraction layers is really high we used to build computers out of transistors now we have a team that turns transistors and logic cells and our team that turns them into functional you know it's another one it turns in computers right so we have abstraction layers in there and you have to think about when do you shift gears on that we also use faster computers to build faster computers so some algorithms run twice as fast on new computers but a lot about rhythms are N squared so you know a computer with twice as many transistors and it might take four Tom's times as long to run so you have to refactor at the software like simply using faster computers to build bigger computers doesn't work so so you have to think about all these things so in terms of computing performance and the exciting possibility that more powerful computers bring is shrinking the thing we've been talking about one of the for you one of the biggest exciting possibilities of advancement in performance or is there are other directions that you're interested in like like in the direction of sort of enforcing given parallelism or like doing massive parallelism in terms of many many CPUs you know stacking CPUs on top of each other that kind of that kind of parallelism or you kind of well think about it a different way so old computers you know slow computers you said a equal B plus C times D pretty simple right and then we made faster computers with vector units and you can do proper equations and matrices right and then modern like AI computations or like convolutional neural networks we you convolve one large data set against another and so there's sort of this hierarchy of mathematics you know from simple equation to linear equations to matrix equations to it's a deeper kind of computation and the data sets are getting so big that people are thinking of data as a topology problem you know data is organized in some immense shape and then the computation which sort of wants to be get data from immense shape and do some computation on it so the with computers of a lot of people to do is how about rhythms go much much further so that that paper you you reference the Sutton paper they talked about you know like in a I started it was a ploy rule sets to something that's a very simple computational situation and then when they did first chess thing they solved deep searches so have a huge database of moves and results deep search but it's still just a search right now we we take large numbers of images and we use it to Train these weight sets that we convolve across it's a completely different kind of phenomena we call that AI now they're doing the next generation and if you look at it they're going up this mathema graph right and then computations the both computation and data sets support going up that graph yeah the kind of computation of my I mean I would argue that all of it is still a search right just like you said a topology problems data says he's searching the data sets for valuable data and also the actual optimization of your networks is a kind of search for the I don't know if you looked at the inner layers of finding a cat it's not a search it's it's a set of endless projection so you know projection and here's a shadow of this phone yeah right then you can have a shadow of that onto something a shadow on that or something if you look in the layers you'll see this layer actually describes pointy ears and round eyeness and fuzziness and but the computation to tease out the attributes is not search right ain't like the inference part might be searched but the trainings not search okay well 10 then in deep networks they look at layers and they don't even know it's represented and yet if you take the layers out it doesn't work ok so if I don't think it's search all right well but you have to talk to my mathematician about what that actually is oh you disagree but the the it's just semantics I think it's not but it's certainly not I would say it's absolutely not semantics but okay all right well if you want to go there so optimization to me is search and we're trying to optimize the ability of a neural network to detect cat ears and this difference between chess and the space the incredibly multi-dimensional hundred thousand dimensional space that you know networks are trying to optimize over is nothing like the chessboard database so it's a totally different kind of thing and okay in that sense you can say yeah yeah you know I could see how you you might say if if you the funny thing is it's the difference between given search space and found search space exactly yeah maybe that's a different way beautiful but okay but you're saying what's your sense in terms of the basic mathematical operations and the architectures can be hardwired that enables those operations do you see the CPUs of today still being a really core part of executing those mathematical operations yes well the operations you know continue to be add subtract loads or compare and branch it's it's remarkable so it's it's interesting that the building blocks of you know computers or transistors and you know under that atoms so you got atoms transistors logic gates computers right you know functional units and computers the building blocks of mathematics at some level are things like adds and subtracts and multiplies but that's the space mathematics can describe is I think essentially infinite but the computers that run the algorithms are still doing the same things now a given algorithm may say I need sparse data or I need 32-bit data or I need you know like a convolution operation that naturally takes 8-bit data multiplies it and sums it up a certain way so the like the data types in tensorflow imply an optimization set but when you go write down a look at the computers it's an inorganic salt applies like like that hasn't changed much now the quantum researchers think they're going to change that radically and then there's people who think about analog computing because you look in the brain and it seems to be more analog ish you know that maybe there's a way to do that more efficiently but we have a million acts on computation and I don't know the reference the relationship between computational let's say intensity and ability to hit match mathematical abstractions I don't know anyway subscribe dad but but just like you saw an AI you went from rule sets the simple search to complex search does a found search like those are you know orders of magnitude more computation to do and as we get the next two orders of magnitude your friend Roger godori said like every order magnitude changed the computation fundamentally changes what the computation is doing here oh you know the expression the difference in quantity is the difference in kind you know the difference between ant and ant hill right or neuron and brain you know there's there's there's just indefinable place where the the quantity changed the quality right now we've seen that happen in mathematics multiple times and you know my my guess is it's gonna keep happening so your senses yeah if you focus head down and shrinking a transistor let's not just head down and we're aware about the software stacks that are running in the computational loads and we're kind of pondering what do you do with a petabyte of memory that wants to be accessed in a sparse way and have you know the kind of calculations ai programmers want so there's that there's a dialog interaction but when you go in the computer chip you know you find adders and subtractors and multipliers and so if you zoom out then with as you mentioned which Sutton the idea that most of the development in the last many decades in the AI research came from just leveraging computation and just the simple algorithms waiting for the computation to improve well suffer guys have a thing that they called the the problem of early optimization right so if you write a big software stack and if you start optimizing like the first thing you write the odds of that being the performance limiter is low but when you get the whole thing working can you make it to X faster by optimizing the right things sure while you're optimizing that could you've written a new software stack which would have been a better choice maybe now you have creative tension so but the whole time as you're doing the writing the that's the software we're talking about the hardware underneath gets faster which goes back to the Moore's laws Moore's Law is going to continue then your AI research should expect that to show up and then you make a slightly different set of choices then we've hit the wall nothing's gonna happen and from here it's just us rewriting algorithms like that seems like a failed strategy for the last 30 years of Moore's laws death so so can you just linger on it I think you've answered it but it just asked the same dumb question over and over so what why do you think Moore's law is not going to die which is the most promising exciting possibility of why it won't done that's five 10 years so is it that continues shrinking the transistor or is it another s-curve that steps in and it totally so dope shrinking the transistor is literally thousands of innovations right so there's so this they're all answers and it's there's a whole bunch of s-curves just kind of running their course and being reinvented and new things you know the the semiconductor fabricators and technologists have all announced what's called nano wires so they they took a fan which had a gate around it and turned that into a little wire so you have better control that and they're smaller and then from there there's some obvious steps about how to shrink that so the metallurgy around wire stocks and stuff has very obvious abilities to shrink and you know there's a whole combination of things there to do your sense is that we're gonna get a lot yes this innovation from just that shrinking yeah like a factor of a hundred salade yeah I would say that's incredible and it's totally it's only 10 or 15 years now you're smarter you might know but to me it's totally unpredictable of what that hundred x would bring in terms of the nature of the computation and people be yeah you familiar with Bell's law so for a long time those mainframes Mini's workstation PC mobile Moore's Law drove faster smaller computers right and then we were thinking about Moore's law rajae godori said every 10x generates a new computation so scalar vector made Erichs topological computation right and if you go look at the industry trans there was no mainframes and mini-computers and PCs and then the internet took off and then we got mobile devices and now we're building 5g wireless with one millisecond latency and people are starting to think about the smart world where everything knows you recognizes you like like like the transformations are going to be like unpredictable how does it make you feel that you're one of the key architects of this kind of futures you're not we're not talking about the architects of the high-level people who build the Angry Bird apps and flapping Angry Bird of who knows we're gonna be that's the whole point of the universe let's take a stand at that and the attention distracting nature of mobile phones I'll take a stand but anyway in terms of that matters much the the side effects of smartphones or the attention distraction which part well who knows you know where this is all leading it's changing so fast wax my parents do steal my sister's for hiding in the closet with a wired phone with a dial on it stop talking your friends all day right now my wife feels with my kids for talking to their friends all day on text looks the same to me it's always it's echoes of the same thing okay but you are the one of the key people architecting the hardware of this future how does that make you feel do you feel responsible do you feel excited so we're we're in a social context so there's billions of people on this planet there are literally millions of people working on technology I feel lucky to be you know what doing what I do and getting paid for it and there's an interest in it but there's so many things going on in parallel it's like the actions are so unpredictable if I wasn't here somebody else are doing the the vectors of all these different things are happening all the time you know there's a I'm sure some philosopher or meta philosophers you know wondering about how we transform our world so you can't deny the fact that these tools whether that these tools are changing our world that's right do you think it's changing for the better so some of these I read this thing recently it said the peat the two disciplines with the highest GRE scores in college are physics in philosophy right and they're both sort of trying to answer the question why is there anything right and the Philosopher's you know are on the kind of theological side and the physicists are obviously on the you know the material side and there's a hundred billion galaxies with a hundred billion stars it seems well repetitive at best so I you know there's on our way to ten billion people I mean it's hard to say what it's all for is that's what you're asking yeah I guess I guess I do tend to are significantly increases in complexity and I'm curious about how computation like like our world our physical world inherently generates mathematics it's kind of obvious right so we have X Y Z coordinates you take a sphere you make it bigger you get a surface that falls you know grows by r-squared like it generally generates mathematics and the mathematicians and the physicists have been having a lot of fun talking to each other for years and computation has been let's say relatively pedestrian like computation in terms of mathematics has been doing binary binary algebra while those guys have been gallivanting through the other realms of possibility right now recently the computation lets you do math m'q mathematical computations that are sophisticated enough that nobody understands how the answers came out right machine learning machine lying yeah it used to be you get data set you guess at a function the function is considered physics if it's predictive of new functions data sets modern you can take a large data set with no intuition about what it is and use machine learning to find a pattern that has no function right and it can arrive at results that I don't know if they're completely mathematically describable so a computation is kind of done something interesting compared to a POV plus see there's something reminiscent of that step from the basic operations of addition to taking a step towards new all networks that's reminiscent of what life on Earth and its origins was doing do you think we're creating sort of the next step in our evolution in creating artificial intelligence systems that I don't know I mean you know if there's so much in the universe already it's hard to say well I'm standing in his hold are human beings working on additional abstraction layers and possibilities yet appear so does that mean that human beings don't need dogs you know no like like there's so many things that are all simultaneously interesting and useful but you've seen through I agree you've seen great and greater level abstractions and built in artificial machines right do you think when you look at humans you think that the look of all life on earth as a single organism building this thing this machine that greater and greater levels of abstraction do you think humans are the peak the top of the food chain in this long arc of history on earth or do you think we're jus
Resume
Categories