Privacy Preserving AI (Andrew Trask) | MIT Deep Learning Series
4zrU54VIK6k • 2020-01-19
Today we're happy, very happy, to have Andrew Trask. He's a brilliant writer, researcher, and tweeter (that's a word) in the world of machine learning and artificial intelligence. He is the author of Grokking Deep Learning, the book that I highly recommended in the lecture on Monday. He's the leader and creator of OpenMined, which is an open source community that strives to make our algorithms, our data, and our world in general more privacy-preserving. He is coming to us by way of Oxford, but without that rich, complex, beautiful, sophisticated British accent, unfortunately. He is one of the best educators and truly one of the nicest people I know, so please give him a warm welcome.

Thanks, that was a very generous introduction. So yeah, today we're going to be talking about privacy-preserving AI. This talk is going to come in two parts. The first is going to look at privacy tools from the context of a data scientist or a researcher, like how their actual UX might change, because I think that's the best way to communicate some of the new technologies that are coming about. Then we're going to zoom out and look at, under the assumption that these kinds of technologies become mature, what that is going to do to society: what sort of consequences or side effects these kinds of tools could have, both positive and negative.

So first, let's ask the question: is it possible to answer questions using data that we cannot see? This is going to be the key question that we look at today. Let's start with an example. If we wanted to answer the question "what do tumors look like in humans?", well, this is a pretty complex question. Tumors are pretty complicated things, so we might train a classifier. If we wanted to do that, we would first need to download a dataset of tumor-related images, so we could statistically study these and be able to recognize what tumors look like in humans. But this
kind of data is not very easy to come by. It's very rarely collected, it's difficult to move around, and it's highly regulated, so we're probably going to buy it from a relatively small number of sources that are able to actually collect and manage this kind of information. The scarcity and the constraints around this are likely to make it a relatively expensive purchase. If it's going to be an expensive purchase for us to answer this question, then we're going to have to find someone to finance our project. If we need someone to finance our project, we have to come up with a way to pay them back, so now we have to create a business plan and find a business partner. And if we're going to find a business partner, we have to spam all our classmates on LinkedIn looking for someone to start a business with us. All of this is because we wanted to answer the question "what do tumors look like in humans?"

What if we wanted to answer a different question? What if we wanted to answer the question "what do handwritten digits look like?" Well, this would be a totally different story. We'd download the dataset, we'd download a state-of-the-art training script from GitHub, we'd run it, and a few minutes later we'd have the ability to classify handwritten digits with potentially superhuman ability, if such a thing exists. Why is this so different between these two questions? The reason is that getting access to private data, data about people, is really, really hard, and as a result we spend most of our time working on problems and tasks like this: ImageNet, MNIST, CIFAR-10. Anybody who's trained a classifier on MNIST before, raise your hand. I expect pretty much everybody. Instead of working on problems like this, has anyone trained a classifier to predict dementia, diabetes, Alzheimer's, [inaudible], depression, anxiety? No one. So why is it that we spend all our time on tasks like this, when these tasks represent, you know, our friends,
loved ones, and problems in society that really, really matter? That's not to say that there aren't people working on this; absolutely, there are whole fields dedicated to it. But for the machine learning community at large, these tasks are pretty inaccessible. In fact, in order to work on one of these, just getting access to the data, you'd have to dedicate a portion of your life to it, whether that's doing a startup or joining a hospital or what have you, whereas other kinds of datasets are simply readily accessible.

This brings us back to our question: is it possible to answer questions using data that we cannot see? In this talk we're going to walk through a few different techniques, and if the answer to this question is yes, the combination of these techniques is going to try to make it so that we can actually pip install access to datasets like these, in the same way that we pip install access to other deep learning tools. The idea here is to lower the barrier to entry, to increase accessibility to some of the most important problems that we would like to address.

As Lex mentioned, I lead a community called OpenMined, which is an open source community of a little over six thousand people who are focused on lowering the barrier to entry to privacy-preserving AI and machine learning. Specifically, one of the tools we're working on, which we're talking about today, is called PySyft. PySyft extends the major deep learning frameworks with the ability to do privacy-preserving machine learning. Today we're going to be looking at the extensions into PyTorch. Any PyTorch people here? Yeah, quite a few users. It's my hope that by walking through a few of these tools, it'll become clear how we can start to do data science, the act of answering questions using data, on data that we don't actually have direct access to. And then in the second
half of the talk we're going to generalize this to answering questions even if you're not necessarily a data scientist.

So, the first tool is remote execution. Let me walk you through this. We're going to jump into code for a minute, but hopefully this is line by line and relatively simple; even if you aren't familiar with PyTorch, I think it's relatively intuitive, since we're looking at lists of numbers and these kinds of things. Up at the top we import torch as a deep learning framework. Syft extends torch with this thing called TorchHook. All it's doing is iterating through the library and basically monkey-patching in lots of new functionality.

Most deep learning frameworks are built around one core primitive, and that core primitive is the tensor. For those of you who don't know what tensors are, just think of them as nested lists of numbers for now, and that'll be good enough for this talk. But we introduce a second core primitive, which is the worker. A worker is a location within which computation is going to be occurring. In this case we have a virtualized worker that is pointing to, say, a hospital data center. The assumption is that this worker will allow us to run computation inside of the data center without us actually having direct access to that worker itself. It gives us a limited, whitelisted set of methods that we can use on this remote machine.

Just to give you an example: there's that core primitive we talked about a minute ago, the torch tensor, so one, two, three, four, five. The first method that we added is called just .send, and it does exactly what you might expect: it takes the tensor, serializes it, sends it into the hospital data center, and returns back to me a pointer. This pointer is really, really special, and for those of you who are actually familiar with deep learning frameworks, I hope that this will
just really resonate with you, because it has the full PyTorch API as a part of it. But whenever you execute something using this pointer, instead of running locally, even though it looks like and feels like it's running locally, it actually executes on the remote machine and returns back to you another pointer to the result. The idea here is that I can now coordinate remote executions, remote computations, without necessarily having to have direct access to the machine. And of course I can do a .get request, and we'll see that this is actually really, really important: getting permissions around when you can do .get requests and actually ask for data from a remote machine to be sent back to you. So just remember that.

Cool. So this is where we start. In the spirit of the Pareto principle, 80% for 20%, this is like the first big cut. Pros: data remains on a remote machine, so we can now, in theory, do data science on a machine that we don't have access to, that we don't own. But the first con we want to address is: how can we actually do good data science without physically seeing the data? It's all well and good to say "I'm going to train a deep learning classifier," but the process of answering questions is inherently iterative. It's inherently give-and-take: I learn a little bit and I ask a little bit, I learn a little bit and I ask a little bit.

This brings me to the second tool: search and example data. Again, we're starting really simple; it will get more complex in a minute. In this case, let's say we have what's called a grid, so PyGrid. If PySyft is the library, PyGrid is the platform version. Again, this is all open source, Apache 2 stuff. We have what's called a grid client, which could be an interface to a large number of datasets inside of a big hospital. So let's say I wanted to train a classifier to do
something with diabetes, so it's going to predict diabetes, or predict a certain kind of diabetes, or a certain attribute of diabetes. I should be able to perform a remote search and get back pointers to the remote information. I can get back detailed descriptions of what the information is without actually looking at it: how it was collected, what the rows and columns are, what the types of the different fields are, what ranges the values can take on, things that allow me to do remote normalization and so on. And then, in some cases, I can even look at samples of this data. These samples could be human-curated, they could be generated from a GAN, or they could be short snippets from the actual dataset; maybe it's okay to release small amounts but not large amounts.

The reason I highlight this is that it isn't crazy complex stuff. Prior to going back to school, I used to work for a company called Digital Reasoning. We did on-prem data science, so we delivered AI services to corporations behind the firewall. We worked with classified information, and we worked with investment banks, helping prevent insider trading. Doing data science on data that your home team (back in Nashville, in our case) is not able to see is really, really challenging, but there are some things that can give you the first big jump before you move into the more complex tools that handle some of the more challenging use cases.

Cool. So basic remote execution (remote procedure calls), basic private search, and the ability to look at sample data give us enough general context to start doing things like feature engineering and evaluating quality. So now the data remains on the remote machine and we can do some basic feature engineering. And here's where things get a little more complicated.
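
[Editor's sketch] The worker-and-pointer idea described so far can be illustrated with a small, self-contained toy. This is not PySyft and does not use its API: the names VirtualWorker, Pointer, send, run_add, and release are hypothetical and chosen only for illustration, and plain Python lists stand in for tensors. The point it tries to show is that arithmetic on a pointer runs where the data lives and returns another pointer, while .get is the single step that moves real values back.

```python
# Toy model of the "worker" and "pointer" ideas from the talk.
# NOT PySyft: every class and method name here is a hypothetical stand-in.
import itertools


class VirtualWorker:
    """A location where data lives and where computation is executed."""

    def __init__(self, name):
        self.name = name
        self._store = {}               # object id -> values held on the worker
        self._ids = itertools.count()  # simple id generator

    def receive(self, values):
        """Store a sequence of numbers and hand back a pointer to it."""
        obj_id = next(self._ids)
        self._store[obj_id] = list(values)
        return Pointer(self, obj_id)

    def run_add(self, id_a, id_b):
        """A whitelisted operation: elementwise add, executed on the worker."""
        a, b = self._store[id_a], self._store[id_b]
        return self.receive(x + y for x, y in zip(a, b))

    def release(self, obj_id):
        """Send the real values back to the caller and forget them here."""
        return self._store.pop(obj_id)


class Pointer:
    """A local handle to data that stays on a remote worker."""

    def __init__(self, worker, obj_id):
        self.worker = worker
        self.obj_id = obj_id

    def __add__(self, other):
        # Looks local, but the work happens on the worker; the result is
        # another pointer, not the data itself.
        return self.worker.run_add(self.obj_id, other.obj_id)

    def get(self):
        # The one call that moves data back, so it is the call to guard.
        return self.worker.release(self.obj_id)


def send(values, worker):
    """Ship local values to a worker and keep only a pointer."""
    return worker.receive(values)


hospital = VirtualWorker("hospital")
x = send([1, 2, 3, 4, 5], hospital)  # x is a Pointer, not the data
y = x + x                            # executed inside `hospital`
result = y.get()                     # explicit request for the result
print(result)                        # [2, 4, 6, 8, 10]
```

In this toy nothing stops a caller from invoking get on every pointer, which is exactly why a real system has to put permissions around that call.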
Okay, so if you remember, in the very first slide where I showed you some code, at the bottom I called .get on the tensor. What that did was take the pointer to remote information and say, "hey, send that information to me." That is an incredibly important bottleneck. Unfortunately, despite the fact that I'm doing all my remote execution, if that's just naively implemented, I can steal all the data that I want to. I just call .get on whatever pointers I want, and there's no real added security. So what are we going to do about this?

This brings us to tool number three, called differential privacy. Differential privacy... a little higher? Okay, cool, awesome, good. I'm going to do a quick high-level overview of the intuition of differential privacy, then jump into how it can look, and is looking, in the code, and I'll give you resources for a deeper dive into differential privacy at the end of the talk, should you be interested.

Differential privacy, loosely stated, is a field that allows you to do statistical analysis without compromising the privacy of the dataset. More specifically, it allows you to query a database while making certain guarantees about the privacy of the records contained within the database. Let me show you what I mean. Let's say we have an example database. This is the canonical database if you look in the literature on differential privacy: it has one row per person and one column of zeros and ones, which correspond to true and false. We don't really care what those zeros and ones indicate. It could be presence of a disease, it could be male/female; it's just some sensitive attribute, something that's worth protecting. Our goal is to ensure that statistical analysis doesn't compromise privacy. What we're going to do is
query this database. We're going to run some function over the entire database, look at the result, and then ask a very important question: if I were to remove someone from this database, say John, would the output of my function change? If the answer to that is no, then intuitively we can say that this output is not conditioned on John's private information. Now, if we could say that about everyone in the database, then we would have a perfectly privacy-preserving query, though it might not be that useful. But this intuitive definition, I think, is quite powerful: the notion of how we can construct queries that are invariant to removing someone, or replacing them with someone else. The maximal amount that the output of a function can change as a result of removing or replacing one of the individuals is known as the sensitivity. This is important: if you're reading the literature and you come across sensitivity, that's what they're talking about.

So what do we do when we have a really sensitive function? We're going to take a bit of a sidestep for a minute. I have a twin sister who's finishing a PhD in political science, and in political science they often need to answer questions about very taboo behavior, something that people are likely to lie about. So let's say I wanted to survey everyone in this room, and I wanted to answer the question: what percentage of you are secretly serial killers? Not because I think any one of you are, but because I genuinely want to understand this trend. I'm not trying to arrest people, and I'm not trying to be an instrument of the criminal justice system; I'm trying to be a sociologist or political scientist and understand this actual trend. The problem is, if I sit down with each one of you in a private room and I say I promise, I
promise, I promise I won't tell anybody, I'm still going to get a skewed distribution. Some people are just going to think, why would I risk telling you this private information? So what sociologists can do is use a technique called randomized response. (I should have brought a coin.) You take a coin and you give it to each person before you survey them, and you ask them to flip it twice somewhere that you cannot see. So I would ask each one of you to flip a coin twice somewhere that I cannot see, and then I would instruct you: if the first coin flip is heads, answer honestly, but if the first coin flip is tails, answer yes or no based on the second coin flip. So roughly half the time you'll be honest, and the other half of the time you'll be giving me a perfect 50/50 coin flip.

The cool thing is that what this is actually doing is taking whatever the true mean of the distribution is and averaging it with a 50/50 coin flip. So if, say, 55 percent of you answered yes, that you are a serial killer, then I know that the true center of the distribution is actually 60%, because it was 60% averaged with a 50/50 coin flip. Does that make sense? However, despite the fact that I can recover the center of the distribution given enough samples, each individual person has plausible deniability. If you said yes, it could have been because you actually are, or it could have been because you just happened to flip a certain sequence of coin flips.

This concept of adding noise to data to give plausible deniability is sort of the secret weapon of differential privacy. The field itself is a set of mathematical proofs for trying to do this as efficiently as possible: to add the smallest amount of noise and get the most accurate results with the best possible privacy protections. There is a meaningful base trade-off that you can't really escape; there's kind of a Pareto
trade-off, and we're trying to push that trade-off down. So the field of research that is differential privacy is looking at how to add noise to data and to the resulting queries, to give plausible deniability to the members of a database or a training dataset. Does that make sense?

Now, a few terms you should be familiar with: there's local and there's global differential privacy. Local differential privacy adds noise to data before it's sent to the statistician. The coin flip was local differential privacy. It affords you the best amount of protection, because you never actually reveal your information in the clear to anyone. Then there's global differential privacy, which says: we're going to put everything in the database, perform a query, and then, before the output of the query gets published, add a little bit of noise to the output of the query. This tends to have a much better privacy trade-off, but you have to trust the database owner not to compromise the results. We'll see there are some other things we can do there. With me so far? This is a good point for questions, if you have any.

Got it. So the question is: is this verifiable? Is any of this process of differential privacy verifiable? That is a fantastic question, and one that absolutely comes up in practice. First, with local differential privacy, the nice thing is everyone's doing it for themselves. In that sense, if you're flipping your own coins and answering your own questions, that's your verification; you're trusting yourself. For global differential privacy, stay tuned for the next tool and we'll come back to that.

All right, so what does this look like in code? First, we have a pointer to a remote private dataset. We call .get, and whoa, we get a big fat error. You just asked to see the raw value of some private data point, which you
cannot do. Instead, pass an epsilon to .get to add the appropriate amount of noise. One thing I haven't mentioned yet in differential privacy: I mentioned sensitivity, which is related to the type of query, the type of function that we want to run, and its invariance to removing or replacing individual entries in the database. Epsilon is a measure of what we call our privacy budget. What our privacy budget says is: what's the upper bound on the amount of statistical uniqueness that I'm going to allow to come out of this database?

Actually, I'm going to take one more sidetrack here, because I think it's really worth mentioning data anonymization. Is anyone familiar with data anonymization? Come across this term before? Taking a document and redacting the social security numbers and all that kind of stuff. By and large, it does not work. If you don't remember anything else from this talk: it is very dangerous to do just dataset anonymization. Differential privacy, in some respects, is the formal version of data anonymization, where instead of just saying "I'm going to redact these pieces and then I'll be fine," it says we can do a lot better.

For example, the Netflix Prize, the Netflix machine learning prize, if you remember it: a big million-dollar prize, and maybe some people in here competed in it. In this prize, Netflix published an anonymized dataset of movies and users. They took all the movies and replaced them with numbers, they took all the users and replaced them with numbers, and then you just had sparsely populated movie ratings in this matrix. Seemingly anonymous, right? There are no names of any kind. But the problem is that each row is statistically unique, meaning it is kind of its own fingerprint. And so, two months after the dataset was published, some
researchers at UT Austin (I think it was UT Austin) were able to go and scrape IMDb, basically create the same matrix from IMDb, and then just compare the two. It turns out that people who were into movie rating were into movie rating in both places, and were watching movies at similar times, with similar patterns and similar tastes, and they were able to de-anonymize this first dataset with a high degree of accuracy.

It happened again: there's a famous case of medical records, I think for a Massachusetts senator (I think it was someone in the Northeast), being de-anonymized through very similar techniques. So one person goes and buys an anonymized medical dataset over here that has birth date and zip code, and this one has zip code and gender, and this one has zip code, gender, and whether or not you have cancer. When you get all of these together, you can start to use the uniqueness in each one to relink it all back together. This is so doable, to the extent that I unfortunately know of companies whose business model is to buy anonymized datasets, de-anonymize them, and sell market intelligence to insurance companies. Ooh, right? But it can be done.

The reason it can be done is that just because the dataset that you are publishing, the one that you are physically looking at, doesn't seem like it has social security numbers and such in it, that doesn't mean there isn't enough unique statistical signal for it to be linked to something else. So when I say maximum amount of epsilon: epsilon is an upper bound on the statistical uniqueness that you're publishing in a dataset. What this tool represents is saying: apply however much noise you need to, given whatever computational graph led back to private data for this tensor, to put an upper bound on the potential for linkage attacks. Now, if you set epsilon to zero, then that's saying,
effectively, I'm only going to allow patterns that have occurred at least twice, meaning two different people had this pattern, and thus it's not unique to either one.

Yes? So the question is what happens if you perform the query twice: the random noise would be re-randomized and sent again. You're absolutely correct. This epsilon is how much I'm spending with this query. This is a 0.1 query, so if I ran it three times, I would spend an epsilon of 0.3. If I did this multiple times, the epsilons would sum. So for any given data science project, what we're advocating is that you're given an epsilon budget that you're not allowed to exceed, no matter how many queries you run. There's another subfield of differential privacy that's looking at single-query approaches, which is all around synthetic datasets: how can I perform one query against the whole dataset and create a synthetic dataset that has certain invariances that are desirable, so that I can do good statistics on it, but then query it as many times as I want? [inaudible] Anyway, we won't get into that now. Does that answer your question? Cool, awesome.

Now, you might think this is a lost cause: how can we be answering questions while keeping statistical signal from getting out? But it's the difference between these: if I have a dataset and I want to know what causes cancer, I could query the dataset and learn that smoking causes cancer without learning which individuals are or are not smokers. Does that make sense? The reason is that I'm specifically looking for patterns that occur multiple times across different people. And this actually happens to really closely mirror the type of generalization that we want in machine learning systems anyway. Does that make sense? Like, as
machine learning practitioners, we're actually not really interested in the one-offs. I mean, sometimes our models memorize things; this happens. But we're actually more interested in the things that are not specific to you. I want the things that are going to work, the heart treatments that are going to work for everyone in this room, not just one person. Obviously, if you need a heart treatment, it'd be cool for you to have one, but what we're chiefly interested in are things that generalize. That's why this is realistic, and why, with continued effort on both the tooling and the theory side, we can have a much better reality than today.

Cool. So, pros, just to review. First, remote execution allows data to remain on the remote machine. With search and sampling, we can feature engineer using toy data. With differential privacy, we have a formal, rigorous privacy budgeting mechanism.

Shoot. How is the privacy budget set? Is it defined by the user, or by the dataset owner, or by someone else? This is a really, really interesting question, actually. First, it's definitely not set by the data scientist, because that would be a bit of a conflict of interest. At first you might say it should be the data owner, so the hospital. They're trying to cover their butt and make sure that their assets are protected, both legally and commercially; they're trying to make money off this, so there are proper incentives there. But the interesting thing, and this gets back to your question, is: what happens if I have, say, a radiology scan in two different hospitals, and they both spend one epsilon worth of my privacy? That means that actually two epsilon of my private information is out there, and it just means that one person has to be clever enough to go to both places and do the join. This is actually the exact same
mechanism we were talking about a second ago, when someone went from Netflix to IMDb. And so the true answer to who should be setting epsilon budgets, although logistically it's going to be challenging (we'll talk about this a little bit in part two of the talk, but I'm going a little bit slow), is that it should be us. It should be people, setting budgets around their own information. You should be setting your personal epsilon budget. Does that make sense? That's an aspirational goal, and we've got a long way to go before we can get to that level of infrastructure around these kinds of things. I'm going to talk about that later, and we can definitely cover it in the question-and-answer session as well, but in theory that's what we want.

Okay, two cons, two weaknesses of this approach that we still have. Someone asked this question; I think it was you? Yeah, you asked the question. So first, the data is safe, but the model is put at risk. And second, what if we need to do a join? And actually, yours is a third one, which I should totally add to the slide. So first, if I'm sending my computations, my model, into the hospital to learn how to be a better cancer classifier, my model is put at risk. It's kind of a bummer if this is a ten-million-dollar healthcare model and I'm just sending it to a thousand different hospitals to learn, so that's potentially risky. Second, what if I need to do a joint computation across multiple different data owners who don't trust each other? Who sends whose data to whom? And thirdly, as you pointed out, how do I trust that these computations are actually happening the way that I am telling the remote machine they should happen?

This brings me to my absolute favorite tool: secure multi-party computation. Come across this before? Raise them high. Okay, cool, a little bit above average. Most machine learning people have not heard about this yet, and this is absolutely the coolest thing I've learned about since
learning about AI and machine learning. This is a really, really cool technique for encrypted computation. How about homomorphic encryption? Have you come across homomorphic encryption? Okay, a few more. Yeah, this is related to that.

First, the textbook definition is like this. If you went on Wikipedia, you'd see that secure MPC allows multiple people to combine their private inputs to compute a function without revealing their inputs to each other. But in the context of machine learning, the implication of this is that multiple different individuals can share ownership of a number. Let me show you what I mean. Let's say I have the number five (my happy smiling face), and I split this into two shares, a two and a three. I've got two friends, Mary Ann and Bobby, and I give them these shares. They are now the shareholders of this number. Now I'm going to go away, and this number is shared between them.

This gives us several desirable properties. First, it's encrypted, from the standpoint that neither Bob nor Mary Ann can tell what number is encrypted between them by looking at their own share by itself. For those of you who are familiar with cryptographic math, I'm hand-waving over this a little bit. Decryption would typically be adding the shares together modulo a large prime, so the shares typically look like large pseudo-random numbers, but for the sake of making it intuitive, I've picked pseudo-random numbers that are convenient to the eyes. So first, these two values are encrypted, and second, we get shared governance, meaning that we cannot decrypt these numbers or do anything with them unless all of the shareholders agree.

But the truly extraordinary part is that while this number is encrypted between these individuals, we can actually perform computation. In this case, let's say we wanted to multiply this encrypted number times two: each person
can multiply their share by two, and now they have an encrypted number ten. And there's a whole variety of protocols allowing you to do different functions, such as the functions needed for machine learning, while numbers are in this encrypted state. I'll give some more resources at the end as well if you're interested in learning more about this. Now, the big tie-in: models and datasets are just large collections of numbers, which we can individually encrypt and which we can individually share governance over. Now, specifically to reference your question, there are two configurations of secure MPC: active and passive security. In the active security model, you can tell if anyone does computation that you did not independently authorize, which is great. So what does this look like in practice when you go back to the code? In this case we don't need just one worker, it's not just one hospital, because we're looking to have shared governance, shared ownership, amongst multiple individuals. So let's say we have Bob, Alice, and Theo, and a crypto provider, which we won't go into now. I can take a tensor, and instead of calling .send and sending that tensor to someone else, now I call .share, and that splits each value into multiple different shares and distributes those amongst the shareholders, in this case Bob, Alice, and Theo. However, in the frameworks that we're working on, you still get the same PyTorch-like interface, and all the cryptographic protocol happens under the hood. And the idea here is to make it so that we can do encrypted machine learning without you necessarily having to be a cryptographer. And vice versa: cryptographers can improve the algorithms, and machine learning people can automatically inherit them. So it's the classic open source machine learning library, making complex intelligence more accessible to people, if that makes sense. And what we can do on tensors we can also do on models, so we can do encrypted
training and encrypted prediction, and we're going to get into what kind of awesome use cases this opens up in a bit. And this is a nice set of features. In my opinion, this is the MVP of doing privacy-preserving data science. The idea being that I can have remote access to a remote dataset; I can learn high-level latent patterns, like what causes cancer, without learning whether individuals have cancer; I can pull back just that high-level information, with formal mathematical guarantees over the filter that's coming back through here; and I can work with datasets from multiple different data owners while making sure that each individual data owner is protected. Now, what's the catch? First is computational complexity. Encrypted computation, secure MPC, involves sending lots of information over the network. I think the state of the art for deep learning prediction is a 13x slowdown over plaintext, which is inconvenient but not deadly. But you do have to understand that that assumes it's like two AWS machines talking to each other, so they're relatively fast. But we also haven't had any hardware optimization to the extent that, you know, Nvidia did a lot for deep learning. There'll probably be some sort of Cisco-type player that does something similar for encrypted or secure-MPC-based deep learning. Let's see. So this brings us back to the fundamental question: is it possible to answer questions using data we cannot see? The theory is absolutely there. That's something I feel reasonably confident saying, given the theoretical frameworks that we have. And actually, the other thing that's really worth mentioning here is that these come from totally different fields, which is why they haven't necessarily been combined
that much yet. I'll get more into that in a second. But it's my hope that by considering what these tools can do, that'll open up your eyes to the potential that, in general, we can have this new ability to answer questions using information that we don't actually own ourselves. Because from a sociological standpoint, that's net new for us as a species, if that makes sense. Previously we had to have a trusted third party who would take all the information in themselves and make some sort of neutral decision. So we'll come to that in a second. One of the big long-term goals of our community is to make infrastructure for this secure enough and robust enough, and of course in a free, Apache 2 open-source-license kind of way, that information on the world's most important problems will be this accessible, and we can spend less time working on tasks like that and more time on tasks like this. So this is going to be the breaking point between part one and part two. Part two will be a bit shorter. But if you're interested in diving deeper on the technicals of this, here's a six- or seven-hour course that I taught just on these concepts and the tools. It's free on Udacity; feel free to check it out. So the question was, he's asking about how a model can be encrypted during training: is that the same as homomorphic encryption, or is that something else? So a couple of years ago there was a big burst in literature around training on encrypted data, where you would homomorphically encrypt the dataset, and it turned out that some of the statistical regularities of homomorphic encryption allowed you to actually train on that dataset without decrypting it. So this is similar to that, except the one downside to that is that in order to use that model in the future, you still have to be able to encrypt data with the same key, which is often constraining in practice. And also there's
a pretty big hit to accuracy, because you're training on data that inherently has a lot of noise added to it. What I'm advocating for here is that instead we actually encrypt both the model and the dataset during training, but inside the encryption, inside the box, it's actually performing the same computations that it would be doing in plaintext. So you don't get any degradation in accuracy, and you don't get tied to one particular public/private key pair. Yeah? Yeah. So the question was to comment on federated learning, specifically Google's implementation. I think Google's implementation is great. Obviously, the fact that they've shown that this can be done with hundreds of millions of users is incredibly powerful, I mean, even inventing the term and creating momentum in that direction. I think there's one thing that's worth mentioning, which is that there are two forms of federated learning. One is the one where your model is... sorry, I forgot to talk about what federated learning is. OK, yes, I'll do that quickly. So federated learning is basically the first thing I talked about, remote execution. So everyone has a smartphone. When you plug your phone in at night, if you've got Android or iOS, and attach to Wi-Fi: you know when you text and it recommends the next word, next-word prediction? That model is trained using federated learning, meaning that it learns on your device to do that better, and then that model gets uploaded to the cloud, as opposed to uploading all of your tweets to the cloud and training one global model. Does that make sense? So you plug in your phone at night, the model comes down, trains locally, goes back up. It's federated. That's basically federated learning in a nutshell. It was pioneered by the cork team at Google, and they do really fantastic work. They've paid down a lot of the technical debt, a lot of the technical risk around it.
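As a rough sketch of the "model comes down, trains locally, goes back up" loop described in the talk, here is a toy version of federated averaging in plain Python. Everything in it is an assumption made for illustration: the one-parameter linear model, the function names `local_train` and `federated_round`, the learning rate, and the made-up device data. It is not Google's implementation, and it includes none of the protections (such as differential privacy or encryption) discussed elsewhere in the talk.

```python
# Toy federated averaging, assuming a one-parameter model y ≈ w * x.
# All names and numbers are hypothetical and chosen for illustration only.

def local_train(w, data, lr=0.01, epochs=5):
    """Run a few gradient steps on ONE device's private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error for y ≈ w * x
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(global_w, devices):
    """Model comes down, trains locally, and only the weights go back up."""
    total = sum(len(d) for d in devices)
    # Each device returns its updated weight, never its raw data.
    updates = [(local_train(global_w, d), len(d)) for d in devices]
    # The server averages the weights, weighted by each device's data size.
    return sum(w * n for w, n in updates) / total

# Made-up private datasets, one per device (all follow y = 3 * x).
devices = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]

global_w = 0.0
for _ in range(50):
    global_w = federated_round(global_w, devices)

print(round(global_w, 3))
```

The point of the sketch is only the data flow: the server sees weights, not examples. As the talk notes, that alone does not stop a model from memorizing what it trained on.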
And they publish really great papers outlining how they do it, which is fantastic. What I outlined here is actually a slightly different style of federated learning. There's federated learning with a fixed dataset and a fixed model and lots of users, where the data is very ephemeral: phones are constantly logging in and logging off, you're plugging your phone in at night and then you're taking it out. This is the one style of federated learning. It's really useful for product development. So it's useful if you want to do a smartphone app that has a piece of intelligence in it, but the data to train that intelligence is going to be prohibitively difficult for you to get access to, or you want to just have a value prop of protecting privacy. That's what that style of federated learning is good for. What I've outlined here is a bit more exploratory federated learning, where instead of the model being hosted in the cloud and data owners showing up and making it a bit smarter every once in a while, now the data is going to be hosted at a variety of different private clouds, and data scientists are going to show up and say, "Hmm, I want to do something with diabetes today," or "Hmm, I want to do something with studying dementia today," something like that. This is much more difficult because the attack vectors for this are much larger. I'm trying to be able to answer arbitrary questions about arbitrary datasets in a protected environment. So I think, yeah, that's my general thoughts. Does federated learning leak information? So federated learning by itself is not a secure protocol, and that's why there's this ensemble of techniques that I've... So the question was, does federated learning leak information? It is perfectly possible for a federated learning model to simply memorize the dataset
and then spit that back out later. You have to combine it with something like differential privacy in order to be able to prevent that from happening. Does that make sense? Just because the training is happening on my device does not mean it's not memorizing my data. Does that make sense? OK. So now I want to zoom out and go a little less from the data science practitioner perspective, and now take more the perspective of an economist or scientist, or someone looking globally at, OK, what if this becomes mature, what happens? And this is where it gets really exciting. Anyone entrepreneurial? Anyone? Everyone, I know. OK, cool. Well, this is the part for you. So the big difference is this ability to answer questions using data you can't see. Because as it turns out, most people spend a great deal of their life just answering questions, and a lot of it involves personal data. I mean, whether it's minute things like, you know, where's my water, where are my keys, or what movie should I watch tonight, or what kind of diet should I have to be able to sleep well. I mean, a wide variety of different questions. And we're limited in our answering ability to the information that we have. So this ability to answer questions using data we don't have, sociologically, I think is quite important. And there are four different areas that I want to highlight as big groups of use cases for this kind of technology, to help inspire you to see where this infrastructure can go. And actually, before I jump into that, has anyone been to Edinburgh? Edinburgh? Cool. Did you tour the castle and stuff like that? So my wife and I went to Edinburgh for the first time six months ago, in September, and we did the underground, was it the... we did a ghost tour. Yeah, we did the ghost tour, and it was really cool. There was something that I took away from it. There was
this point where we were standing, we had just walked out of the tunnels, and she was pointing out some of the architecture, and then she started talking about the cobblestone streets and why the cobblestone streets were there. One of the main purposes of them was to lift you out of the muck, and the reason the muck was there is that they didn't have any internal plumbing, and so the sewage just poured out into the street if you lived in a big city. And this was the norm everywhere. And actually, I think she even implied that the invention or popularization of the umbrella had less to do with actual rain and a bit more to do with buckets of stuff coming down from on high, which is a whole different world when you think about what that is. But the reason I bring this up is that, however many hundred years ago, people were walking through sludge. Sewage was just everywhere. It was all over the place, and people were walking through it everywhere they went, and they were wondering why they got sick, in many cases. And it wasn't because they wanted it to be that way; it was just a natural consequence of the technology they had at the time. This is not malice. This is not anyone being good or bad or evil or whatever. It's just the way things were. And I think there's a strong analogy to be made with how our data is handled as a society at the moment. We've just walked into a society where we've had new inventions come up, and new things that are practical, new uses for them, and now everywhere we go we're constantly spreading and spewing our data all over the place. I mean, every camera that sees me walking down the street. Goodness, there's a company that takes a picture of the whole earth by satellite every day. Like, how the hell am I supposed to do anything without everyone following me around all
the time? And I imagine that whoever it was, I'm not a historian so I don't really know, but whoever it was that said, "What if we ran plumbing from every single apartment, business, school, maybe even some public toilets, underground, under our city, all to one location, and then processed it, used chemical treatments, and then turned that into usable drinking water?" Like, how laughable would that have been? It would have been just the most massive logistical infrastructure problem ever: to take a working city, dig up the whole thing, to take already constructed buildings and run pipes through all of them. I mean, at Oxford, gosh, there's a building there that's so old they don't have showers, because they didn't want to run the plumbing for it. You have to ladle water over yourself. It's in Merton College; it's quite famous. Anyway, the infrastructure challenge must have seemed absolutely massive. And so, as I'm about to walk through four broad areas where things could be different, theoretically, based on this technology, I think it's probably going to hit you like, "Whoa, that's a lot of code," or like, "Whoa, that's a lot of change." But I think that the need is sufficiently great. I mean, if you view our lives as just one long process of answering important questions, whether it's where we're going to get food or what causes cancer, then making sure that the right people can answer questions, without data just getting spewed everywhere so that the wrong people can answer their questions, is important. And yeah, anyway, I know this is going to sound like there's a certain ridiculousness to maybe what some of this will be, but I hope that you at least see that, theoretically, the basic blocks are there, and that really what stands between us and a world that's fundamentally different is adoption, maturing of the technology, and good engineering
because I think, you know, once Sir Thomas Crapper invented the toilet, right? I do remember that one. At that point the basics were there, and what stood between them and that world was implementation, adoption, and engineering. And I think that's where we are. And the best part is we have companies like Google that have already paved the way with some very, very large rollouts of the early pieces of this technology. All right, cool. So what are the big categories? One I've already talked about: open data for science. OK, so this one is a really big deal, and the reason it's a really big deal is mostly because everyone gets excited about making AI progress. Everyone gets super excited about superhuman ability in X, Y, or Z. When I started my PhD at Oxford, my professor's name is Phil Blunsom, the first thing he told me when I sat my butt down in his office on my first day as a student, he said, "Andrew, everyone's going to work on models, but if you look historically, the biggest jumps in progress have happened when we had new big datasets or the ability to process new big datasets." And just to give a few anecdotes: ImageNet, right? ImageNet, GPUs allowing us to process large datasets. Even things like AlphaGo: this is synthetically generated, infinite datasets. Or, I don't know, did anyone watch the AlphaStar livestream on YouTube? They talked about how it had trained on like 200 years of StarCraft. Or if you look at Watson playing Jeopardy, this was on the heels of a new large structured dataset based on Wikipedia. Or if you look at Garry Kasparov and IBM's Deep Blue, this was on the heels of the largest open dataset of chess matches having been published online. There's this echo, where it's like: big new dataset, big new breakthrough; big new dataset, big new breakthrough. And what we're talking about here is potentially several orders of
magnitude more data relatively quickly and the reason for that is