Transcript
vaFhEgLoUe0 • Michael Kearns: Differential Privacy
Kind: captions • Language: en

So is there hope for any kind of privacy in a world where a few likes can identify you?

There is differential privacy. Differential privacy is basically an alternate, much stronger notion of privacy than these anonymization ideas. It's a technical definition, but the spirit of it is that we compare two alternate worlds. Suppose I'm a researcher, and there's a database of medical records, one of which is yours, and I want to use that database to build a predictive model for some disease: based on people's symptoms, test results, and the like, I want to build a model predicting the probability that people have the disease. This is the type of scientific research that we would like to be able to continue. Differential privacy asks a very particular counterfactual question; we basically compare two alternatives. In one, I build this model on the database of medical records including your medical record. In the other, I do the same exercise with the same database, with just your medical record removed. So it's two databases, one with n records in it and one with n-1 records in it; the n-1 records are the same, and the only one missing in the second case is your medical record. Differential privacy basically says that any harms that might come to you from the analysis in which your data was included are essentially identical to the harms that would have come to you if the same analysis had been done without your medical record included. In other words, this doesn't say that bad things cannot happen to you as a result of data analysis; it just says that those bad things were going to happen to you anyway, even if your data wasn't included. To give a very concrete example, as we
discussed at some length: the study done in the 1950s that established the link between smoking and lung cancer. We made the point that if your data was used in that analysis, and the world knew that you were a smoker (there was no stigma associated with smoking before those findings), real harm might have come to you as a result of that study your data was included in. In particular, your insurer might now have a higher posterior belief that you might have lung cancer, and raise your premiums; you've suffered economic damage. But the point is that if the same analysis had been done with all the other n-1 medical records and just yours missing, the outcome would have been the same. Your data wasn't idiosyncratically crucial to establishing the link between smoking and lung cancer, because that link is a fact about the world that can be discovered with any sufficiently large database of medical records.

So that's showing that very little harm is done. Great. But that's the beautiful statement of it; what is the mechanism by which privacy is preserved?

It's basically by adding noise to computations. The basic idea is that every differentially private algorithm, or at least every useful one, is a probabilistic algorithm: given the same input multiple times, it would give different outputs each time, drawn from some distribution. The way you achieve differential privacy algorithmically is by carefully and tastefully adding noise to a computation in the right places. To give a very concrete example, if I want to compute the average of a set of numbers, the non-private way of doing that is to take
those numbers, average them, and release a numerically precise value for the average. In differential privacy you wouldn't do that: you would first compute that average to numerical precision, and then add some noise to it, some kind of zero-mean Gaussian or exponential noise, so that the actual value you output is not the exact mean. It will be close to the mean, but the noise you add ensures that nobody can reverse-engineer any particular value that went into the average.

So noise is the savior. How many algorithms can be aided by adding noise?

I'm a relatively recent member of the differential privacy community. My co-author Aaron Roth is really one of the founders of the field and has done a great deal of work on it, and I've learned a tremendous amount working with him.

A growing field already?

Yeah, but by now it's pretty mature. I must admit, the first time I saw the definition of differential privacy, my reaction was: wow, that is a clever definition, and it's really making very strong promises. When I first saw the definition, in much earlier days, my worry was that it's a great definition of privacy, but that it would be so restrictive that we wouldn't really be able to use it: we wouldn't be able to compute many things in a differentially private way. One of the great successes of the field, I think, is showing that the opposite is true: most things that we know how to compute absent any privacy considerations can be computed in a differentially private way. For example, pretty much all of statistics and machine learning can be done differentially privately. Pick your favorite machine learning algorithm: backpropagation in neural networks, CART for decision trees, support vector machines, boosting, you name it,
as well as classic hypothesis testing and the like in statistics. None of those algorithms are differentially private in their original form; all of them have modifications that add noise to the computation, in different places and in different ways, that achieve differential privacy. This really means that, to the extent that we've become a scientific community very dependent on machine learning, statistical modeling, and data analysis, we really do have a path to provide privacy guarantees for those methods. And so we can still enjoy the benefits of the data science era while providing rather robust privacy guarantees to individuals.
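The two-worlds comparison Kearns describes has a standard formal statement, not spelled out in the conversation; as a sketch, using the usual textbook notation: a randomized algorithm M is ε-differentially private if, for every pair of databases D and D' differing in a single record, and every set S of possible outputs,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S]
```

A small ε means the output distribution, and hence any downstream harm, is nearly the same whether or not your record is present, which is exactly the counterfactual guarantee described above.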
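The noisy-average idea from the conversation can be sketched in a few lines of Python. This is a minimal illustration of the Laplace mechanism, not code from the episode; the function name `dp_mean` and the choice to clip values into a fixed range are assumptions made here for the sake of a self-contained example.

```python
import random

def dp_mean(values, epsilon, lower=0.0, upper=1.0):
    """Differentially private mean via the Laplace mechanism (sketch).

    Values are clipped to [lower, upper], so one record can change the
    mean by at most (upper - lower) / n -- that is the sensitivity the
    noise scale is calibrated to.
    """
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / n
    sensitivity = (upper - lower) / n
    scale = sensitivity / epsilon
    # Zero-mean Laplace noise: the difference of two i.i.d. exponentials
    # with mean `scale` is Laplace-distributed with that scale.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_mean + noise
```

Note how the noise scale shrinks as 1/n: with a large database, the released mean is very close to the true mean, echoing the point that population-level facts survive while any individual record is hidden.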