Transcript
vaFhEgLoUe0 • Michael Kearns: Differential Privacy
Kind: captions • Language: en

So is there hope for any kind of privacy in a world where a few likes can identify you?

There is differential privacy. Differential privacy is basically an alternate, much stronger notion of privacy than these anonymization ideas. It's a technical definition, but the spirit of it is that we compare two alternate worlds. Suppose I'm a researcher, and there's a database of medical records, one of which is yours, and I want to use that database to build a predictive model for some disease: based on people's symptoms, test results, and the like, I want to build a model predicting the probability that people have the disease. This is the type of scientific research that we would like to be able to continue. Differential privacy asks a very particular counterfactual question; we basically compare two alternatives. In one, I build this model on the database of medical records including your medical record. In the other, I do the same exercise with the same database, with just your medical record removed. So it's two databases, one with n records in it and one with n-1 records in it; the n-1 records are the same, and the only one missing in the second case is your medical record. Differential privacy basically says that any harms that might come to you from the analysis in which your data was included are essentially identical to the harms that would have come to you if the same analysis had been done without your medical record included. In other words, this doesn't say that bad things cannot happen to you as a result of data analysis; it just says that those bad things were going to happen to you anyway, even if your data wasn't included. To give a very concrete example, as we
discussed at some length: the study done in the 1950s that established the link between smoking and lung cancer. We made the point that if your data was used in that analysis, and the world knew that you were a smoker (there was no stigma associated with smoking before those findings), real harm might have come to you as a result of that study your data was included in. In particular, your insurer might now have a higher posterior belief that you might have lung cancer, and raise your premiums; you've suffered economic damage. But the point is that if the same analysis had been done with all the other n-1 medical records and just yours missing, the outcome would have been the same. Your data wasn't idiosyncratically crucial to establishing the link between smoking and lung cancer, because that link is a fact about the world that can be discovered with any sufficiently large database of medical records.

So that's showing that very little harm is done. Great. But that's the beautiful statement of it; what is the mechanism by which privacy is preserved?

It's basically by adding noise to computations. The basic idea is that every differentially private algorithm, or at least every useful one, is a probabilistic algorithm: given the same input multiple times, it would give different outputs each time, drawn from some distribution. The way you achieve differential privacy algorithmically is by carefully and tastefully adding noise to a computation in the right places. To give a very concrete example, if I want to compute the average of a set of numbers, the non-private way of doing that is to take
those numbers, average them, and release a numerically precise value for the average. In differential privacy you wouldn't do that: you would first compute that average to numerical precision, and then add some noise to it, some kind of zero-mean Gaussian or exponential noise, so that the actual value you output is not the exact mean. It will be close to the mean, but the noise you add ensures that nobody can reverse-engineer any particular value that went into the average.

So noise is the savior. How many algorithms can be aided by adding noise?

I'm a relatively recent member of the differential privacy community. My co-author Aaron Roth is really one of the founders of the field and has done a great deal of work on it, and I've learned a tremendous amount working with him.

A growing field already?

Yeah, but by now it's pretty mature. I must admit, the first time I saw the definition of differential privacy, my reaction was: wow, that is a clever definition, and it's really making very strong promises. When I first saw the definition, in much earlier days, my worry was that it's a great definition of privacy, but that it would be so restrictive that we wouldn't really be able to use it: we wouldn't be able to compute many things in a differentially private way. One of the great successes of the field, I think, is showing that the opposite is true: most things that we know how to compute absent any privacy considerations can be computed in a differentially private way. For example, pretty much all of statistics and machine learning can be done differentially privately. Pick your favorite machine learning algorithm: backpropagation in neural networks, CART for decision trees, support vector machines, boosting, you name it,
as well as classic hypothesis testing and the like in statistics. None of those algorithms are differentially private in their original form; all of them have modifications that add noise to the computation, in different places and in different ways, that achieve differential privacy. This really means that, to the extent that we've become a scientific community very dependent on machine learning, statistical modeling, and data analysis, we really do have a path to provide privacy guarantees for those methods. And so we can still enjoy the benefits of the data science era while providing rather robust privacy guarantees to individuals.
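The two-worlds comparison Kearns describes has a standard formal statement, not spelled out in the conversation; as a sketch, using the usual textbook notation: a randomized algorithm M is ε-differentially private if, for every pair of databases D and D' differing in a single record, and every set S of possible outputs,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S]
```

A small ε means the output distribution, and hence any downstream harm, is nearly the same whether or not your record is present, which is exactly the counterfactual guarantee described above.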
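The noisy-average idea from the conversation can be sketched in a few lines of Python. This is a minimal illustration of the Laplace mechanism, not code from the episode; the function name `dp_mean` and the choice to clip values into a fixed range are assumptions made here for the sake of a self-contained example.

```python
import random

def dp_mean(values, epsilon, lower=0.0, upper=1.0):
    """Differentially private mean via the Laplace mechanism (sketch).

    Values are clipped to [lower, upper], so one record can change the
    mean by at most (upper - lower) / n -- that is the sensitivity the
    noise scale is calibrated to.
    """
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / n
    sensitivity = (upper - lower) / n
    scale = sensitivity / epsilon
    # Zero-mean Laplace noise: the difference of two i.i.d. exponentials
    # with mean `scale` is Laplace-distributed with that scale.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_mean + noise
```

Note how the noise scale shrinks as 1/n: with a large database, the released mean is very close to the true mean, echoing the point that population-level facts survive while any individual record is hidden.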