Privacy Preserving AI (Andrew Trask) | MIT Deep Learning Series

4zrU54VIK6k • 2020-01-19

Kind: captions
Language: en
today we were happy very happy to have
Andrew Trask he's a brilliant writer
researcher tweeter that's a word in the
world of machine learning and artificial
intelligence he is the author of
grokking deep learning the book that I
highly recommended in the lecture on
Monday
he's the leader and creator of OpenMined
which is an open source community that
strives to make our algorithms our data
and our world in general more
privacy-preserving
he is coming to us by way of Oxford but
without that rich complex beautiful
sophisticated British accent
unfortunately he is one of the best
educators and truly one of the nicest
people I know so please give him a warm
welcome
thanks that was a very generous
introduction
so yeah today we're going to be talking
about privacy preserving AI this talk
is going to come in two parts so the
first is going to be looking at sort
of privacy tools from the context of a
data scientist or a researcher like how
their actual UX might change because I
think that's sort of the best way to
communicate some of the new technologies
that are that are coming about in that
context and then we're going to zoom out
and look at under the assumption that
these kinds of technologies become
mature what is that going to do to kind
of society like what sort of
consequences or side effects could
these kinds of tools have both positive
and negative so first let's ask the
question is it possible to answer
questions using data that we cannot see
this is going to be the key question
that we look at today and let's let's
start with an example so first if we
wanted to answer the question what do
tumors look like in humans well this is
pretty complex question
you know tumors are pretty complicated
things so we might train a classifier if
we wanted to do that we would first need
to download a data set of tumor related
images right so that we could statistically
study these and be able to recognize
what tumors look like in humans but
this kind of data is not very easy to
come by
right so it's it's very rarely that it's
collected it's kind of difficult to move
around highly regulated and so we're
probably going to buy it from a
relatively small number of sources that
are able to actually collect
and manage this kind of information the
scarcity and sort of constraints
around this are likely to make this a
relatively expensive purchase and if
it's going to be an expensive purchase
for us to answer this question well then
we're going to find someone to sort of
finance our project and if we need
someone to finance a project we have to
we have to come up with a way of how
we're going to pay them back
so we have to create a business plan and
we have to find a business partner and if
we're gonna find a business partner we have to
spam all our classmates on LinkedIn
looking for someone to start a
business with us right now all this is because we
wanted to answer the question what do
tumors look like in humans what if we
want to answer a different question what
if we wanted to answer the question what
do handwritten digits look like well
this would be a totally different story
right we download the data set we
download a state-of-the-art training
script from github we'd run it and a few
minutes later we have you know an ability
to classify handwritten digits with
potentially superhuman ability right if
such a thing exists and why is this so
different between these two questions
the reason is that getting access to
private data data about people is
really really hard and as a result we
spend most of our time working on
problems and tasks like this
so ImageNet MNIST CIFAR-10 anybody who's
trained a classifier on MNIST before
raise your hand
I expect pretty much everybody instead
of working on problems like this has
anyone trained a classifier to predict
dementia diabetes Alzheimer's like is
she going depression anxiety no one so
why is it that we spend all our time on
tasks like this when these tasks
represent you know our friends loved
ones and problems in society that really
really matter that's not to say that there aren't
people working on this absolutely
you know there are whole fields
dedicated to it but but sort of the
machine learning community at large
these tasks are pretty inaccessible in
fact in order to work on one of these
just getting access to the data you'd
have to dedicate like a portion of your
life just to getting access to it
whether it's you know doing a start-up
or or you know joining a hospital or or
what-have-you whereas for other kinds of
data
they're just simply readily accessible
this brings us back to our question is
it possible to answer questions using
data that we cannot see so in this talk
we're gonna walk through a few different
techniques and if the answer to this
question is yes the combination of these
techniques is going to try to make it so that
we can actually pip install access to
data sets like these in the same way
that we pip install access to other
deep learning tools and the idea here is
to lower the barrier to entry to
increase the accessibility to some of
the most important problems that we
would like to address so as as Lex
mentioned I lead a community called
OpenMined which is an open source community
of a little over six thousand people who
are focused on sort of lowering the
barrier to entry to privacy preserving
AI machine learning specifically one of
the tools that we're working on that we're
talking about today is called PySyft
PySyft extends the major deep learning
frameworks with the ability to do
privacy-preserving machine learning so
specifically today we're gonna be
looking at the extensions into PyTorch
so PyTorch people familiar with
PyTorch yeah quite a few users and it's my
hope that by walking through a few of these
tools it'll become sort of clear how we
can start to be able to do sort of data
science the act of sort of answering
questions using data using data that we
don't actually have direct access to
right and then on the second half of the
talk we're going to generalize this to
answering questions even if you're not
not necessarily a data scientist so
first first tool is remote execution
okay so let me just walk you
through this so we're going to jump into code
for a minute but hopefully this is sort
of line by line and relatively simple
and even if you aren't familiar with
PyTorch I think it's relatively intuitive
looking at like lists of numbers and
these kinds of things so up at the top
we import torch as a deep learning
framework Syft extends torch with this
thing called TorchHook all it's doing
is just iterating through the library
and basically monkey patching in lots of
new functionality and most deep learning
frameworks are built around one core
primitive and that core primitive is the
tensor right so you know and for those
of you who don't know what tensors are
just think of them as nested lists of
numbers for now and and that'll be good
enough for this this talk but for us we
introduced a second core primitive which
is the worker right and a worker is a
location within which computation
is going to be occurring alright so in
this case we have a virtualized worker
that is that is pointing to say a
hospital data center right and the
assumption that we have is that this
worker will allow us to run computation
inside of the data center without us
actually having direct access to that
worker itself right it gives us a
limited sort of whitelisted set of
methods that we can use on this on this
remote machine so just to give you
an example so there's that core primitive I
talked about a minute ago we have the torch
tensor so one two three four five
and the first method that we added is
called just .send right and it does
exactly what you might expect it takes
the tensor serializes it sends it into
the hospital data center and returns
back to me a pointer this pointer is
really really special and for those of
you actually familiar with deep learning
frameworks I hope that this will just
really resonate with you because it has
the full PI torch API as a part of it
but whenever you execute something using
this pointer instead of it running
locally even though it looks like and
feels like it's running locally it
actually executes on the remote machine
and returns back to you another pointer
to the result right the idea here being
that I can now coordinate remote
executions remote computations
without necessarily having to have
direct access to the machine and of
course I can make a get request and we'll
see that this is actually really really
important so getting permissions around
when you can do get requests and
actually ask for data from a remote
machine back to you so just remember
that cool so this is just this is where
we start so in the kind of like the
Pareto principle you know 80% for 20%
this is like the the first big cut right
so pros data remains on a remote
machine we can now in theory do data
science on a machine that we don't have
access to that we don't know right but
the problem is the first con we
want to address is how can we actually
do good data science without physically
seeing the data all right so it's all
well and good to say oh I'm gonna train
a deep learning classifier but the
process of answering questions is
inherently iterative right it's
inherently sort of sort of give-and-take
and I learn a little bit and I ask a
little bit I learn a little bit and I
ask a little bit right this brings me
to the second tool so search and example
data again we're starting really simple
it will get more complex here in a
minute so in this case let's say we have
what's called a grid so PyGrid if
PySyft is the library PyGrid
is sort of the platform version so it's
sort of again this is all open source
Apache 2 stuff so we have what's
called a grid client so this is this
could be a interface to a large number
of data sets inside of a big hospital
right and so let's say I wanted to train
a classifier to do something with
diabetes right so it's going to predict
diabetes or predict a certain kind of
diabetes or a certain attribute of diabetes
right I should be able to perform remote
search I get back pointers to the
remote information I can get back sort
of detailed descriptions of what the
information is without me actually
looking at it right so how it was
collected what the rows and columns are
what the types of different information
is what the various ranges of the values
can take on things that allow me to do
sort of remote normalization these kinds
of things and then in some cases even
look at samples of this data so this
these samples could be sort of human
curated they could be generated from
a GAN they could be
actually you know short snippets from
the actual data set and maybe it's okay
to release small amounts but not large
amounts and and the reason that I
highlight this this isn't like crazy
complex stuff so prior to going back to
school I used to work for a company
called digital reasoning we did sort of
on-prem data science right so we did
delivered sort of AI services to
corporations behind the firewall so we
did you know classified information we
worked with investment banks you know
helping prevent insider trading and and
doing data science on data that
your home team you know back in
Nashville in our case is not able
to see is really really challenging but
there are some things that that can give
you sort of the first big jump before
you jump into kind of the more complex
tools to handle some of the more more
challenging use cases
cool so basic remote execution so
remote procedure calls basic sort of private
search and the ability to kind of look
at sample data gives us enough sort of
general context to be able to just start
doing sort of things like feature
engineering and evaluating quality okay
so now the data remains on the remote
machine we can do some basic feature
engineering and here's where things get
a little more complicated okay so if you
remember in the very first slide where I
show you some code at the bottom I call
dot get on the tensor right
and what that did was it took the
pointer to remote information and said
hey send that information to me that is
an incredibly important bottleneck right
and unfortunately despite the fact that
I'm doing all my remote execution if
that's just naively implemented well I
can just steal all the data that I want
to right I just call .get on whatever
pointers I want and there's
sort of no additional real
security so what are we gonna do about
this
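Editor's sketch: the remote execution pattern and the .get bottleneck described above can be illustrated with a toy model in plain Python. This is not PySyft's API; `ToyWorker`, `ToyPointer`, and `receive` are hypothetical names invented for illustration, and the "remote" machine is just an in-memory dictionary.

```python
class ToyWorker:
    """Hypothetical stand-in for a remote machine, e.g. a hospital data center."""

    def __init__(self, name):
        self.name = name
        self._store = {}     # object id -> value held on the "remote" side
        self._next_id = 0

    def receive(self, value):
        # Store the value remotely and hand back only a pointer to it.
        obj_id = self._next_id
        self._next_id += 1
        self._store[obj_id] = value
        return ToyPointer(self, obj_id)


class ToyPointer:
    """Local handle to a value that lives on a ToyWorker."""

    def __init__(self, worker, obj_id):
        self.worker = worker
        self.obj_id = obj_id

    def __add__(self, other):
        # The addition runs where the data lives; only a new pointer comes back.
        left = self.worker._store[self.obj_id]
        right = other.worker._store[other.obj_id]
        return self.worker.receive([a + b for a, b in zip(left, right)])

    def get(self):
        # Naive get: no permission check, so the raw values simply leave the worker.
        return self.worker._store.pop(self.obj_id)


hospital = ToyWorker("hospital")
x = hospital.receive([1, 2, 3, 4, 5])   # plays the role of sending a tensor
y = x + x                               # computed "remotely", returns a pointer
print(type(y).__name__)                 # ToyPointer
print(y.get())                          # [2, 4, 6, 8, 10]
```

The last line is the weak spot the talk points at: with an unrestricted get, anyone holding a pointer can pull the underlying data back out.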
this brings us to tool number three called
differential privacy differential
privacy
little higher okay cool awesome good
so I'm gonna do a quick high-level
overview of the intuition of
differential privacy and I'm gonna jump
into how it can and does
look sort of in the code and I
will give you resources for kind of a
deeper dive into differential privacy
at the end of the talk should you be
interested so differential privacy
loosely stated is a field that allows
you to do statistical analysis without
compromising the privacy of the data set
right so it more specifically it allows
you to query a database right while
making certain guarantees about the
privacy of the other records contained
within the database so let me show you
what I mean let's say we have an example
database and so this is kind of the
canonical DB if you look in the
literature for differential privacy
it'll have sort of one row per
person and one column
of zeros and ones which corresponds to
true and false
we don't actually really care what those
zeros and ones are indicating you know
it could be presence of a disease could
be male-female could be it's just some
some sensitive attributes something
that's that's worth protecting right now
what we're going to do is we're going to
our goal is to ensure statistical
analysis doesn't compromise privacy what
we're going to do is query this database
right so we're gonna run some function
over the entire database and we're going
to look at the result and we're gonna
ask a very important question we're
going to ask if I were to remove someone
from this database say John with the
output of my function change okay and if
the answer to that is no then
intuitively we can we can we can say
that well this this output is not
conditioned on John's private
information now if we could say that
about everyone in the database
right well then okay it would be a
perfectly privacy-preserving query right
but it might not be that useful but this
intuitive definition I think is quite
powerful right the notion of how can we
construct queries that are invariant to
removing someone or replacing them with
someone else okay and the notion of the
maximal amount that the output of a
function can change as a result of
removing or replacing one of the
individuals is known as the sensitivity
okay so it's important so if you're reading
the literature and you
come across sensitivity that's what it's talking
about
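Editor's sketch: the "remove one person and see how much the output changes" idea can be tried directly on a toy database. The database values and the helper name `empirical_sensitivity` are assumptions for illustration. Note that this measures the change for this one database only, whereas the formal sensitivity in the literature is the worst case over all possible neighboring databases.

```python
def empirical_sensitivity(query, db):
    """Largest change in query(db) caused by removing any single row of db."""
    full = query(db)
    return max(abs(full - query(db[:i] + db[i + 1:])) for i in range(len(db)))


def total(rows):
    return sum(rows)


def mean(rows):
    return sum(rows) / len(rows)


# One row per person, one 0/1 column for some sensitive attribute.
db = [1, 0, 1, 1, 0, 0, 1, 0]

print(empirical_sensitivity(total, db))   # 1
print(empirical_sensitivity(mean, db))    # about 0.0714, i.e. 1/14
```

A sum over 0/1 values can move by at most 1 when one person is removed, while a mean over many people moves much less, which is why different queries need different amounts of noise.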
so what do we do when we have a really
sensitive function we're gonna take a
bit of a sidestep for a minute I have a
sister a twin sister who's finishing a
PhD in political science and in political
science often they need to answer
questions about very taboo behavior okay
something that people are likely to lie
about so let's say I wanted to survey
everyone in this room and I wanted to
answer the question what percentage of
you are you know secretly serial killers
right and not because like yeah
not because I think any one of
you are but because I genuinely want to
understand this trend right I'm not
trying to arrest people I'm not trying
to sort of sort of be an instrument of
the criminal justice system I'm trying
to be you know a sociologist or political
scientist and understand this this
actual trend the problem is if I sit
down with each one of you in a private
room and I say I promise I promise I
promise I won't tell anybody right I'm
still going to get a skewed distribution
right meaning some people are just gonna
be like why would I risk telling you
this is this private information and so
what what sociologists can do is this
this technique called randomized
Response where I should have brought a coin you
take a coin and you give it to each
person before you survey them right and
you ask them to flip it twice
somewhere that you cannot see so I would
ask each one of you to flip a coin twice
somewhere that I cannot see and then I
would instruct you to if the first coin
flip is a heads answer honestly but if
the first coin flip is a tails
answer yes or no based on the second
coin flip okay so roughly half the time
you'll be honest and the other half the
time you'll be giving me a
perfect 50/50 coin flip and the cool
thing is that what this is actually
doing is taking whatever the true mean
of the distribution is and averaging it
with a 50/50 coin flip right so if say
55 percent of you
answered yes that that you are a serial
killer then I know that the true center
of the distribution is actually 60%
because it was 60% average with a 50/50
coin flip does that make sense however
despite the fact that I can recover the
center of the distribution right given
enough samples each individual person
has plausible deniability if you said
yes it could have been because you
actually are or it could have been
because you just happen to flip a
certain sequence of coin flips okay now
this concept of adding noise to data to
give plausible deniability is sort of the
the secret weapon of differential
privacy right and and the field itself
is a set of mathematical proofs for
trying to do this as efficiently as
possible to give sort of the smallest
amount of noise to get the most accurate
results right with the best possible
privacy protections right there is a
meaningful sort of base trade-off that
you know you can't escape
there's kind of a Pareto trade-off right
and we're trying to push that push that
trade-off down but so the
field of research that is differential
privacy
is looking at how to add noise to data
and resulting queries to give plausible
deniability to the
members of a database or a
training dataset does that make sense
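Editor's sketch: the two-coin randomized response procedure and the de-skewing step can be simulated in a few lines. The 60% true rate, the population size, and the function names are assumptions chosen to mirror the 55% to 60% example in the talk.

```python
import random


def randomized_response(truth, rng):
    """Answer honestly on a first-flip heads, otherwise report the second flip."""
    if rng.random() < 0.5:        # first flip: heads, answer honestly
        return truth
    return rng.random() < 0.5     # first flip: tails, second flip decides


def estimate_true_rate(answers):
    """Undo the averaging with a 50/50 coin flip."""
    observed = sum(answers) / len(answers)
    # observed = 0.5 * true + 0.5 * 0.5, so true = 2 * observed - 0.5
    return 2 * observed - 0.5


rng = random.Random(0)            # fixed seed so the simulation is repeatable
true_rate = 0.6
population = [rng.random() < true_rate for _ in range(100_000)]
answers = [randomized_response(t, rng) for t in population]

print(sum(answers) / len(answers))    # close to 0.55
print(estimate_true_rate(answers))    # close to 0.60
```

No individual answer in `answers` can be trusted on its own, which is the plausible deniability, yet the aggregate rate is still recoverable given enough samples.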
now
a few terms you should be familiar with
so there's local and there's global
differential privacy so local
differential privacy adds noise to data
before it's sent to the statistician so
in this case when with the coin flip
this was local differential privacy it
affords you the best amount of protection
because you never actually reveal sort
of in the clear your information
to someone okay and then there's global
differential privacy which says okay
we're going to put everything in the database
perform a query and then before the
output of the query gets published we're
gonna add a little bit of noise to the
output of the query okay this tends to
have a much better privacy trade-off but
you have to trust the database owner to
not compromise the results okay and
we'll see there's some other things we
can do there but with me so far this is
a good good point for questions if you
had any questions got it so the question
is is this verifiable is any of
this process of differential
privacy verifiable so that is a
fantastic question and one that actually
absolutely comes up in practice so first
local differential privacy the nice
thing is everyone's doing it for
themselves right so in that sense if
you're flipping your own coins and
answering your own questions
that's kind of your verification right
you're kind of trusting yourself for
global differential privacy stay tuned
for the next tool and we'll come back to
that all right so what does this look
like in code so first we have a pointer
to remote private data set we call .get
whoa we get a big fat error right you
just asked to sort of see the raw value
of some private data point which you
cannot do right instead pass .get an
epsilon to add the appropriate amount of
noise so one thing I haven't mentioned
yet differential privacy so I mentioned
sensitivity right so sensitivity was
related to the type of query the type of
function that I wanted to do and its
invariance to removing or replacing
individual entries in the database so
epsilon is a measure of what we call our
privacy budget all right and what our
privacy budget is is saying okay what's
the what's the amount of statistical
uniqueness that I'm going to sort of
limit what's the upper bound for the
amount of statistical uniqueness that I'm
going to allow to come out of this out
of this database and actually I'm going
to take one more sidetrack here
because I think it's really worth
mentioning data anonymization anyone
familiar with data anonymization come
across this term before taking a
document like redacting the the social
security numbers and like all that kind of
stuff by and large it does not work if you
don't remember anything else from this
talk it is very dangerous to do just data
set anonymization okay and differential
privacy in some respects is the
formal version of data anonymization where
instead of instead of just saying okay
I'm just gonna redact out these pieces
and then I'll be fine
this is saying okay that we can do a lot
better so for example Netflix prize
Netflix machine-learning prize if you
remember this a big million-dollar prize
maybe some people in here competed in it
so in this prize right
Netflix published an anonymized data set
right and that was movies and users
right and they took all the movies and
replaced them with numbers and they took
all the users and replaced them with
numbers and then we just had
sparsely-populated movie ratings in this
matrix right
seemingly
anonymous right there's no names of any
kind but the problem is is that each row
is statistically unique meaning it kind
of is its own fingerprint and so two
months after the data set was published
some researchers at UT Austin I think it
was I think it's UT Austin were able to
go and scrape IMDB and basically create
the same matrix in IMDb and then just
compare the two and it turns out people
that were into movie rating were into
movie rating and were
watching movies at similar times and
similar patterns and similar
tastes right and they were able to de-anonymize
this first dataset with high degree of
accuracy it happened again there's a
famous case of medical records for
I think it was a
Massachusetts senator I think it was
someone in the northeast being de-anonymized
through very similar techniques so
one person goes and buys an anonymized
medical data set over here that has you
know birth date and zip code and this
one has zip code and gender and
this one has zip code gender and
whether or not you have cancer right and
and when you get all these together you
can start to sort of use the uniqueness
and each one to relink it all back
together I mean this is so doable
today to the extreme that I
unfortunately know of companies whose
business model is to buy anonymized
datasets de-anonymize them and sell
market intelligence to insurance
companies right but it can be done
okay and and the reason it can be done
is that just because the data set that
you are publishing the one that you are
physically looking at doesn't seem like
it has you know Social Security numbers and
stuff in it doesn't mean that there isn't
enough unique statistical signal for it
to be linked to something else and so
when I say maximum amount of epsilon
epsilon is an upper bound on the
statistical uniqueness that you're
publishing in a data set right and so
what what this tool represents is saying
okay apply however much noise you need
to given whatever computational graph
led back to private data for this tensor
right to ensure that you know to put an
upper bound on the potential for linkage
attacks right now if you set epsilon to zero okay
then that's that's saying effectively
like there's the I'm only going to allow
patterns that have occurred at least
twice okay so meaning meaning two
different people had this pattern and
thus it's not unique to either one yes
so what happens if you perform the query
twice so the random noise would be
re-randomized and sent again and you're
absolutely absolutely correct so this
epsilon this is how much I'm spending
with this query so if I ran this three
times I would spend epsilon of 0.3 does that
make sense so this is a 0.1 query
if I did this multiple times the
epsilons would sum and so for any given
data science project right
what we're advocating is that
you're given an epsilon budget that
you're not allowed to exceed right no
matter how many queries you make
now there's
another sort of subfield of differential
privacy that's looking at sort of
single query approaches which is all
around synthetic data sets so how can I
perform sort of one query against the
whole data set and create a synthetic
data set that has certain invariances
that are desirable right so I can do
good statistics on it but then I can
query this as many times as I want there
basically you can't yeah anyway but we
don't see it at now does that answer
your question
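Editor's sketch: the "get with an epsilon" behavior and the summing of epsilons against a budget can be illustrated with a toy Laplace mechanism. This is not PySyft's implementation; the class name `PrivateValue`, attaching the budget to the value itself, and the example count of 42 are all assumptions for illustration. The noise scale of sensitivity divided by epsilon is the standard Laplace mechanism.

```python
import random


class PrivateValue:
    """Toy stand-in for a pointer to a private query result."""

    def __init__(self, value, sensitivity, budget, rng=None):
        self._value = value
        self._sensitivity = sensitivity
        self.remaining = budget               # epsilon still available to spend
        self._rng = rng or random.Random()

    def get(self, epsilon=None):
        if epsilon is None:
            raise PermissionError("raw private value requested; pass an epsilon")
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so that 0.1 + 0.1 + 0.1 fits a 0.3 budget in floating point.
        if epsilon > self.remaining + 1e-12:
            raise PermissionError("privacy budget exceeded")
        self.remaining -= epsilon             # epsilons of repeated queries add up
        scale = self._sensitivity / epsilon   # Laplace mechanism noise scale
        # The difference of two exponential draws is Laplace(0, scale) noise.
        noise = (self._rng.expovariate(1 / scale)
                 - self._rng.expovariate(1 / scale))
        return self._value + noise


count = PrivateValue(42.0, sensitivity=1.0, budget=0.3, rng=random.Random(0))
for _ in range(3):
    print(count.get(epsilon=0.1))             # three noisy answers, 0.3 spent in total
try:
    count.get(epsilon=0.1)                    # a fourth 0.1 query exceeds the budget
except PermissionError as err:
    print(err)
```

Each repeated query draws fresh noise, so averaging many answers would reveal the true value; the budget check is what stops that.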
cool awesome so now you might think okay
this is like a lost cause like how
can we be answering questions while
protecting while keeping statistical
signal gone but like it's the difference
between it's the difference between if I
have a data set and I want to know what
causes cancer right
I could query data set and learn that
smoking causes cancer without learning
that individuals are or are not smokers
does that make sense
all right and the reason for that is is
that I'm specifically looking for
patterns that are occurring multiple
times across different people and this
actually happens to really closely
mirror the type of generalization that
we want in machine learning systems
anyways does that make sense like as
machine learning practitioners we're
actually not really interested in the
one offs right I mean sometimes our
models memorize things this this happens
right but we're actually more interested
in the things that are the things that
are not specific to you I want I want
the things that are gonna work you know
the heart treatments that are gonna
work for everyone in this room not just
I mean you know obviously if you
need a heart treatment I'd be happy
that'd be cool for you to have one but
what we're chiefly interested in
are things that generalize right which
is why this is realistic and why with
with continued effort on both tooling
and and the theory side we can we can
have a much better reality today
cool so pros just to review so first remote
execution allows data to
remain on the remote machine search and
sampling we can feature engineer using
toy data differential privacy we have a
formal rigorous privacy budgeting
mechanism right now shoot how is the
privacy budget set is it defined by the
user or is it defined by the data set
owner or someone else this is a really
really interesting question actually so
first it's definitely not set by the
data scientist because that would be a
bit of a conflict of interest and at
first you might say it should be the
data owner okay so the hospital right
they're trying to cover their butt right
and make sure that their assets are
protected both legally and commercially
right so they're they're trying to make
money off this so there's there's
there's sort of proper incentives there
but the interesting thing and this gets
back to your question is what happens if
I have say a radiology scan in two
different hospitals right and they both
spend 1 epsilon worth of my privacy in
each of these hospitals right that means
that actually two epsilon of my private
information is out there right and it
just means that one person has to be
clever enough to go to both places to
do the join this is actually the exact
same mechanism we were talking about a
second ago when someone went from
Netflix to IMDb right and so the true
answer of who should be setting epsilon
budgets although logistically it's
gonna be challenging we'll talk about that
a little bit in part two of the talk but
I'm going a little bit slow but okay is
that it should be us it should be people
it should be people around their own
information right you should be setting
your personal epsilon budget does that make
sense that's an aspirational goal we've
got a long way before we can get to that
level of infrastructure around these
kinds of things I'm gonna talk about
that and we can definitely cover it in the question and answer
session as well but I think
in theory that's what we want
okay so two cons two
weaknesses of this approach that we
still have someone asked this
question it was you yeah yeah you asked
the question so first the data is safe
but the model is put at risk and what if
we need to do a join actually actually
yours is a third one which I should
totally add to the slide so so first if
I'm sending my computation my model into
the hospital to learn how to be a better
cancer classifier right my model's put at
risk it's kind of a bummer if like you
know this is a ten million dollar
healthcare model I'm just sending it to
a thousand different hospitals to
learn so that's potentially
risky second what if I need to do a
joint computation across multiple
different data owners who don't trust
each other right who sends whose data to
whom right and thirdly as you pointed
out how do I trust that these
computations are actually happening the
way that I am telling the remote machine
that they should happen
this brings me to my absolute favorite
tool secure multi-party computation come
across this before
raise them high ok cool a little bit
above average most machine learning
people have not heard about this yet and
I absolutely think this
is the coolest thing I've learned about
since learning about like AI machine
learning this is a really
really cool technique encrypted
computation how about homomorphic
encryption have you come across homomorphic
encryption okay a few more yeah this is
related to that
so first the kind of textbook definition
is like this so if you went on Wikipedia
you'd see secure MPC allows multiple
people to combine their private inputs
to compute a function without revealing
their inputs to each other okay but in
the context of machine learning the
implication of this is multiple
different individuals can share
ownership of a number okay share
ownership of a number let me show you what I
mean so let's say I have the number five
my happy smiling face and I split this
into two shares a two and a three
okay I've got two friends Mary Ann and
Bobby and I give them these shares they
are now the shareholders of this number
okay now I'm gonna go away and this
number is shared between them okay and
this this gives us several desirable
properties first its encrypted from the
standpoint that neither Bob nor Mary Ann
can tell what number is encrypted
between them by looking at their own
share by itself
now for those of you who are
familiar with kind of cryptographic math
I'm hand waving over this a little bit
this would typically be so
decryption would be adding the shares
together modulo a large prime so these
would typically look like sort of large
pseudo-random numbers right but for the
sake of making it sort of intuitive I've
picked pseudo-random numbers that are
convenient to the eyes so first these
two values are encrypted and second we
get shared governance meaning that we
cannot decrypt these numbers or do
anything with these numbers unless all
of the shareholders agree okay
but the truly extraordinary part is that
while this number is encrypted between
these individuals we can actually perform
computation right so in this case let's
say we wanted to multiply this
encrypted number times two each
person can multiply their share times
two and now they have an encrypted
number ten right and there's a whole
variety of protocols allowing you to do
different functions such as the
functions needed for machine learning
while numbers are in this encrypted state
okay
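The "multiply each share by two" step works because multiplication by a public constant distributes over the sum of the shares. A minimal sketch, assuming the same additive scheme modulo a prime `Q`; the helper names `mul_public` and `add_shared` are hypothetical.

```python
import secrets

Q = 2**61 - 1  # assumed prime modulus for this sketch

def share(secret, n_parties=2, q=Q):
    shares = [secrets.randbelow(q) for _ in range(n_parties - 1)]
    return shares + [(secret - sum(shares)) % q]

def reconstruct(shares, q=Q):
    return sum(shares) % q

def mul_public(shares, k, q=Q):
    """Each holder multiplies its own share by the public constant k."""
    return [(s * k) % q for s in shares]

def add_shared(a_shares, b_shares, q=Q):
    """Each holder adds its two local shares; no messages are exchanged."""
    return [(a + b) % q for a, b in zip(a_shares, b_shares)]

five = share(5)
ten = mul_public(five, 2)            # still encrypted, now encodes 10
twelve = add_shared(ten, share(2))   # shared 10 plus shared 2
```

Multiplying two shared numbers by each other is harder and needs an extra protocol; this sketch only covers the two operations that are purely local.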
and I'll give some more resources for
you if you're interested in kind of
learning more about this at the end as
well now the big tie-in models and data
sets are just large collections of
numbers which we can individually
encrypt which we can individually share
governance over now specifically to
reference your question there's two
configurations of secure MPC active and
passive security in the active security
model you can tell if anyone does
computation that you did not sort of
independently authorize which is great
so what does this look like in practice
when you go back to the code so in this
case we don't need just one worker it's
not just one Hospital because we're
looking to have shared governance shared
ownership amongst multiple individuals
so let's say we have Bob Alice and Theo
and a crypto provider which we won't go
into now I can take a tensor instead of
calling dot send and sending that tensor
to someone else now I call dot share and
that splits each value into multiple
different shares and distributes those
amongst the shareholders right so in
this case Bob Alice and Theo however in
the frameworks that we're working on you
still get kind of the same PyTorch-like
interface and all the cryptographic
protocol happens under the hood and the
idea here is to make it so that we can
sort of do encrypted machine learning
without you necessarily having to be a
cryptographer right and vice versa
cryptographers can improve the
algorithms and machine learning people can
automatically inherit them all right so
kind of classic sort of open source
machine learning library making complex
intelligence more accessible to people
if that makes sense and what we can do
on tensors we can also do in models so
we can do encrypted training and
encrypted prediction and we're going to
get into what kind of awesome use cases
this opens up in a bit
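To illustrate the idea of a tensor-like interface with the cryptography hidden under the hood, here is a toy stand-in written in plain Python. It is not the API of the framework shown in the talk: the class `SharedVector`, its methods, and the holder names are assumptions made for this sketch, and it only supports addition and multiplication by a public integer.

```python
import secrets

Q = 2**61 - 1  # assumed prime modulus for this sketch

class SharedVector:
    """Hypothetical stand-in for a secret-shared tensor; not a real library class."""

    def __init__(self, shares_by_holder):
        # Maps holder name -> that holder's list of shares (one per element).
        self.shares_by_holder = shares_by_holder

    @classmethod
    def share(cls, values, holders):
        """Split every element into one additive share per holder."""
        table = {h: [] for h in holders}
        for v in values:
            parts = [secrets.randbelow(Q) for _ in holders[:-1]]
            parts.append((v - sum(parts)) % Q)
            for h, p in zip(holders, parts):
                table[h].append(p)
        return cls(table)

    def __add__(self, other):
        # Element-wise addition, done holder by holder on local shares.
        return SharedVector({
            h: [(a + b) % Q for a, b in zip(mine, other.shares_by_holder[h])]
            for h, mine in self.shares_by_holder.items()
        })

    def __mul__(self, k):
        # Multiplication by a public integer, again purely local.
        return SharedVector({
            h: [(a * k) % Q for a in mine]
            for h, mine in self.shares_by_holder.items()
        })

    def get(self):
        """Reconstruct the plaintext; in a real system every holder must consent."""
        columns = zip(*self.shares_by_holder.values())
        return [sum(col) % Q for col in columns]

holders = ["bob", "alice", "theo"]
x = SharedVector.share([1, 2, 3], holders)
y = SharedVector.share([10, 20, 30], holders)
result = (x + y) * 2   # reads like ordinary vector code
```

The caller writes `(x + y) * 2` as usual, while each holder only ever touches its own random-looking shares.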
and this is a nice set of features right
in my opinion this is this is sort of
the MVP of doing privacy preserving data
science right the idea being that I
could have remote access to a remote
data set I can learn high-level latent
patterns like like you know what causes
cancer without learning whether
individuals have cancer I can pull back
just just that sort of high-level
information with formal mathematical
guarantees over over you know what sort
of the filter that's that's coming back
through here right and I can work with
datasets from multiple different data
owners while making sure that each
individual data owner is protected now
what's the catch okay so first is
computational complexity right so
encrypted computation secure MPC this
this involves sending lots of
information over over the network I
think the state of the art for
training or for deep learning prediction
is that this is a 13x slowdown over
plain text which is inconvenient but not
deadly right but you do have to
understand that that assumes like it's
like two AWS machines that are talking to
each other you know they're relatively
fast but we also haven't had any like
hardware optimization to the extent that
you know Nvidia did a lot for deep
learning like there'll be you know
probably like some sort of Cisco player
that's similar for doing kind of
encrypted or secure MPC based deep
learning right let's see so this brings
back to kind of the fundamental question
is it possible to answer questions using
data we cannot see the theory is
absolutely there I think that's that's
something that I feel reasonably
confident saying like the sort of
theoretical frameworks that we have
and actually the other thing that's
really worth mentioning here is that
these come from totally different fields
which is why they kind of haven't been
necessarily combined that much yet I'll
get I'll get more into that in a second
but it's my hope that that by sort of by
considering what these tools can do
that'll open up your eyes to the
potential that in general we can have
this new ability to answer questions
using information that we don't actually
own ourselves because from a
sociological standpoint that's net new
for like us as a species if that makes
sense previously we had to have
we had to have like a trusted third
party who would then take all the
information in themselves
and make some sort of neutral decision
right so we'll come to that in a second
and so one of the big sort of long-term
goals of our community is to make
infrastructure for this secure enough
and robust enough and of course in like
a free Apache 2 open-source license
kind of way that you know information on
the world's most important problems will
be this accessible right and we can
spend sort of less time working on tasks
like that and more time on tasks like this
so this is gonna be kind of the breaking
point between sort of part 1 and part 2
part 2 will be a bit shorter but if
you're interested in sort of diving
deeper on the technicals of this here's
a six or seven hour course that I taught
just on these concepts and the tools
it's free on Udacity feel free to
check it out so the question was he's
asking about how a model can be
encrypted during training is that the same
as homomorphic encryption or is that
something else so a couple years ago
there was a big burst in literature
around training on encrypted data where
you would homomorphically encrypt a data
set and it turned out that some of the
statistical regularities of homomorphic
encryption allowed you to actually train
on that data set without without
decrypting it so this is similar to that
except the one downside to that is that
in order to use that model in the future
you have to still be able to encrypt
data with the same key which often is
sort of constraining in practice and
also there's a pretty big hit to privacy
because you're training on data that
inherently has a lot of noise added to
it what I'm advocating for here is
instead we actually encrypt both the
model and the data set during training
but inside the encryption inside the box
right it's actually performing the same
computations that it would be doing in
plaintext so you don't get any
degradation in accuracy and you don't
get tied to one particular
public/private key pair yeah yeah so
specifically so the question was kind of
comment on federated learning
specifically Google's implementation so
I think Google's implementation is is
great so obviously the the fact that
they've shown that this can be done with
hundreds of millions of users is
incredibly powerful I mean even
inventing the term and creating momentum
in that direction I think that there's
one thing that's worth mentioning is
that there are two forms of federated
learning one is sort of the one where
your model is a federated learning sorry
I've got to talk about what that is okay
yes I'll do that quickly so federated
learning is basically the first thing I
talked about so remote execution so if
everyone has a smartphone when you plug
your phone in at night if you've got you
know Android or iOS you plug your
phone in at night and attach to the Wi-Fi you
know when you text and it recommends the
next word next word prediction that model is
trained using federated learning meaning
that it learns on your device to do that
better and then that model gets uploaded
to the cloud as opposed to uploading all
of your tweets to the cloud and training
one global model does that make sense so
so plug in your phone at night model comes
down trains locally goes back up it's
federated right that's that's that's
basically federated learning in a nutshell
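The round trip described here (model comes down, trains locally, goes back up) can be sketched as a toy simulation. This is a minimal illustration assuming a one-parameter linear model and plain weight averaging; the function names and the per-device data are hypothetical and this is not Google's implementation.

```python
def local_train(w, data, lr=0.01, steps=20):
    """Run a few gradient-descent steps on one device's private (x, y) pairs."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(global_w, devices):
    """Send the model down, train on each device, average what comes back."""
    local_weights = [local_train(global_w, data) for data in devices]
    return sum(local_weights) / len(local_weights)

# Hypothetical private datasets, all drawn from y = 3x.
# Only weights cross the device boundary; the (x, y) pairs stay put.
devices = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5)],
]

w = 0.0
for _ in range(50):
    w = federated_round(w, devices)
```

After enough rounds the shared weight approaches 3 even though the server never reads a single training example.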
and and it was pioneered by the cork
team at Google and and they do
really fantastic work they've
they've paid down a lot of the technical
debt a lot of the the risk or technical
risk around it and they publish really
great papers outlining sort of how they
do it which is fantastic what I outlined
here is actually a slightly different
style of federated learning because there
there's federated learning with like a
fixed data set and a fixed model and
lots of users where the data is very
ephemeral like phones are constantly
logging in and logging off you know
you're plugging your phone
in at night and then you're taking it out
right this is sort of the the one style
of federated learning that's it's really
useful for like product development
right so it's useful for like if you
want to do a smartphone app that has a
piece of intelligence in it but training
that intelligence is going to be
prohibitively difficult for you to get
access to the data for or you want to
just have a value prop of protecting
privacy right that's what that
style of federated learning is
good for what I've outlined here is a
bit more exploratory federated learning
where it's saying okay instead of
instead of the model being hosted in the
cloud and data owners showing up and
making it a bit smarter every once in a
while now the data is going to be hosted
at a variety of different private clouds
right and data scientists are gonna show
up and say mmm I want to do something
with diabetes today mmm I want to
do something with studying dementia
today something like that right this is
much more difficult
because the attack vectors for this are
much larger right I'm trying to be able
to answer arbitrary questions about
arbitrary data sets in a protected
environment right so I think yeah that's
that's kind of my general thoughts does
federated learning leak information
so federated learning by itself is not a
secure protocol right to the extent that
and that's why I sort of this ensemble
of techniques that I've so the question
was does federated learning leak
information so it is perfectly possible
for a federated learning model to simply
memorize the data set and then spit that
back out later you have to combine it
with something like differential privacy
in order to be able to prevent that from
happening does that make sense so just
because the training is happening on a
device does not mean it's not memorizing
my data does that make sense
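One common way to combine federated updates with differential privacy is to clip each device's update and add noise before the server uses the average. The sketch below assumes that recipe; `max_norm` and `noise_std` are illustrative parameters that are not calibrated to any particular epsilon, and a real system would also need a privacy accountant.

```python
import math
import random

def clip(update, max_norm):
    """Scale the update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [u * scale for u in update]

def private_average(updates, max_norm=1.0, noise_std=0.1, rng=random):
    """Average clipped updates, then add Gaussian noise to each coordinate."""
    clipped = [clip(u, max_norm) for u in updates]
    n = len(clipped)
    dim = len(clipped[0])
    mean = [sum(u[i] for u in clipped) / n for i in range(dim)]
    # Clipping bounds how much any single device can move the average,
    # and the noise hides whatever influence remains.
    return [m + rng.gauss(0.0, noise_std * max_norm / n) for m in mean]
```

Clipping limits how much one person's data can be memorized through a single update, which is the failure mode the answer above warns about.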
okay so now I want to zoom out and go a
little less from the kind of a data
science practitioner perspective and now
take more the perspective of like an
economist or scientist or someone
looking kind of globally at like okay
what if this becomes mature what happens
alright and this is where it gets really
exciting anyone entrepreneurial anyone
everyone I know okay cool well this is
this is the this is the part for you so
the big difference is this ability to
answer questions using data you can't
see because as it turns out most people
spend a great deal of their life just
answering questions and a lot of it is
involving sort of personal data I mean
whether it's minute things like you know
where's my water where are my keys or
you know what movie should i watch
tonight or or you know what kind of diet
should I have to be able to sleep well
right I mean a wide variety of different
questions right and we're limited
in our answering ability by the
information that we have right so this
ability to answer questions using data we
don't have sociologically I think is
quite quite important and there's four
different areas that I want to highlight
as like big groups of use cases for this
kind of technology to help kind of
inspire you to see where this
infrastructure can go and actually
before I jump into that has
anyone been to Edinburgh Edinburgh cool
did you tour like the castle and stuff
like that
so my wife and I we went to
Edinburgh for the first time six months
ago
September September and we did the
underground
was it the we did a ghost tour yeah
yeah we did the ghost tour and it was
really cool there was something that I took
away from it there was this point we
were standing we just walked out of the
tunnels and she was pointing out some of
the architecture and then she started
talking about basically the cobblestone
streets and why the cobblestone streets
were there cobblestone streets one of
the main purposes of them was to sort of
lift you out of the muck and the reason
there was muck there is that they
didn't have any internal plumbing and so
the sewage just poured out into the
street right if you live in a big city
and this was the norm everywhere right
and actually I think she even sort of
implied that like the invention or
popularization of the umbrella had less
to do with actual rain and a bit more to do
with buckets of stuff coming down
from on high which is it's a whole
different world like when you think
about what that is but the reason that I
bring this up is that you know however
many hundred years ago people were were
walking through you know like sludge
sewage was just everywhere right it was
all over the place and people were
walking through it everywhere they go
and they were wondering why they got
sick right and in many cases and it
wasn't because they wanted it to be that
way it's just because it was a natural
consequence of the technology they had
at the time right this is not malice
this is not anyone being good or bad or
or evil or whatever it's just it's just
the way things were and I think that
there's a strong analogy to be made with
with kind of how our data is handled as
a society at the moment right we've just
sort of walked into a society we've had
new inventions come up and new things
that are practical new uses for it and
now everywhere we go we're constantly
spreading and spewing our data all over
the place right I mean every every
camera that sees me walking down the
street you know goodness there's a
there's a company that takes a whole picture
of the earth by satellite every day like
how the hell am I supposed to do
anything without you know
everyone following me around all the time
right and I imagine that whoever it was
I'm not a historian so I don't really
know but whoever it was that said what
if what if we ran plumbing from every
single apartment business school maybe
even some public toilets underground
under our city all to one location and
then processed it used chemical
treatments and then turn that into
usable drinking water like how laughable
would that have been just the
most massive logistical infrastructure
problem ever to take a working city dig
up the whole thing to take already
already constructed buildings and run
pipes through all of them I mean so so
Oxford gosh I there's a building there
that's so old they don't have showers
because they didn't want to run the
plumbing for it you have to ladle
water over yourself it's in Merton
College it's quite quite famous right I
mean the infrastructure anyway the
infrastructure challenge it just must
have seemed absolutely massive and so as
I'm about to walk through kind of like
four broad areas where things could be
different theoretically based on this
technology and I think it's probably
going to hit you like whoa that's a lot
of code or like whoa that's that's a lot
of change but but I think that the need
is sufficiently great I think that that
I mean if you view our lives it's just
one long process of answering important
questions whether it's where we're going
to get food or what causes cancer like
making sure that the right people can
answer questions without without you
know data just getting spewed everywhere
so that the wrong people can answer
their questions right is important and
yeah anyway so I know this is gonna
sound like there's a certain
ridiculousness to maybe what some of
this will be but I hope that that you at
least see that that theoretically like
that the basic blocks are there and and
that really what stands between us and a
world that's fundamentally different is
is adoption maturing of the technology
and good engineering because I think you
know
once they know that Sir Thomas Crapper
invented the toilet right I do remember
that one at that point that the basics
were there right and and what stood
between them was was implementation
adoption and engineering right and I
think that's that's where we are and the
best part is we have you know companies
like Google that have already already
paved the way with some very very large
rollouts of of the early piece of this
technology all right cool
so what about what are the big
categories one I've already talked
about open data for science ok so this
one is a really big deal and the reason
it's a really big deal is mostly because
everyone gets excited about making AI
progress right
everyone gets super excited about
superhuman ability in X Y or Z when I
started my PhD at Oxford I worked for my
professor's name is Phil Blunsom the
first thing he told me when I sat my
butt down in his office on my first day
is this dude he said Andrew everyone's
going to work on models but if you look
historically the biggest jumps in
progress have happened when we had new
big datasets or the ability to process
new big datasets and just to give a few
anecdotes ImageNet right ImageNet GPUs
allowing us to process large datasets
even things like AlphaGo this is
synthetically generated infinite
datasets or if you don't know did
anyone watch the AlphaStar
livestream on YouTube
they talked about how it had trained on
like 200 years of StarCraft
right well if you look at Watson
playing Jeopardy right this was
on the heels of a new large structured
data set based on Wikipedia or if you
look at Garry Kasparov and IBM's Deep
Blue this was on the heels of the
largest open data set of chess matches
having been published online right
there's this there's this echo where
like big new data set big big new
breakthrough big new data set big new
breakthrough right and what we're
talking about here is
is potentially several orders of
magnitude more data relatively quickly
and the reason for that is

Summary


***

# The Future of Privacy-Respecting AI: Understanding PySyft, Differential Privacy, and Secure Multi-Party Computation

### Executive Summary
This video covers Andrew Trask's presentation on **Privacy-Preserving AI**, an approach that lets data scientists answer complex questions and train models without ever seeing the sensitive raw data. Trask explains how technologies such as *Remote Execution*, *Differential Privacy*, and *Secure Multi-Party Computation (SMPC)* can overcome legal and ethical barriers to accessing personal data, and how they could democratize data access for medical research and improve recommender systems for people's benefit.

---

### Key Takeaways
*   **Remote Execution with PySyft:** Computation runs where the data lives (a hospital or company server) without sending raw data to the data scientist.
*   **Differential Privacy (DP):** A technique that adds *noise* to protect individual identities, with a "Privacy Budget" (Epsilon) managed to balance accuracy against privacy.
*   **Secure Multi-Party Computation (SMPC):** An encryption method in which data is split into *shares* held by different parties, so computation can run on encrypted data without any single party seeing the complete data.
*   **Structured Transparency:** A new concept that combines input encryption with output privacy to create auditable, secure services (for example, a medical diagnosis without the doctor seeing the patient's file).
*   **Social Impact:** These technologies could help solve the "data silo" problem in healthcare and produce more humane recommender systems (focused on well-being, not just *engagement*).

---

### Detailed Breakdown

#### 1. Introduction: The Data Access Problem in AI
Andrew Trask (author of *Grokking Deep Learning* and leader of OpenMined) opens with a central problem in modern Machine Learning (ML): **the gap between data that is easy to access and data that matters.**
*   **Public vs. Private Data:** The ML community often focuses on public data (such as MNIST handwriting) because it is easy to download. Data that is critical for health (such as tumors or dementia) is private, hard to access, and bound by regulation.
*   **Goal:** Make private datasets as quick to access as public ones, as if we could `pip install` hospital data.
*   **Initial Solution - PySyft & Remote Execution:**
    *   PySyft extends DL frameworks such as PyTorch by adding a "Worker" primitive.
    *   The idea is to send the model to the data (`.send()`) rather than download the data. The data scientist receives a "pointer" for performing remote operations and can retrieve the final result (`.get()`) only with permission.

#### 2. Data Discovery and Differential Privacy
Doing *data science* without seeing the data is an iterative process. Supporting it requires tools for searching and understanding data without violating privacy.
*   **PyGrid:** An open-source platform that connects clients (data scientists) to many scattered datasets (for example, across different hospitals).
*   **Search Capability:** Users can run remote searches to get schema descriptions, data types, value ranges, and sample data without accessing the full dataset.
*   **Differential Privacy (DP):**
    *   **Concept:** Add random *noise* to the data or to query results so that the output cannot reveal whether a particular individual's data is in the dataset.
    *   **Local vs. Global:** Local DP (noise added on the user's device) is safer but hard to verify. Global DP (noise added at the database server) is more efficient but requires trusting the database owner.
    *   **The Anonymization Threat:** Removing names (PII) is not enough. The *Netflix Prize* and census data examples show that "anonymized" data can still be re-identified (*de-anonymized*) with *fingerprinting* techniques.
    *   **Epsilon (Privacy Budget):** An upper bound on the statistical information leakage that is allowed. Epsilon accumulates with every query. Once the budget is used up, access is closed.
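
The budget accounting in the last bullet can be sketched as follows. This is a minimal illustration assuming counting queries (sensitivity 1), the Laplace mechanism, and simple additive composition of epsilon; the class `PrivateCounter` and its method names are hypothetical and do not correspond to any library's API.

```python
import random

class PrivateCounter:
    """Hypothetical query interface that charges epsilon for every answer."""

    def __init__(self, records, total_epsilon):
        self._records = records
        self.remaining = total_epsilon

    def noisy_count(self, predicate, epsilon):
        """Answer a counting query with Laplace noise of scale 1/epsilon."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise PermissionError("privacy budget exhausted")
        self.remaining -= epsilon
        true_count = sum(1 for r in self._records if predicate(r))
        # The difference of two Exp(epsilon) draws is Laplace with scale 1/epsilon.
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise
```

A smaller epsilon per query means more noise in each answer but more queries before the budget runs out.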

#### 3. Secure Multi-Party Computation (SMPC)
The earlier approach (sending the model to the data) carries risks, such as an expensive model being stolen, and makes joint computation hard between parties that do not trust each other.
*   **Definition of SMPC:** Multiple parties combine their private inputs to compute a function without revealing those inputs to one another.
*   **Mechanism:** A value (say the number 5) is split into *shares* (say 2 and 3) that are distributed to different holders. No single holder knows the original value.
*   **Encrypted Computation:** Mathematical operations (multiplication, addition) can be performed on these *shares* while they remain encrypted.
*   **Limitation:** The technique is computationally expensive. *Deep learning* training or prediction with SMPC can run about 13x slower than plaintext.
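
Part of the cost comes from multiplying two shared values, which cannot be done locally and needs a round of communication. The sketch below uses Beaver triples, a standard technique that the talk does not name and that is assumed here for illustration; the modulus `Q` and all function names are likewise assumptions, and the crypto provider is simulated in-process.

```python
import secrets

Q = 2**61 - 1  # assumed prime modulus for this sketch

def share(secret, n=2):
    parts = [secrets.randbelow(Q) for _ in range(n - 1)]
    return parts + [(secret - sum(parts)) % Q]

def reconstruct(shares):
    return sum(shares) % Q

def beaver_triple(n=2):
    """Crypto provider: hand out shares of random a, b and of c = a * b."""
    a, b = secrets.randbelow(Q), secrets.randbelow(Q)
    return share(a, n), share(b, n), share((a * b) % Q, n)

def multiply(x_sh, y_sh):
    """Multiply two shared values; returns shares of the product."""
    a_sh, b_sh, c_sh = beaver_triple(len(x_sh))
    # Communication round: holders publish masked differences, which reveal
    # nothing about x or y because a and b are uniformly random.
    d = reconstruct([(x - a) % Q for x, a in zip(x_sh, a_sh)])
    e = reconstruct([(y - b) % Q for y, b in zip(y_sh, b_sh)])
    z_sh = [(c + d * b + e * a) % Q for a, b, c in zip(a_sh, b_sh, c_sh)]
    z_sh[0] = (z_sh[0] + d * e) % Q  # the public d*e term is added once
    return z_sh

product = reconstruct(multiply(share(6), share(7)))
```

Every multiplication in a network's forward pass needs a fresh triple and an exchange of masked values, which is one reason network traffic dominates the slowdown.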

#### 4. Comparing the Technologies and Federated Learning
*   **SMPC vs Homomorphic Encryption:** *Homomorphic Encryption* allows computation on encrypted data but often lowers accuracy because of added noise. The SMPC approach Trask advocates performs the *plaintext* computation inside an encrypted "box", preserving model accuracy.
*   **Federated Learning (FL):**
    *   **Google FL:** Focused on consumer products (phones), where users are ephemeral and the model is trained on edge devices.
    *   **Exploratory FL:** An approach for data held in private clouds (medical/corporate). It is much harder because the attack surface is much larger when arbitrary questions are asked of arbitrary datasets.

#### 5. Social Impact and "Structured Transparency"
Trask likens today's society to a city without plumbing: data is spilled everywhere (cameras, satellites), and we need infrastructure to manage it transparently and securely.

## Conclusion & Closing Message
Andrew Trask shows that *Privacy-Preserving AI* technologies such as PySyft, *Differential Privacy*, and SMPC offer a practical answer to the dilemma of accessing sensitive data without sacrificing individual privacy. By shifting the paradigm from moving data to moving models, these tools can break down *data silos* in critical sectors such as healthcare while supporting security and regulatory compliance. Ultimately, applying *Structured Transparency* is key to a future in which using data for human benefit goes hand in hand with strict ethical standards.


file updated 2026-02-13 13:23:14 UTC