Transcript
W7wJDJ56c88 • DeepMind solves protein folding | AlphaFold 2
Kind: captions
Language: en
i think it's fair to say that this year
2020 has thrown
quite a few challenges at human
civilization so it's really nice to get
some positive news about the truly
marvelous accomplishments of engineering
and science
one was spacex i would argue launching a
new era
of space exploration and now a couple of
days ago
deepmind has announced that its second
iteration of the alpha fold system
has quote unquote solved the 50 year old
grand challenge problem of protein
folding
solved here means that these
computational methods
were able to achieve prediction
performance
similar to much slower much more
expensive experimental methods
like x-ray crystallography in 2018 which
is the previous iteration of the casp
competition alpha fold achieved
a score of 58 on the hardest class of
proteins
and this year it achieved a score of 87
which is a huge improvement
and it's still 26 points better than the
closest competition
so this is definitely a big leap but
it's also fair to say
that the internet is full of hype about
this breakthrough
so let me indulge in the fun a bit some
of it is definitely a little bit
subjective
but i think the case could be made on
the life science side that this is the
biggest advancement in structural
biology of the past
one or two decades and in my field of
artificial intelligence
i think a strong case could be made that
this is one of the biggest advancements
in recent history of the field
so of course the competition is pretty
steep and i talk with excitement
about each of these entries of course
the imagenet moment itself or the
alexnet moment that launched a deep
learning revolution
in the space of computer vision so many
people are comparing now
this breakthrough of alpha fold 2
to the imagenet moment but now in the
life sciences field i think the good old
argument
over beers about uh which is the biggest
breakthrough comes down to
the importance you place on how much
real world
direct impact a breakthrough has of
course alexnet
was ultimately on a toy data set of a very
simplistic image classification problem
which does not have a direct application
to the real world
but it did demonstrate the ability of
deep neural networks to
learn from a large amount of data in a
supervised way but anyway this is uh
probably a very long conversation over
many beers of uh alpha zero with
reinforcement learning self-play
obviously in contention for the biggest
breakthrough the recent breakthroughs in
the application of transformers in the
natural language processing space
with gpt-3 being the most kind of recent
iteration of state-of-the-art
performance the actual deployment of
robots in the field
used by real humans which is tesla
autopilot you know
deployment of massive fleet learning of
massive machine learning
in safety critical systems and then
other kinds of robots
like the google self-driving car waymo
systems
that are taking a further leap of
removing the human from the picture
being able to drive the car autonomously
without human supervision
smart speakers in the home there's a lot
of actual in the wild
natural language processing that i think
doesn't get enough credit from the
artificial intelligence community how
much amazing stuff is there and
depending on how much value you put
in engineering achievements especially
in the hardware space boston dynamics
with
spot mini spot robot is just
one could argue is one of the great
accomplishments in the artificial
intelligence field
especially when you maybe look 20 and 50
years down the line
when the entire world is populated by
robot dogs and the humans have gone
extinct
anyway i say all that for fun but really
this is one of the big breakthroughs in
our field
and something to truly be excited about
and i'll talk about some of the possible
future impact i see here from this
breakthrough
in just a couple of slides here anyway
my prediction is that
there will be at least one potentially
several nobel prizes that will
result from derivative work launched
directly with these computational
methods
it's kind of exciting to think that it's
possible also that
we'll see a first nobel prize that is
awarded
where much of the work is done by a
machine learning system
of course the nobel prize is awarded to
the humans behind the system but it's
exciting to think that a computational
approach or machine learning system
will play a big role in a nobel prize
level discovery in a field like
medicine and physiology or
chemistry or even physics okay
let's talk a bit about proteins and
protein folding why this whole space is
really fascinating
first of all there's uh amino acids
which are the
basic building blocks of life in
eukaryotes which is what we're talking
about here with humans
there's 21 of them proteins are chains
of amino acids and are the
workhorses of living organisms of cells
and they do all kinds of stuff from
structural to functional they serve as
catalysts for chemical reactions
they move stuff around they do all kinds
of things so they're both the building
blocks of life
and the doers and movers of life
hopefully i'm not being too poetic so
protein folding
is the fascinating process of going from
the
amino acid sequence to a 3d structure
there's a lot that could be said here
there's a lot of lectures on this topic
but let me quickly say some of the more
fascinating and important things
that i remember from a few biology
classes i took in high school and college
okay
so first is there's a fascinating
property of uniqueness
that a particular sequence usually maps
one to one to a 3d structure
not always but usually to me from an
outsider's perspective that's just
weird and fascinating the other thing to
say is that the 3d structure determines
the function of the protein
so one of the corollaries of that is
that the underlying cause of many
diseases is the misfolding of proteins
now back to the weirdness of the
uniqueness of the folding there's
a lot of ways for a protein to fold
based on the sequence of amino acids
there's i think 10 to the power of 80
atoms in the universe so 10 to the power
143 ways to fold is uh
a lot and you can look at levinthal's
paradox which is one of the early
formulations of
just how hard this problem is and why
it's really weird that a protein is able
to do it so quickly
as a completely irrelevant side note i
wonder how many uh
possible chess games there are i
think i remember it being 10 to the
power of 100
something like that i think that would
also necessitate removing certain kinds
of infinite games
anyway off the top of my head i would
venture to say that the protein folding
problem
just in the number of possible
combinations
is much much harder than the game of
chess
but it's also much weirder you know they
say that life imitates chess
but uh i think that uh from a biological
perspective life is way weirder than
chess
anyway to say once again what i said
before is that the misfolding of
proteins
is the underlying cause of many diseases
and again i'll talk about the
implications of that a little bit later
from a computational from a machine
learning from a dataset perspective
what we're looking at currently is 200
million proteins that have been mapped
and 170,000
protein 3d structures so much much fewer
and that's our training data
for the learning based approaches for
the protein folding problem
now the way those 3d structures were
determined is through
experimental methods one of the most
accurate being x-ray crystallography
for which i saw a university of toronto study
showing that it costs
about 120,000 per protein it takes about
one year
to determine the 3d structure so because
it costs a lot
it's very slow that's why you only have
170,000
3d structures determined now that's one
of the big
things that the alpha fold 2 system
might be able to provide is at least for
a large class of proteins
be able to determine the 3d structure
with a high accuracy
enough to be able to sort of open up the
structural biology field
entirely with sort of several orders of
magnitude more
protein 3d structures to play with
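
The dataset numbers quoted above can be combined in a few lines of arithmetic. This is a rough sketch only: it assumes every one of the determined structures cost the quoted per-protein figure, which is a simplification, and the currency is left unstated because the transcript does not give one.

```python
# Figures as quoted in the talk.
KNOWN_SEQUENCES = 200_000_000   # proteins that have been mapped
KNOWN_STRUCTURES = 170_000      # experimentally determined 3D structures
COST_PER_STRUCTURE = 120_000    # quoted cost per protein (currency not stated)

# Fraction of mapped proteins that have a determined structure.
coverage = KNOWN_STRUCTURES / KNOWN_SEQUENCES

# Assumption: every structure cost the quoted figure.
total_cost = KNOWN_STRUCTURES * COST_PER_STRUCTURE

# How many mapped proteins exist for each determined structure.
sequences_per_structure = KNOWN_SEQUENCES / KNOWN_STRUCTURES

print(f"structure coverage: {coverage:.3%}")
print(f"implied total experimental cost: {total_cost:,}")
print(f"mapped proteins per determined structure: {sequences_per_structure:,.0f}")
```

Under these assumptions less than a tenth of one percent of mapped proteins have a determined structure, which is the gap a fast computational method could help close.
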
there's not currently a paper out that
describes the details of the alpha fold
two system
but i think it's clear that it's heavily
based on the alpha fold one system
from two years ago so i think it's
useful to look at how that system works
and then we can hypothesize speculate
about the kind of
methodological improvements in the alpha
fold two system
okay so for alpha fold one system
there's two steps in the process
the first includes machine learning the
second does not the first step
includes a convolutional neural network
that takes as its input
the amino acid residue sequences plus a
ton of different features that their
paper describes
including the multiple sequence
alignment of evolutionarily related
sequences
and the output of the network is this
distance matrix with the rows and
columns being the amino acid residues
and each entry giving a confidence distribution
of
the distance between the two amino acids
in the final geometric
3d structure of the protein then once
you have the distance matrix then you
have
a non-learning based gradient descent
optimization
of folding this 3d structure to figure
out
how you can as closely as possible match
the distances between the amino acid
residues
that are specified by the distance
matrix
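
The two-step process described above can be sketched on a toy example. This is not DeepMind's code and it simplifies heavily: the "predicted" distances come from made-up coordinates (`toy_chain`, a hypothetical six-residue chain) instead of a neural network, each pair gets a single target distance instead of a confidence distribution, and plain gradient descent moves Cartesian coordinates directly.

```python
import math
import random


def distance_matrix(coords):
    """Pairwise Euclidean distances; rows and columns are residues."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def stress(coords, target):
    """Sum of squared mismatches between current and target distances."""
    n = len(coords)
    return sum(
        (math.dist(coords[i], coords[j]) - target[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


def fit_coordinates(target, steps=2000, lr=0.01, seed=0):
    """Non-learning step: gradient descent on 3D coordinates to match target distances."""
    rng = random.Random(seed)
    n = len(target)
    coords = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n)]
    start = [row[:] for row in coords]  # keep the random starting point for comparison
    for _ in range(steps):
        grads = [[0.0, 0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                d = max(math.dist(coords[i], coords[j]), 1e-9)
                # d/dx_i of (d - t)^2 is 2 * (d - t) * (x_i - x_j) / d
                scale = 2.0 * (d - target[i][j]) / d
                for k in range(3):
                    g = scale * (coords[i][k] - coords[j][k])
                    grads[i][k] += g
                    grads[j][k] -= g
        for i in range(n):
            for k in range(3):
                coords[i][k] -= lr * grads[i][k]
    return start, coords


# Hypothetical toy chain standing in for a protein backbone.
toy_chain = [(math.cos(t), math.sin(t), 0.5 * t) for t in range(6)]
target = distance_matrix(toy_chain)  # stands in for the network's output

start, fitted = fit_coordinates(target)
initial_stress = stress(start, target)
final_stress = stress(fitted, target)
print(f"distance mismatch before: {initial_stress:.3f}, after: {final_stress:.6f}")
```

Distances alone only pin down a structure up to rotation, translation, and mirror image, so the fitted coordinates are compared through the distance mismatch and not coordinate by coordinate.
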
okay that's it at a high level now how
does alpha fold two work
first of all we don't know for sure
there's only a blog post and some
little speculation here and there but
one thing is clear
that there's attentional mechanisms so i
think convolutional neural networks are
out
and transformers are in the same kind of
process that's been
happening in the natural language
processing space and really most of the
deep learning space
it's clear that attention mechanisms are
going to be taking over
every aspect of machine learning
so i think the big change is convnet is
out transformers are in
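
For readers who have not seen attention written out, here is a minimal sketch of generic scaled dot-product self-attention, the core operation of transformers. It is not the alpha fold 2 architecture, whose details are not public in this talk. The embeddings in `residue_embeddings` are made up, and the learned query, key, and value projections of a real transformer are left out as a simplifying assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Every position attends to every other position, however far apart in the sequence."""
    dim = len(embeddings[0])
    weights, outputs = [], []
    for query in embeddings:
        # Scaled dot product between this position and all positions.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted average of all embeddings.
        outputs.append(
            [sum(w * value[c] for w, value in zip(row, embeddings)) for c in range(dim)]
        )
    return weights, outputs


# Hypothetical 2-dimensional embeddings for four residues.
residue_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
weights, outputs = self_attention(residue_embeddings)
for i, row in enumerate(weights):
    print(i, [round(w, 2) for w in row])
```

Unlike a convolution, which only mixes nearby positions, each row of `weights` spans the whole sequence, so residues that are far apart in the chain can still influence each other directly.
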
the rest is more in the speculation
space
it does seem that the msa the multiple
sequence alignment
is part of the learning process now as
opposed to part of the feature
engineering which it was in the original
step i believe it was only a source of
features please correct me if i'm wrong
on that
but it does seem like here it's now part
of the learning process
and there's something iterative about it
at least in the blog post
where there's a constant passing of
learned information between the
sequence residue representation which is
the
evolutionarily related sequence side of
things and then the amino acid residue
to residue distances
that are more akin to the alpha fold one
system
how that iterative process works it's
unclear whether it's
part of one giant neural network or
whether several neural networks are involved
i don't know but it does seem that the
evolutionarily related sequences are now
part of the learning process
it does seem that there's some kind of
iterative passing of information and of
course
attention being involved in the entire
picture now
at least in the blog post the term
spatial graph is used
as opposed to sort of a distance matrix
or adjacency matrix so
i don't know if there's some magical
tricks involved in uh some
interesting generalization of an
adjacency matrix that's involved in a
spatial
graph representation or if it's simply
just using the term spatial graph
because there is uh
more than just pairwise distances
involved
in this version of the learning
architecture i think the two lessons of
the recent history of deep learning if
you involve attention if you involve
transformers you're going to get a big
boost and the other lesson is that if
you make as much of the problem
learnable as possible
you're often going to see quite
significant benefits
this is something i've definitely seen
in computer vision especially the
semantic segmentation side of things
okay why is this breakthrough important
allow this
computer scientist ai person to wax
poetic about some biology for a bit
so because the protein structure gives
us the protein
function figuring out the structure for
maybe millions of proteins might allow
us to learn
unknown functions of genes encoded in
dna also as i mentioned before
it might allow us to understand the
cause of many diseases that are
the result of misfolded proteins other
applications will stem from
the ability to quickly design new
proteins that in some way alter the
function of other proteins
so for treatments for drugs that means
designing proteins that fix
other misfolded proteins again those are
the causes of many diseases
i read a paper that was talking about
agriculture applications
of being able to engineer insecticidal
proteins or frost protective coating
stuff i know nothing about i read it
it's out there
tissue regeneration through
self-assembling proteins
supplements for improved health and
anti-aging
and all kinds of bio materials for
textiles and just materials in general
now in the long term or the super long
term future
impact of this breakthrough might be
just the advancement of end-to-end
learning
of really complicated problems in the
life sciences
so protein folding is looking at the
folding of a single protein
so being able to predict multi-protein
interaction
or protein complex formation which even
in my limited knowledge of biology i
think is a much much much harder problem
as far as i understand and just being
able to incorporate the environment
into the modeling of the folding of the
protein
and also seeing how the function of that
protein might change given the
environment
all those kinds of things incorporating
that into the end to end
learning problem then taking a step even
further
is this is physics biophysics so
being able to accurately do
physics-based simulation
of biological systems so if we think of
a protein as one of the most basic
biological systems so then taking a step
out further and further in increasing
the complexity of the biological systems
you can start to think
of something crazy like being able to do
accurate
physics-based simulation of cells for
example or
entire organs or maybe one day being
able to do an
accurate physics-based simulation of the
very over-caffeinated
organ that's producing this very video
in fact how do we know this is not
a physics-based simulation of a
biological system
whose assigned name happens to be lex i
guess we'll never know
and of course we can go farther out into
super long-term sci-fi kind of ideas
of uh biological life and artificial
life which are fascinating ideas of
being able to play with simulation and
prediction
of um of organisms that are biologically
based or non-biologically based
i mean that's the exciting future of
end-to-end learning systems
that step outside the game playing world
of starcraft of chess and go
and go into the life sciences of real
world systems that operate in the real
world
that's where tesla autopilot is really
exciting that's where
any robots that use machine learning are
really exciting
and that's where this big breakthrough
in the space of structural biology
is super exciting and truly to me as one
humble human
inspiring beyond words speaking of words
for me these quick videos are fun and
easy to make
and i hope it's uh at least somewhat
useful to you
if it is i'll make more it's fun i enjoy
it i love it
really quick shout out to podcast
sponsors vincero watches
the maker of classy well-performing
watches i'm wearing one now
and four sigmatic the maker of delicious
mushroom coffee
i drink it every morning and all day as
you can probably tell from my voice now
please check out these sponsors in the
description to get a discount and to
support this channel
alright love you all and remember try to
learn something new
every day