Transcript
F1ka6a13S9I • Nuts and Bolts of Applying Deep Learning (Andrew Ng)
Kind: captions
Language: en
So, when we were organizing this workshop, my co-organizers initially asked me, "Hey Andrew, at the end of the first day, go give a visionary talk." So until several hours ago my talk was advertised as a visionary talk. But as I was preparing this presentation over the last several days, I tried to think about what would be the most useful information for you: the things you could take back to work on Monday, and use to do something different at your job next Monday.

For context, as Pieter mentioned, I lead Baidu's AI team, a team of about a thousand people working on vision, speech, NLP, lots of applications of machine learning. So what I thought I'd do, instead of taking the shiniest pieces of deep learning that I know, is take the lessons I saw at Baidu that are common to so many different academic areas as well as applications (autonomous cars, augmented reality, advertising, web search, medical diagnosis) and share the common lessons, the simple, powerful ideas that I've seen help drive a lot of machine learning progress at Baidu. The patterns I see across a lot of projects, I thought, might be the patterns that would be most useful to you as well, whatever you're working on in the next several weeks or months.

One common theme that will appear in this presentation today is that the workflow of organizing machine learning projects feels like it is changing, in parts, in the era of deep learning. For example, one of the ideas I'll talk about is bias and variance. This is a super old idea, and many of you, maybe all of you, have heard of bias and variance, but in the era of deep learning I feel there have been some changes to the way we think about them. So I want to talk about some of these ideas which maybe aren't even deep learning per se, but have been slowly shifting as we apply deep learning to more and more of our applications.

Oh, and instead of holding all your questions until the end, if you have a question in the middle, feel free to raise your hand. I'm very happy to take questions in the middle, since this is a more informal whiteboard talk. And also, let's say hi to all the viewers at home.
So, one question that I still get asked sometimes, and Andrej alluded to this earlier: a lot of the basic ideas of deep learning have been around for decades, so why are they taking off just now? Why is it that deep learning, these neural networks we've known about for maybe decades, are working so well now? I think the single biggest trend in deep learning is scale: scale drives deep learning progress. Andrej mentioned scale of data and scale of computation, and I'll just draw a picture that illustrates that concept a little more.

If I plot a figure where the horizontal axis is the amount of data we have for a problem and the vertical axis is performance (so, say, the x-axis is the amount of spam data you've collected, and the y-axis is how accurately you can classify spam), then if you apply traditional learning algorithms, what we found was that performance often starts to plateau after a while. It was as if the older generations of learning algorithms, including support vector machines and logistic regression, didn't know what to do with all the data that we finally had. And what happened over the last twenty years, or the last ten years, with the rise of the internet, the rise of mobile, the rise of IoT, is that as a society we marched to the right of this curve, for many problems, though not all problems.

So, with all the buzz and all the hype about deep learning, in my opinion the number one reason deep learning algorithms work so well is this: if you train what I'm going to call a small neural net, maybe you get slightly better performance. If you train a medium-sized neural net, maybe you get even better performance. And it's only if you train a large neural net that you can train a model with the capacity to absorb all this data we have access to, which allows you to get the best possible performance. I feel like this is a trend we've seen in many verticals, in many application areas.

A couple of comments. One is that when I draw this picture, some people ask me, well, does this mean a small neural net always dominates a traditional learning algorithm? The answer is: not really. Technically, if you look at the small-data regime, the left end of this plot, the relative ordering of these algorithms is not that well defined. It depends on who's more motivated to engineer the features better. If the SVM person is more motivated to spend more time engineering features, they might beat out the neural network application, because when you don't have much data, a lot of the knowledge of the algorithm comes from hand engineering. But this trend is much more evident in the regime of big data, where you just can't hand-engineer enough features, and the large neural net combined with a lot of data tends to outperform.
The implication of this figure is that in order to get the best performance, in order to hit that target, you need two things: you need to train a very large (or at least reasonably large) neural network, and you need a large amount of data. This in turn has created pressure to train large neural nets, to build large nets, as well as to get huge amounts of data.

One of the other interesting trends I've seen is that, increasingly, I'm finding it makes sense to build an AI team as well as a computer systems team, and have the two teams sit next to each other. The reason I say that: when we started Baidu Research, we organized our team that way, and other teams are also organized this way; I think Pieter mentioned to me that OpenAI also has a systems team and a machine learning team. The reason we're starting to organize our teams that way, I think, is that some of the computer systems work we do (we have an HPC team, a high-performance computing, supercomputing team at Baidu) involves extremely specialized knowledge that is just incredibly difficult for an AI researcher to learn. Some people are super smart; maybe Jeff Dean is smart enough to learn everything. But it's just difficult for any one human to be sufficiently expert in HPC and sufficiently expert in machine learning. So we've been finding (and Shubho, actually, one of the co-organizers, is on our HPC team) that bringing talent and knowledge from these multiple sources, these multiple communities, allows us to get our best performance.
You've heard a lot of fantastic presentations today, and I want to draw one other picture, which is how I mentally bucket work in deep learning. This might be a useful categorization: when you watch the talks, you can mentally put each one into one of the buckets I'm about to draw.

I feel like there's a lot of work on what I'm going to call general DL, general models. This is basically the type of model Hugo Larochelle talked about this morning, where you have really densely connected, fully connected (FC) layers. There's a huge bucket of models there. A second bucket is sequence models, 1D sequences. This is where I would bucket a lot of the work on RNNs, LSTMs, GRUs, and some of the attention models, which I guess Yoshua will probably talk about tomorrow, or maybe others, maybe Quoc, I'm not sure. So the 1D sequence models are another huge bucket. The third bucket is the image models. This is really 2D, and maybe sometimes 3D, and this is where I would tend to bucket all the work on CNNs, convolutional nets. And then in my mental bucketing there's a fourth one, which is "other". This includes unsupervised learning and reinforcement learning, as well as lots of other creative ideas being explored. I still find slow feature analysis, sparse coding, various models in the "other" category super exciting.

It turns out that if you look across industry today, almost all the value today is driven by the first three buckets. What I mean is, those three buckets of algorithms are driving much better products, or monetizing very well; they're just incredibly useful for lots of things. In some ways I think the fourth bucket might be the future of AI. I find unsupervised learning especially super exciting, so I'm actually very excited about this as well. But I think that if on Monday you have a job and you're trying to build a product or whatever, the chance of you using something from one of the first three buckets will be highest. Though I definitely encourage you to contribute to research here as well.
So, I said that major trend one of deep learning is scale. What I would say is maybe major trend two (of two trends; this list is not going to go on forever) is the rise of end-to-end deep learning, especially for rich outputs. I'll say a little more in a second about exactly what I mean by that, but the examples I'm going to talk about are all from one of those three buckets: general DL, sequence models, and image (2D/3D) models. Let me illustrate with a few examples.

Until recently, a lot of machine learning used to output just real numbers. In Richard's example, you have a movie review (actually, I had prepared totally different examples; I was editing mine earlier to be more coherent with the speakers before me). So: you have a movie review, and you output the sentiment, is this a positive or a negative review? Or you might have an image and want to do object recognition: this would be a 0/1 output, or maybe an integer from 1 to 1,000. So until recently, a lot of machine learning was about outputting a single number, maybe a real number, maybe an integer.

The number two major trend, which I'm really excited about, is end-to-end deep learning algorithms that can output much more complex things than numbers. One example you've seen is image captioning, where instead of taking an image and saying "this is a cat," you can now take an image and output an entire string of text, using an RNN to generate that sequence. I guess Andrej, who spoke just now, I think Oriol Vinyals, folks at Baidu, a whole bunch of people have worked on this problem. Another one, which I guess my collaborator Adam Coates will talk about tomorrow (maybe Quoc as well, I'm not sure), is speech recognition, where you take audio as input and directly output the text transcript. When we first proposed using this kind of end-to-end architecture to do speech recognition (we were building on the work of Alex Graves), it was very controversial. The idea of actually putting this in a production speech system was very, very controversial when we first said we wanted to do it, but I think the whole community has been coming around to this point of view more recently. Or machine translation, say going from English to French: Quoc and others, a lot of teams, are working on that now. Or, given some parameters, synthesize a brand new image; you saw some examples of image synthesis.

So I feel like the second major trend of deep learning, one I find very exciting, and one that is allowing us to build transformative things we just couldn't build three or four years ago, is this trend toward learning algorithms that can output not just a number but very complicated things: a sentence, a caption, a French sentence, an image, or, as in the recent WaveNet paper, audio. So I think this is maybe the second major trend.
So, despite all the excitement about end-to-end deep learning, I think that end-to-end deep learning, sadly, is not the solution to everything. I want to give you some rules of thumb for what exactly end-to-end learning is, when to use it, and when not to use it.

The trend toward end-to-end deep learning has been this idea that instead of engineering a lot of intermediate representations, maybe you can go directly from your raw input to whatever you want to predict. For example (I'm going to use speech as a recurring example), for speech recognition, previously one used to go from the audio to hand-engineered features like MFCCs or something, then maybe extract phonemes, and then eventually try to generate the transcript. Oh, for those of you who aren't sure what a phoneme is: if you listen to the word "cat" and the word "kick", the "k" sound is the same sound. Phonemes are these basic units of sound, hypothesized by linguists to be the basic units of sound, so "k", "a", "t" would be maybe the three phonemes that make up the word "cat". Traditional speech systems used to do this, and I think in 2011 Li Deng and Geoff Hinton made a lot of progress in speech recognition by saying we can use deep learning to do that first step. But the end-to-end approach would be to say: let's forget about phonemes, let's just have a neural net input the audio and output the transcript.

It turns out that in some problems this end-to-end approach works. One "end" is the input, the other "end" is the output, so the phrase "end-to-end deep learning" refers to just having a neural net, or some learning algorithm, go directly from input to output; that's what end-to-end means. This end-to-end formula, I think, makes for great PR, and it's actually very simple, but it only works sometimes.

Maybe I'll tell an interesting story: this end-to-end story really upset a lot of people. When we were doing this work, I used to go around and say, "I think phonemes are a fantasy of linguists, and we should do away with them." I still remember there was a meeting at Stanford (some of you know who it was) where a linguist was kind of yelling at me in public for saying that. So maybe I shouldn't have said it that way, but we turned out to be right.
All right. So, the Achilles heel of a lot of deep learning is that you need tons of labeled data. If this is your x and that's your y, then for end-to-end deep learning to work you need a ton of labeled input-output data, (x, y) pairs. To take an example where one may or may not consider end-to-end deep learning: this is a problem I learned about just last week from Curtis Langlotz and a colleague who's in the audience, I think. Imagine you want to use X-ray pictures of a hand in order to predict a child's age. This is a real thing: doctors actually look at an X-ray of a child's hand in order to estimate the age of the child.

Let me draw an X-ray image. So this is the child's hand, and these are the bones. (I guess this is why I'm not a doctor.) Okay, so that's a hand, and you see the bones. A more traditional algorithm would input the image and first extract the bones: figure out there's a bone here, a bone here, a bone here, and then measure the lengths of these bones. So, bone lengths. And then maybe there's some formula, some regression, some simple thing, to go from the bone lengths to an estimate of the age of the child. That's a non-end-to-end approach to this problem. An end-to-end approach would be to take the image, run a ConvNet or whatever, and just try to output the age of the child directly.

I think this is one example of a problem where it's very challenging to get end-to-end deep learning to work, because you just don't have enough data: you just don't have enough X-rays of children's hands annotated with their ages. Instead, where we see deep learning coming in is in the first step, going from the image to figuring out where the bones are. The advantage of this non-end-to-end architecture is that it allows you to hand-engineer in more information about the system, such as how bone lengths map to age, which you can get tables for.

There are a lot of examples like this, and I think one of the unfortunate things about deep learning is that, for suitably sexy values of x and y, you can almost always train a model and publish a paper, but that doesn't always mean it's actually a good idea.

Pieter's pointing out that in practice, if this mapping is a fixed function f, you could backprop all the way from the age back to the image. Yeah, that's a good idea, actually. Who was it that just said you'd better do it quickly?
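The non-end-to-end pipeline described here (image, then bones, then bone lengths, then age) can be sketched roughly as follows. Everything in this sketch is a stand-in: the "detector" just reads off pre-annotated bone segments instead of running a real model, and the length-to-age coefficients are made-up numbers for illustration, not real growth-chart values.

```python
# Sketch of the non-end-to-end pipeline:
# image -> bone segments -> bone lengths -> age estimate.

def detect_bones(image):
    """Stand-in for the deep-learning bone detector: here we just
    read back pre-annotated bone segments as (x1, y1, x2, y2)."""
    return image["bone_segments"]

def bone_lengths(segments):
    """Euclidean length of each detected bone segment."""
    return [((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
            for (x1, y1, x2, y2) in segments]

def estimate_age(lengths, slope=0.15, intercept=1.0):
    """Toy regression from average bone length (in pixels) to age
    (in years). The coefficients are invented for illustration."""
    avg = sum(lengths) / len(lengths)
    return intercept + slope * avg

# Fake annotated X-ray with two bone segments.
xray = {"bone_segments": [(0, 0, 30, 40), (0, 0, 0, 60)]}
lengths = bone_lengths(detect_bones(xray))
age = estimate_age(lengths)
```

The point of the decomposition is exactly what's said above: the hand-coded `estimate_age` step is where you inject outside knowledge (the length-to-age tables), so the learned part only has to solve the data-rich bone-detection subproblem.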
Let me give a couple of other examples where it might be harder to backprop all the way through. Here's one: take self-driving cars. Most teams are using an architecture where you input an image of what's in front of the car, say, and then detect other cars, and maybe also use the image to detect pedestrians. (Self-driving cars are obviously more complex than this.) But now that you know where the other cars and the pedestrians are relative to your car, you then have a planning algorithm to come up with a trajectory. And now that you know what trajectory you want your car to drive through, you can compute the steering direction. This is actually the architecture that most self-driving car teams are using.

There have been interesting approaches that say, well, I'm going to input an image and output a steering direction. I think this is an example where, at least with today's data and technology, I'd be very cautious about the second approach. I think if you had enough data the second approach would work, and you could even prove a theorem showing that it will work, I think. But I don't know that anyone today has enough data to make the second approach really work well. And I think Pieter made a great comment just now: some of these components will be incredibly complicated. The planner could be doing explicit search; you could design a really complicated planner to generate the trajectory, and your ability to hand-code that still has a lot of value.

So this is one thing to watch out for. I have seen project teams say, "I can get x, I can get y, I'm going to train deep learning on it." But unless you actually have the data, some of these things make for great demos, if you cherry-pick the examples, but can be challenging to get to work at scale. I should say, for self-driving cars this debate is still open. I'm cautious about it; I don't think this approach will necessarily fail, I just think the data needed to do it will be really immense. So I'd be very cautious right now, but it might work if you have enough data.
So, one of the themes that comes up in machine learning: if you work on a machine learning project, one thing that'll often happen is that you develop a learning system, train it, and it doesn't work as well as you're hoping yet, and the question is what to do next. This is a very common part of a machine learning researcher's or a machine learning engineer's life: you train a model, it doesn't do what you want it to yet, so what do you do next? This happens to us all the time. And you face a lot of choices: you could collect more data, maybe train longer, maybe try a different neural network architecture, maybe try regularization, maybe a bigger model, maybe more GPUs. You have a lot of decisions, and I think a lot of the skill of a machine learning researcher or engineer is knowing how to make them. Your skill at picking between, say, training a bigger model and trying regularization will have a huge impact on how rapidly you can make progress on an actual machine learning problem.

So I want to talk a bit about bias and variance, since that's one of the most basic concepts in machine learning, and I feel like it's evolving slightly in the era of deep learning. As a motivating example, let's say the goal is to build a human-level speech recognition system. What we would typically do, especially in academia, is get a dataset, a lot of examples, shuffle it, and randomly split it into 70/30 train/test, or maybe 70% train, 15% dev, and 15% test. Oh, and some people use the term "validation set", but I'll just say "dev set"; "dev" stands for development set and means the same thing as a validation set. So that's pretty common.
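As a quick sketch, the shuffle-and-split step might look like this (the fractions and the fixed seed are just for illustration):

```python
import random

def split_dataset(examples, train_frac=0.70, dev_frac=0.15, seed=0):
    """Shuffle and split into train/dev/test, e.g. 70/15/15.
    'Dev' (development) set means the same thing as a validation set."""
    data = list(examples)
    random.Random(seed).shuffle(data)  # seeded for reproducibility
    n = len(data)
    n_train = round(n * train_frac)
    n_dev = round(n * dev_frac)
    train = data[:n_train]
    dev = data[n_train:n_train + n_dev]
    test = data[n_train + n_dev:]
    return train, dev, test

train, dev, test = split_dataset(range(1000))
```

Shuffling before splitting matters: if the data arrives sorted (by date, by speaker, and so on), a straight slice would give train and test sets with different distributions.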
What I encourage you to do, if you aren't already, is to measure the following things. First, human-level error. Let me illustrate with an example. Let's say that on your dev set, human-level error is 1%. Let's say your training set error is 5%, and your dev set error is 6%. (Really, the dev set is a proxy for the test set, except that you tune to the dev set.)

This is really a basic step in developing a learning algorithm that I encourage you to take if you aren't already: figure out these three numbers, because these three numbers really help tell you what to do next. In this example, you see that you're doing much worse than human-level performance: there's a huge gap here from 1% to 5%, and I'm going to call that gap the bias of your learning algorithm. (For the statisticians in the room: I'm using the terms bias and variance informally, and this doesn't correspond exactly to the way they're defined in textbooks, but I find them useful concepts for deciding how to make progress on your problem.) So I would say that in this example you have a high-bias problem: try training a bigger model, maybe try training longer. We'll come back to this in a second.

For a different example: if human-level error is 1%, training set error is 2%, and dev set error is 6%, then you really have a high-variance problem, an overfitting problem, and this really tells you what to try: try adding regularization, or try early stopping, or, even better, get more data.

And then there's a third case: 1% human-level error, 5% training error, and 10% dev set error. In this case you have high bias and high variance, which, you know, just sucks for you.
So, when I talk to applied machine learning teams, there's one really simple workflow that is enough to help you make a lot of decisions about what you should be doing on your machine learning application. (If you're wondering why I'm talking about this and what it has to do with deep learning, I'll come back in a second to how this changes in the era of deep learning.) It's almost a flow chart.

First, ask yourself: is your training error high? (Oh, and I hope I'm writing big enough; if you have trouble reading it, let me know and I'll read it back out.) So first ask: are you even doing well on your training set? If your training error is high, then you have high bias, and the standard tactics are: train a bigger model, just a bigger neural network; or maybe try training longer, making sure your optimization algorithm is doing a good enough job; and then there's also this magical one, a new model architecture, which is a hard one. (I'll come back to that in a second.) You keep doing that until you're doing well at least on your training set.

Once your training error is no longer unacceptably high, you then ask: is your dev set error high? If the answer is yes, then you have a high-variance problem, an overfitting problem, and the solutions are: try to get more data, or add regularization, or try a new model architecture. And you keep doing this until you're doing well on both your training set and your dev set, and then, hopefully, you're done.
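This flow chart is simple enough to write down as a tiny function. The 2% "acceptable gap" threshold below is my own illustrative number, not something from the talk; in practice you'd pick thresholds that make sense for your application.

```python
def next_action(human_error, train_error, dev_error, gap=0.02):
    """Sketch of the bias/variance flow chart: compare training error
    to human-level error (bias), then dev error to training error
    (variance). The `gap` threshold is illustrative."""
    advice = []
    if train_error - human_error > gap:   # step 1: training error high?
        advice.append("high bias: bigger model / train longer / new architecture")
    if dev_error - train_error > gap:     # step 2: dev error high?
        advice.append("high variance: more data / regularization / new architecture")
    return advice or ["done (for now)"]
```

Running the three worked examples through it: (1%, 5%, 6%) flags only bias, (1%, 2%, 6%) flags only variance, and (1%, 5%, 10%) flags both.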
I think one of the nice things about this era of deep learning is that, no matter where you're stuck, with modern deep learning tools we have a clear path for making progress, in a way that was not true, or at least was much less true, in the era before deep learning. In particular, no matter whether your problem is overfitting or underfitting, high bias or high variance, or maybe both, you always have at least one action you can take: bigger model, or more data. So in the deep learning era, relative to, say, the logistic regression era or the SVM era, it feels like we more often have a way out of whatever problem we're stuck in.

And so I feel like these days people talk less about the bias-variance trade-off. You might have heard that term: the bias-variance trade-off, underfitting versus overfitting. The reason we talked a lot about that in the past was that a lot of the moves available to us, like tuning regularization, really traded off bias against variance. It was like a zero-sum thing: you could improve one, but that would make the other one worse. But in the era of deep learning, really one of the reasons I think deep learning has been so powerful is that the coupling between bias and variance can be weaker. We now have better tools to reduce bias without increasing variance, or to reduce variance without increasing bias. And the big one is really that you can always train a bigger model, a bigger neural network, in a way that was harder when you were training logistic regression, where the analogue was coming up with more and more features; that was just harder to do.

(I'm going to add more to this diagram at the bottom in a second.) And by the way, I've been surprised, honestly: this "new model architecture" move is really hard, and takes a lot of experience. But even if you aren't super experienced with a variety of deep learning models, the things in the blue boxes you can often do, and those would drive a lot of progress. If you have experience with how to tune a ConvNet versus a ResNet versus whatever, by all means try those things as well; I definitely encourage you to keep mastering those. But this dumb formula of bigger model, more data is enough to do very well on a lot of problems.
So: bigger models put pressure on systems, which is why we have a high-performance computing team. More data has led to another interesting set of investments. A lot of us have always had this insatiable hunger for data: we use crowdsourcing for labeling, we try to come up with all sorts of clever ways to get data. One area where I'm seeing more and more activity (it feels a little bit nascent, but I'm seeing a lot of activity in it) is automatic data synthesis. Here's what I mean.

Once upon a time, people used to hand-engineer features, and there was a lot of skill in hand-engineering features like SIFT or HOG or whatever to feed into an SVM. Automatic data synthesis is this little area that is small but feels like it's growing, where there is some hand engineering needed, but I'm seeing quite a lot of progress on multiple problems enabled by hand-engineering synthetic data to feed into the giant maw of your neural network. Let me illustrate with a couple of examples.

One of the easy ones is OCR. So let's say you want to train an optical character recognition system (and actually, I've been surprised: at Baidu this has tons of users; it's actually one of our most useful APIs). Imagine firing up Microsoft Word, downloading a random picture off the internet, choosing a random Microsoft Word font, choosing a random word from an English dictionary, typing that English word into Microsoft Word in the random font, and pasting it, with a transparent background, on top of the random image from the internet. You've just synthesized a training example for OCR. This gives you access to essentially unlimited amounts of data. It turns out that the simple idea I just described won't work in its natural form: you actually need to do a lot of tuning to blur the synthesized text into the background and to make sure the color contrast matches your training distribution. We've found that in practice it can be a lot of work to fine-tune how you synthesize data, but I've seen in many verticals that if you do that engineering work (and sadly it's painful engineering), you can actually get a lot of progress. Tao Wang, who was a student here at Stanford, engineered this for months with very little progress, and then suddenly he got the parameters right, had huge amounts of data, and was able to build one of the best OCR systems in the world at that time.
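Here's a toy sketch of that synthesis idea, with plain lists of grayscale pixels standing in for real images and a solid bar standing in for rendered glyphs. A real pipeline would render actual fonts (e.g. with Pillow's `ImageDraw`) and, as noted, needs careful blurring and contrast matching, which this sketch deliberately omits.

```python
import random

WORDS = ["cat", "kick", "listen", "deep", "learning"]  # stand-in dictionary
FONTS = ["Arial", "Times", "Courier"]                  # stand-in font list

def random_background(w, h, rng):
    """Stand-in for a random picture off the internet: random grayscale."""
    return [[rng.randint(0, 255) for _ in range(w)] for _ in range(h)]

def composite(background, text_mask, blend=0.7):
    """Alpha-blend a rendered-text mask onto the background.
    Real pipelines tune blur and contrast here; we just blend."""
    return [[int((1 - blend) * bg + blend * tx) if tx else bg
             for bg, tx in zip(bg_row, tx_row)]
            for bg_row, tx_row in zip(background, text_mask)]

def synthesize_example(w=8, h=4, seed=0):
    """Produce one (image, label) OCR training pair."""
    rng = random.Random(seed)
    word, font = rng.choice(WORDS), rng.choice(FONTS)
    bg = random_background(w, h, rng)
    # Stand-in "rendering": a solid bar of ink where glyphs would go.
    mask = [[255 if 1 <= y <= 2 else 0 for _ in range(w)] for y in range(h)]
    return composite(bg, mask), word

image, label = synthesize_example()
```

The label comes for free because you chose the word yourself; that's what makes the data "essentially unlimited".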
Right — other examples: speech recognition. One of the most powerful ideas for building an effective speech system is to take clean audio — relatively noiseless audio — and take random background sounds and just synthesize what that person's voice would sound like in the presence of that background noise.
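A minimal sketch of this additive mixing — hypothetical helper name, assuming both clips are already loaded as float NumPy arrays at the same sample rate:

```python
import numpy as np

def add_background_noise(clean, noise, snr_db=10.0):
    """Superimpose background noise on clean speech by waveform addition,
    scaling the noise to a target signal-to-noise ratio in dB."""
    noise = np.resize(noise, clean.shape)   # loop or trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12   # avoid division by zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise
```

One hour of clean speech mixed against many different noise clips (and SNRs) multiplies the effective size of the training set.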
And this turns out to work remarkably well. If you record a lot of car noise — what the inside of your car sounds like — and record a lot of clean audio of someone speaking in a quiet environment, the mathematical operation is actually addition; it's superposition of sound. You basically add the two waveforms together, and you get an audio clip that sounds like that person talking in the car. You feed this to your learning algorithm, and this has a dramatic effect in terms of amplifying the training set for speech recognition — we found it can have a huge effect on performance.

And then also NLP. Here's one example, actually done by some Stanford students, which is using end-to-end deep learning to do grammar correction: input an ungrammatical English sentence — maybe written by a non-native speaker — and have, I guess, an attention RNN take the ungrammatical sentence and correct the grammar, just edit the sentence for me. It turns out you can synthesize huge amounts of this type of data automatically, so that's another example where data synthesis works very
well. And, oh, I think — video games and RL. Well, let me just say games broadly: one of the most powerful applications of deep RL these days is video games, and if you think supervised learning has an insatiable hunger for data, wait till you work on RL algorithms — the hunger for data is even greater. But when you play video games, the advantage is that you can synthesize almost infinite amounts of data to feed this even greater need that our RL algorithms have.
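Going back to the grammar-correction example: one cheap way to synthesize (ungrammatical, correct) sentence pairs is to corrupt clean text programmatically — here a deliberately simple, hypothetical corruption that just drops articles, standing in for the richer error models a real system would use:

```python
ARTICLES = {"a", "an", "the"}

def make_training_pair(sentence):
    """Return (corrupted, original): the corrupted sentence, with all
    articles dropped, is the model input; the original is the target."""
    corrupted = " ".join(w for w in sentence.split()
                         if w.lower() not in ARTICLES)
    return corrupted, sentence
```

Applying this to any large corpus of clean English yields unlimited supervised pairs for a sequence-to-sequence corrector.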
So, just one note of caution: data synthesis has a lot of limits. I'll tell you one other story. Let's say you want to recognize cars. There are a lot of video games — I need to play more video games — what's a video game with cars in it? Oh, GTA, Grand Theft Auto. So there are a bunch of cars in Grand Theft Auto; why don't we just take pictures of cars from Grand Theft Auto? You can synthesize lots of cars in lots of orientations there and give that as training data. It turns out that's difficult to do, because to the human perceptual system there might be 20 cars in the game and it looks great to you — you can't tell whether there are 20 cars or a thousand cars in the game. So there are situations where the synthetic data set looks great to you, because 20 cars in a video game is plenty — it turns out you don't need a hundred different cars for a human to think it looks realistic — but from the perspective of the learning algorithm this is a very impoverished, very poor data set. So I think there's a lot still to be sorted out for data synthesis.
For those of you who work in companies, one practice I would strongly recommend is to have a unified data warehouse. What I mean is that if your engineering teams and your research teams are going around trying to accumulate data from lots of different organizations in your company, that's just going to be a pain — it's going to be slow. So at Baidu our policy is: it's not your data, it's the company's data, and if it's user data, it goes into one user data warehouse. We should have a discussion about user access rights, privacy, and who can access what data, but at Baidu I felt very strongly — so we mandate — that this data needs to come into one logical warehouse. It's physically distributed across lots of data centers, but it should be in one system. What we should discuss is access rights; what we should not discuss is whether or not to bring the data together into as unified a data warehouse as possible. This is another practice I've found makes accessing the data much smoother and allows teams to drive performance. So really, if your boss asks, tell them I said so: build a unified data warehouse.

So I want to take the
train/test — you know, bias-variance — picture and refine it. It turns out this idea of a 70/30 split, train/test or whatever, was common in machine learning in the past, when, frankly, most of us in academia were working on relatively small data sets. There used to be this thing called the UC Irvine repository of machine learning data sets — amazing resource at the time, but by today's standards the data sets are quite small — and you'd download a data set, shuffle it, and split it into train/dev/test or whatever. In production machine learning today, it's much more common for your train and your test distributions to come from different distributions, and this creates new problems and new ways of thinking about bias and variance, so let me
talk about that. Actually, here's a concrete example, and this is a real example from Baidu. We'd built a very effective speech recognition system, and then recently — actually quite some time back now — we wanted to launch a new product that uses speech recognition: a speech-enabled rearview mirror. If you have a car that doesn't have a built-in GPS unit — this is a real product in China — we want to let you take out your rearview mirror and put in a new, AI-powered, speech-enabled rearview mirror, because it's an easy aftermarket installation. You can speak to the rearview mirror and say, "Dear rearview mirror, navigate me to wherever." So this is a real product. How do you build a speech recognition system for this in-car, speech-enabled rearview mirror?

Here's our status: we have, let's call it, 50,000 hours of speech recognition data from all sorts of places — some user data we bought and have permission to use, a lot of data collected from all over — but not from your in-car rearview mirror scenario. And then our product managers can go around and, through quite a lot of work, let's say collect 10 more hours of data from exactly the rearview mirror scenario: install the thing in a car, drive around, talk to it, and collect 10 hours of data from exactly the distribution you want to test on. So the question is: what do you do now? Do you throw the 50,000 hours of data away because it's not from the right distribution, or can you use it in some way?

In the older, pre-deep-learning days, it was more common to build separate models: one speech model for the rearview mirror, one model for maps voice queries, one model for search. In the era of deep learning, it's becoming more and more common to just pour all the data into one model and let the model sort it out, and so long as your model is big enough — and if you get the features right — you can usually pile all the data into one model and often see gains, and usually not see any losses. But the question is: given this data set, how do you split it into train/dev/test? Here's one thing you could do: call the 50,000 hours your training set, and split the 10 hours into your dev set and your test set. It turns out this is a bad idea; I would not do this. One of the best practices we've derived is: make sure your development set and test sets are from the same
distribution. I've been finding that this is one of the tips that really boosts the effectiveness of a machine learning team. In particular, I would make the 50,000 hours the training set, and then of my 10 hours — let me expand this a little bit — a much smaller data set: maybe five hours of dev, five hours of test. The reason is that your team will be working to tune things on the dev set, and the last thing you want is for them to spend three months working against the dev set and then realize, when they finally test, that the test set is totally different — a lot of work is wasted. To make an analogy, having different dev and test set distributions is a bit like if I tell you, "Hey everyone, let's go north," and then a few hours later, when all of you are in Oakland, I say, "Wait — I wanted you to be in San Francisco." And you go, "What? Why did you tell me to go north? Tell me to go to San Francisco." So having the dev and test sets be from the same distribution is one of the ideas I've found really optimizes a team's efficiency, because the development set — which is what your team is going to be tuning its algorithms to — is really the problem specification. If your problem specification tells them to go here, but you actually want them to go there, you're going to waste a lot of effort. So, when possible — it isn't always possible; there are some caveats — when it's reasonable to do so, having dev and test from the same distribution really improves the team's efficiency.

Another thing: once you specify the dev set and the test set, that's your problem specification. Your team might go and collect more training data, or change the training set, or synthesize more training data, but you shouldn't change the test set, if the test set is your problem specification.
So in practice, what I actually recommend is splitting the training set as follows. From your training set, carve off a small part — let me just say 20 hours of data — to form what I'm going to call the training-dev set ("train-dev" set): basically a development set drawn from the same distribution as your training set. Then you have your dev set and your test set, which are from the distribution you actually care about. And you have your training set — 50,000 hours of all sorts of data, where maybe we aren't even entirely sure what data it is — with just a small part split off, so I guess that's now 49,980 hours and 20 hours. And then here's the generalization of the bias-variance concept. Actually, let me use this
board. But let me say: the fact that training and test sets don't match is one of the problems that academia doesn't study much. There's some work on domain adaptation — there is some literature on it — but it turns out that when you train and test on different distributions, it's sometimes just random, a bit of luck, whether you generalize well to a totally different test set. That's made it hard to study systematically, which I think is why academia has not studied this particular problem as much as I feel it matters to those of us building production systems. There is some work, but no very widely deployed solutions yet, would be my sense.

So I think our best practice, generalizing what I was describing just now, is the following: measure human-level performance, measure your training set performance, measure your training-dev performance, measure your dev set performance, and measure your test set performance. So now you have five numbers. To take an example — and I'm going to use very obvious examples for illustration — let's say human level is 1% error, your training set performance is 10%, train-dev is 10.1%, and dev is 10.2%. In this example, it's quite clear that you have a huge gap between human-level performance and training set performance, so you have a huge bias, and you'd use the bias-fixing types of solutions.

I find that one of the most useful things in machine learning is to look at the aggregate error of your system — which in this case is your dev set or test set error — and then break down the components, to figure out how much of your error comes from where, so you know where to focus your attention. The difference between human level and training error — maybe 9% here — is bias, which is a lot, so I would work on bias reduction techniques. The gap between training and train-dev error is really the variance. The gap between train-dev and dev error is due to your train/test distribution mismatch, and the gap between dev and test error is overfitting of the
dev set. OK, so just to be really concrete, here's an example where you have a big train/test mismatch: human-level performance is 1%, your training error is 2%, your train-dev error is 2.1%, and then on your dev set the error suddenly jumps to 10%. (Sorry, my x-axis doesn't perfectly line up.) If there's a huge gap there, then I would say you have a huge train/test mismatch problem.

So, at this basic level of analysis, in this formula for machine learning, instead of "dev" I would replace this with "train-dev". And then, in the rest of this recipe for machine learning, I would ask: is your dev error high? If yes, then you have a train/test mismatch problem, and there the solutions would be to try to get more data that's similar to the test set, or maybe data synthesis or data augmentation — try to tweak your training set to make it look more like your test set — and then there's always this kind of Hail Mary, I guess, which is a new architecture.
Right — and then finally, just to finish this up (there's not that much more): hopefully, if you've done all this, your test set error will be good. And if you're doing well on your dev set but not your test set, it means you've overfit your dev set, so just get some more dev set data. Actually, I'll just write this: is the test set error high? If yes, then just get more dev data, and then done. Sorry if this is not too legible — what I wrote here is: if your dev set error is not high but your test set error is high, it means you've overfit your dev set, so get more dev set data. OK.
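The five-number analysis above can be wrapped into a small helper — a sketch with hypothetical names that turns the measured error rates into the four gaps and points at the dominant problem:

```python
def diagnose(human, train, train_dev, dev, test):
    """Decompose error (all rates in %) into the four gaps:
    bias, variance, train/test mismatch, and dev-set overfitting."""
    gaps = {
        "bias": train - human,             # human level -> training error
        "variance": train_dev - train,     # training -> train-dev error
        "train/test mismatch": dev - train_dev,
        "dev overfitting": test - dev,
    }
    worst = max(gaps, key=gaps.get)
    return worst, gaps
```

On the first example above (human 1%, train 10%, train-dev 10.1%, dev 10.2%, with, say, a matching 10.2% test error), the 9% bias gap dominates, so bias-reduction techniques are the right lever.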
So one of the effects I've seen with bias and variance is: it sounds so simple, but it's actually much more difficult to apply in practice than it sounds when I talk about it or you read about it in text. So, some tips: for a lot of problems, just calculate these numbers, and they can help drive your analysis in terms of deciding what to do. I find that it takes surprisingly long to really grok, to really understand, bias and variance deeply, but I find that people who understand bias and variance deeply are often able to drive very rapid progress in machine learning applications. And I know it's much sexier to show you some cool new network architecture, but this really helps our teams make rapid progress on things.
So there's one thing I kind of snuck in here without making it explicit, which is that in this whole analysis we were benchmarking against human-level performance. That's another trend, another thing that's been different — again, I'm looking across a lot of projects I've seen in many areas and trying to pull out the common trends — I find that comparing to human-level performance is a much more common theme now than several years ago, with, I guess, Andrej being the human-level benchmark for ImageNet. And really, at Baidu we compare our speech system to human-level performance and try to exceed it, and so on.

So why is that? Why is human-level performance such a common theme in applied deep learning? It turns out that if the x-axis is time — as in, how long you've been working on a project — and the y-axis is accuracy, and this line is human-level performance — human-level accuracy on some task — you find that for a lot of projects, your teams will make rapid progress up until they get to human-level performance, then often surpass human-level performance a bit, and then progress often gets much harder after that. This is a common pattern I see in a lot of problems. There are multiple reasons why this is the case — I'm curious, why do you think this is the case? Any guesses?
[Audience] The labels are coming from humans? Cool — yep, labels coming from humans. Anything else? ... Anything else? Oh, interesting — it's modeled after the human brain. Yeah, I don't know — I think the distance from neural networks to human brains is very far, so that one I'd, uh... I see — human capacity, something similar; yeah, kind of, all close. OK, one more and then I'll move on. Oh — we'd be satisfied. OK, cool — you're satisfied and bored; two sides of the same coin, I guess. All right,
so — oh, [inaudible]. Yeah, cool. So let me — those are all lots of great answers. I think there are several good reasons for this type of effect. One of them is that for a lot of problems there is some theoretical limit of performance. Some fraction of the data is just noisy: in speech recognition, a lot of audio clips are just noisy — someone picked up a phone at a rock concert or something, and it's just impossible to figure out what on earth they were saying — or some images are just so blurry it's impossible to figure out what this is. So there is some upper limit, a theoretical limit of performance, called the optimal error rate — the Bayesians would call it the Bayes rate — some theoretical optimum where even the best possible function, with the best possible parameters, cannot do better, because the input is just noisy and sometimes impossible to
label. So it turns out that humans are pretty good at a lot of the tasks we do — not all, but humans are actually pretty good at speech recognition, pretty good at computer vision — and so by the time you surpass human-level accuracy, there might not be a lot of room to go further up. That's one reason: humans are just pretty good.

Other reasons — I think a couple of people said this — it turns out that so long as you're still worse than humans, you have better levers to make progress. So: while worse than humans, you have good ways to make progress. Some of those ways: a couple of you mentioned you can get labels from humans. You can also carry out error analysis — error analysis just means look at your dev set, look at the examples your algorithm got wrong, and see if the humans have any insight into why a human thought this was a cat but the algorithm thought it was a dog, or why a human recognized this utterance correctly but your system mistranscribed it. And then I think another reason is that it's easier to estimate bias and variance effects. Here's what I
mean. To take another concrete example, let's say you're working on some image recognition task. If I tell you that your training error is 8% and your dev error is 10%, then whether you should work on bias reduction techniques or on variance reduction techniques is actually very unclear. If I tell you that humans get 7.5%, then you're pretty close to human level on the training set, and you'd think you have more of a variance problem. If I tell you humans get 1% error, then you know that even on the training side you're doing way worse than humans, so maybe you should build a bigger network or something. So this piece of information about where humans are — and I think of human performance as a proxy, an approximation, for the Bayes error rate, the optimal error rate — really tells you where you should focus your effort, and therefore increases the efficiency of your team. But once you surpass human-level performance — I mean, if even humans got 30% error — then this becomes tougher: you no longer have a proxy for estimating the Bayes error rate to decide how to improve performance. So, you know,
there are definitely lots of problems where you surpass human-level performance and keep getting better and better, but I find that my life building deep learning applications is often easier until we surpass human-level performance — we have much better tools. And after we surpass human-level performance — well, actually, one of the details: what we usually try to do is find subsets of data where we still do worse than humans. So for example, right now we surpass human-level performance for speech accuracy on short audio clips taken out of context, but we find, for example, that we're still way worse than humans on one particular type of accented speech. Even if we're much better than humans in the aggregate, if we find we're much worse than humans on a subset of the data, then all these levers still apply. But this is kind of an advanced topic, where you segment the training set and analyze separate subsets of the training set.
Yeah — I see. Actually, that's a wonderful question. I want to ask a related quiz question to everyone in the audience, and then we'll come back to what Alex just said. So, given everything we just said, I have another quiz for you. I'm going to pose a question, write down four choices, and then ask you to raise your hand to vote for what you think is the right answer. OK. So, I talked about how the concept of human-level accuracy is useful for driving machine learning progress. So how do you define human-level performance? Here's a good example: I spend a lot of time working on AI for healthcare, so a lot of medical examples are in my head right now. Let's say you want to do medical imaging for medical diagnosis — read medical images and tell whether your patient has a certain disease or not. So, a medical example. My question to you is: how do you define human-level performance? Choice A is a typical human — a non-doctor — and let's say the error rate at reading a certain type of medical image is 3%.
Right. Choice B is a typical doctor — let's say a typical doctor makes 1% error. Or, C, I can find an expert doctor — let's say an expert doctor makes 0.7% error. Or, D, I can find a team of expert doctors — and what I mean is, if I find a team of expert doctors, have the team look at every image, debate and discuss, and come to the team's best guess of what's happening with this patient, let's say I can get 0.5% error. So think for a few seconds; I'll ask you to vote by raising your hands: which of these is the most useful definition of human-level error, if you want to use it to drive the performance of your algorithm?

OK, so who thinks choice A? Raise your hand. (Oh, sure — yeah, don't worry about the ease of obtaining this data; which is the most useful definition?) Choice A — anyone? Just a couple of people. Choice B? Cool, maybe a fifth. Choice C, expert doctor? Another fifth. Choice D? Oh, cool — wow, interesting. All right, so
I'll tell you: I think that for the purpose of driving machine learning progress — and ignoring the cost of collecting the data; that was a great question — I would find choice D the most useful definition. Because a lot of what we're trying to use human-level performance as a proxy for is the Bayes rate, really the optimal error rate — really, to measure the baseline level of noise in your data. If a team of human doctors can get 0.5%, then you know the mathematically optimal error rate has got to be 0.5% or maybe even a little better. And so, for the purpose of using this number to drive all these decisions, such as estimating bias and variance, that definition gives you the best estimate of bias, because you know the Bayes error rate is 0.5% or lower.

In practice, because of the cost of getting labels and so on, I would fully expect teams to use choice B. And by the way, publishing papers is different — the goal of publishing papers is different from the goal of actually building the best possible product. For the purpose of publishing papers, people like to say, "Oh, we're better than human level," so for that, I guess, using choice B would be what many people do. And if you're actually trying to collect data, there'll be some tiering: get a typical doctor to label the example; if they're unsure, hire an expert doctor; if they're still unsure, then find a team — so for the purpose of data collection, you'd use other processes. But for the mathematical analysis, I would tend to use 0.5% as my definition for that number.

Question in the back? Oh — is it possible that a team of expert doctors does worse than a single doctor? I don't know — I'd have to ask the doctors in the audience.
All right — just two more pages and then I'll wrap up. So one of the reasons I think that, in the era of deep learning, we refer to human-level performance much more, frankly, is that for a lot of these tasks we are approaching human-level performance. Maybe to continue this example: when your training set error in computer vision was, you know, 30% and your dev error was like 35%, it didn't really matter whether human-level performance was 1% or 2% or 3% — it didn't affect your decisions that much, because you were just so clearly far from the Bayes rate. But now, as more and more deep learning systems are approaching human-level performance on a lot of these tasks, measuring human-level performance actually gives you very useful information to drive decision making. And so honestly, for a lot of the teams I work with, when I meet with them, a very common piece of advice is: go figure out what human-level performance is — go spend some time having humans label and get that number — because that number is useful for driving some of these
decisions.

So, just two last things and then we'll finish. One question I get asked a lot is: what can AI do — really, what can deep learning do? And I guess — maybe this is again a company thing — with the rise of AI: in Silicon Valley, we've developed pretty good workflows for designing products in the desktop era and in the mobile era, with processes like: the product manager draws a wireframe, the designer does the visual design — or they work together — and then the programmer implements. So we have well-defined workflows for how to design typical apps like the Facebook app or the Snapchat app or whatever; we have workflows established in companies to design stuff like that. In the era of AI, I feel like we don't have good processes yet for designing AI products. So for example, how should a product manager specify, I don't know, a self-driving car? How do you specify the product definition? How does a product manager specify what level of accuracy is needed for my cat detector? So today in Silicon Valley, with AI working better and better, I find us inventing new processes in order to design AI products — processes that really didn't exist before. But one of the questions I often get asked, sometimes by product people or business people, is: what can AI do? Because when a product manager is trying to design a new thing, it's nice if we can help them know what they can design and what they can't — where there's no way we can build it. So I want to give you some rules of thumb that are far from perfect, but that I've found useful for thinking about what AI can do. Oh —
before I tell you the rules I use, here's one rule of thumb that a product manager I know was using, which is: assume that AI can do absolutely anything. And this actually wasn't terrible — it actually led to some good results — but I want to give you some more nuanced ways of communicating about modern deep learning in these sorts of organizations. One is: anything that a typical person can do in less than one second. I know this rule is far from perfect — there are a lot of counterexamples — but it's one of the rules I've found useful: if it's a task that a normal person can do with less than one second of thinking, there's a very good chance we can automate it with deep learning. So, given a picture, tell me if the face in this picture is smiling or frowning — you don't need to think for more than a second, so yes, we can build deep learning systems to do that really well. Or speech recognition — listen to this audio clip, what did they say? You don't need to think for that long, less than a second. So this covers a lot of the perception work — in computer vision, in speech — that deep learning is working on. This rule of thumb works less well for NLP, I think, because humans just take time to read text a bit. But right now at Baidu we have a bunch of product managers looking around for tasks that humans can do in less than one second, to try to automate them. So this is a highly flawed but still useful rule of thumb.
Question? I see — yeah. Yeah, actually, great question. I feel like a lot of the value of deep learning — a lot of the concrete short-term applications — has been in trying to automate things that people can do, especially things people can do in a very short time, and this feeds into all the advantages you get when you're trying to automate something that a human can already do. Oh, I see — that's an interesting observation: if a human can label it in less than a second, you can get a lot of data. Yeah, good observation — cool.
And then I think the other huge bucket
of deep learning applications that I've
seen create tons of value is predicting
the outcome of the next event in a
sequence of events. So if there's
something that happens over and over,
such as, you know, maybe not super
inspiring, we show a user an ad, that
happens a lot, and the user clicks on it
or doesn't click on it. We have tons of
data to predict whether the user will
click on the next ad, probably the most
lucrative application of deep learning
today. Or, you know, Baidu runs a food
delivery service, so we've seen a lot of
data of, if you order food from this
restaurant to go to this destination at
this time of day, how long does it take?
We've seen that a ton of times, so we're
very good at predicting, if you order
food, how long it will take to send that
food to you. So I feel like, you know,
deep learning does so much stuff, and
I've struggled a bit to come up with
simple rules to explain, really to
product managers, how to design around
it. I've found these two rules useful,
even though I know they are clearly
highly flawed and there are many, many
counterexamples.
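This second bucket, predicting the next outcome in a stream of repeated events, is just ordinary supervised learning on logged data. As an illustrative sketch only (nothing Baidu-specific; the features, data, and weights here are all made up), here is a minimal click-prediction model: logistic regression fit by gradient descent on synthetic ad-impression logs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "ad impression" log: two hypothetical features per
# impression (say, relevance score and page position) plus a
# click / no-click label drawn from a known true model.
n = 2000
X = rng.normal(size=(n, 2))
true_w = np.array([2.0, -1.0])
p = 1.0 / (1.0 + np.exp(-(X @ true_w)))
y = (rng.random(n) < p).astype(float)

# Logistic regression fit by plain gradient descent on log loss.
w = np.zeros(2)
lr = 0.1
for _ in range(500):
    pred = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted click prob
    grad = X.T @ (pred - y) / n             # gradient of mean log loss
    w -= lr * grad

# Predicted click probability for one new impression.
click_prob = 1.0 / (1.0 + np.exp(-(np.array([1.0, 0.0]) @ w)))
print(round(float(click_prob), 2))
```

The point of the pattern is simply that when the same event repeats millions of times with a logged outcome, you get the labels for free, which is exactly the "lots of data" regime where deep (or even shallow) models do well.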
So I'd say it's an exciting time for
deep learning, because I think it's
letting us do a lot of interesting
things. It's also causing us to rethink
how we organize our companies, like
should you build a systems team next to
an AI team, how we organize the workflow
and process for a product. So I think
there's a lot of excitement going on. The last
thing I want to do is, you know, I've
found that the number one question I get
asked is how do you build a career in
machine learning. And I think, when I
did a Reddit ask me anything, a Reddit
AMA, that was one of the questions that
was asked. Even today a few people came
up to me and said, you know, after
taking a machine learning course, the
machine learning course on Coursera or
something else, what advice do you have
for building a career in machine
learning? I have to admit I don't have
an amazing answer to that, but since I
get asked that so often, and because I
really want to think about what would be
the most useful content to you, I
thought I'd at least attempt an answer,
even though it is maybe not a great one.
So this is the last thing I have, which
is kind of personal advice.
myself this same question uh uh uh like
a couple months ago right which is you
know after you've taken a machine
learning course
um what's the next step for um
developing your machine learning career
and at that time I thought um the best
thing would be if you attend deep
learning
school so so so Sammy Peter and I got
together to do this I hope um this is
really part of motivation um and then
and then beyond that right what what are
the things that that that really help um
so I do have had actually I think all of
our organizations we've had quite a lot
of people want to move from non-machine
learning into machine learning and when
I look at the career paths um you know
one common thing is after taking these
courses to work on a project by yourself
right I've seen I have a lot of respect
for kago a lot of people actually pass
in kago and learn from the blogs there
and then and then become better and
better at it um but I want to share with
you one other thing I haven't really
shared oh by the way almost everything I
talked about today is is is new content
that I've never presented before right
so so I I don't know as I hope it worked
okay thank you
thank
you so I want to share of you really the
the I want to think of is a PhD student
process right which is you know a lot of
um uh uh people really when I was
teaching full-time at Stanford a lot of
people joined Stanford and ask me you
know how do I become a machine learning
researcher how do I have my own ideas on
how to push the bleeding edge of machine
learning and um whether you know you're
working robotics or machine learning or
or something else right there's one PhD
student process that I find has been
incredibly reliable um and um and and
I'm going to say it and you may or may
not trust it but I've seen this work so
Rel livly so many times that I hope you
take my word for it that this process
reliably turns non-machine learning
researchers into you know I very good
machine learning researchers which is um
and there's no magic really read a lot
of
papers and work on replicating results
And I think the human brain is a
remarkable device. You know, people
often ask me, how do you have new ideas?
And I find that if you read enough
papers and replicate enough results, you
will have new ideas on how to push
forward the state of the art. I don't
really know how the human brain works,
but I've seen this be an incredibly
reliable process: read enough papers,
somewhere between 20 and 50 of them, and
it's not one or two, it's more like 20
or maybe 50, and you will start to have
your own ideas. This has been, see,
Samy's nodding his head, this is an
incredibly reliable process. And then my
other piece of
advice is, so sometimes people ask me
what work in AI is like, and I think
some people have this picture that when
we work on AI, you know, at Baidu or
Google or OpenAI or wherever, I think
some people have this picture of us
hanging out in these airy, well-lit
rooms with natural plants in the
background, and we're all standing in
front of a whiteboard discussing the
future of humanity. And, you know,
working on AI is not like that. Frankly,
almost all we do is dirty work. So one
place that I've seen people get tripped
up is when they think working on AI is
that future-of-humanity stuff and shy
away from the dirty work. And dirty work
means anything from going on the
internet and downloading and cleaning
data, or downloading a piece of code and
tuning parameters to see what happens,
or debugging your stack trace to figure
out why this silly thing overflowed, or
optimizing the database, or hacking a
GPU kernel to make it faster, or reading
a paper and struggling to replicate the
result. In the end a lot of what we do
comes down to dirty work, and yes, there
are moments of inspiration, but I've
seen people really stall if they refuse
to get into the dirty work. So my advice
to you is,
actually, another place I've seen people
stall is if they only do dirty work:
then you can become great at data
cleaning but not become better and
better at having your own moments of
inspiration. So one of the most reliable
formulas I've seen is really if you do
both of these: dig into the dirty work,
like if your team needs you to do some
dirty work, just go and do it, but in
parallel read a lot of papers. I think
the combination of these two is the most
reliable formula I've seen for producing
great researchers.
So I want to close with just one more
story about this, and I guess some of
you may have heard me tell the Saturday
story, but for those of you that want to
advance your career in machine learning,
you know, next weekend you have a
choice. Next weekend you can either stay
at home and watch TV, or you could do
this, and it turns out this is much
harder, and there are no short-term
rewards for doing this. I think this
weekend you guys are all doing great,
but if you spend next weekend studying,
reading papers, replicating results,
there are no short-term rewards. If you
go to work the following Monday, your
boss doesn't know what you did, your
peers don't know what you did, no one's
going to pat you on the back and say
good job, you spent all weekend
studying. And realistically, after
working really, really hard next
weekend, you're not actually that much
better, you're barely any better at your
job, so there's pretty much no reward
for working really, really hard all of
next weekend. But I think the secret to
advancing your career is this: if you do
this not just for one weekend, but
weekend after weekend for a year, you
will become really good at this. In
fact, almost everyone I've worked with
at Stanford that stayed close to this
and became great at it, everyone,
actually, including me, we all spent
late nights hunched over a neural net
tuning hyperparameters, trying to figure
out why it wasn't working, and it was
that process of doing this not just one
weekend but weekend after weekend that
allowed all of us, really our brains'
neural networks, to learn the patterns
that taught us how to do this. So I hope
that even after this weekend you keep on
spending the time to keep learning,
because I promise that if you do this
for long enough, you will become really,
really good at deep learning. So just
to wrap up, you know, I'm super excited
about AI. I've been making this analogy
that AI is the new electricity, and what
I mean is that, just as about a hundred
years ago electricity transformed
industry after industry, electricity
transformed agriculture, manufacturing,
transportation, communications, I feel
like those of you that are familiar with
AI are now in an amazing position to go
in and transform not just one industry
but potentially a ton of industries. I
guess at Baidu I have a fun job trying
to transform not just one industry but
multiple industries, but I see that it's
very rare in human history where someone
like you can gain the skills and do the
work to have such a huge impact on
society. I think in Silicon Valley the
phrase change the world is overused;
every Stanford undergrad says I want to
change the world. But for those of you
that work in AI, I think the path from
what you do to actually having a big
impact on a lot of people, helping a lot
of people in transportation and
healthcare and logistics and whatever,
is actually becoming clearer and
clearer. So I hope that all of you will
keep working hard even after this
weekend and go do a bunch of cool stuff
for humanity. Thank you.
[Applause]
Thank you, thank you.
Should we make any announcements? We're
running super late, so I'll be around
later. Okay, so let's break for today,
and look forward to seeing everyone
tomorrow. Thank you.