Deep Learning Basics: Introduction and Overview

O5xeyoRL95U • 2019-01-11

Kind: captions
Language: en
welcome everyone to 2019 it's really
good to see everybody here make it in
the cold this is 6.S094 deep
learning for self-driving cars it is
part of a series of courses on deep
learning that we're running throughout
this month the website that you can get
all the content the videos the lectures
and the code is deeplearning.mit.edu
the videos and slides will be
made available there along with a github
repository that's accompanying the
course assignments for registered
students will be emailed later on in the
week and you can always contact us with
questions concerns comments at hcai
human-centered AI at mit.edu so let's
start through the basics the
fundamentals to summarize in one slide
what is deep learning it is a way to
extract useful patterns from data in an
automated way with as little human effort
involved as possible
hence the automated how the fundamental
aspect that we'll talk about a lot is
the optimization of neural networks the
practical nature that we'll provide
through the code and so on is that
there's libraries that make it
accessible and easy to do some of the
most powerful things in deep learning
using Python tensorflow and friends the
hard part always with machine learning
artificial intelligence in general is
asking good questions and getting good
data a lot of times the exciting aspects
of what the news covers and a lot of
the exciting aspects of what is
published at the prestigious
conferences on arXiv and in blog
posts is the methodology the hard part is
applying that methodology to solve real
world problems to solve fascinating
interesting problems and that requires
data that requires
asking the right questions of that data
organizing that data and labeling
selecting aspects of that data that can
reveal the answers to the questions you
ask so why has this breakthrough over
the past decade of the application of
neural networks the ideas in neural
networks what has happened what has
changed they've been around since the
1940s and ideas have been percolating
even before the digitization of
information data the ability to access
data easily in a distributed fashion
across the world all kinds of problems
now have a digital form they can be
accessed by learning algorithms hardware
compute both the Moore's
law of CPU and GPU and ASICs
Google's TPU systems Hardware that
enables the efficient effective
large-scale execution of these
algorithms community people here people
all over the world
being able to work together to talk to
each other to feed the fire of
excitement behind machine learning
github and beyond the tooling as we'll
talk about TensorFlow PyTorch and
everything in between that enables a
person with an idea to reach a solution
in less and less and less time higher
and higher levels of abstraction empower
people to solve problems in less and
less time with less and less knowledge
where the idea and the data become the
central point not the effort that takes
you from an idea to the solution and
there's been a lot of exciting progress
some of which we'll talk about from face
recognition to the general problem of
scene understanding image classification
to speech text natural language
processing transcription translation in
medical applications and medical
diagnosis and cars
being able to solve many aspects of
perception in autonomous vehicles with
drivable area lane detection object
detection digital assistants the ones on
your phone and beyond the ones in your
home ads recommender systems from
Netflix to search to social Facebook and
of course deep reinforcement learning
successes in the playing of games from
board games to StarCraft and Dota let's
take a step back deep learning is more
than a set of tools to solve practical
problems
Pamela McCorduck said in '79 AI began
with the ancient wish to forge the gods
throughout our history throughout our
civilization human civilization we've
dreamed about creating echoes of
whatever is in this mind of ours in the
machine in creating living organisms
from the popular culture in the 1800s
with Frankenstein to Ex Machina this
vision this dream of understanding
intelligence and creating intelligence
has captivated all of us and deep
learning is at the core of that because
there's aspects of it the learning
aspects that captivate our imagination
about what is possible given data and
methodology what learning learning to
learn and beyond how far that can take
us
and here visualized is just 3% of the
neurons and one millionth of the
synapses in our own brain this
incredible structure that's in our mind
and there's only echoes of it small
shadows of it in our artificial neural
networks that we're able to create but
nevertheless those echoes are inspiring
to us the history of neural networks on
this pale blue dot of ours started quite
a while ago with summers and winters
with excitements and
periods of pessimism starting in the 40s
with neural networks in the
implementation of those neural networks
as a perceptron in the 50s with ideas of
backpropagation restricted Boltzmann
machines recurrent neural networks in
the 70s and 80s with convolutional
neural networks and the MNIST dataset
with datasets
beginning to percolate and LSTMs
bidirectional RNNs in the 90s and
the rebranding and the rebirth of neural
networks under the flag of deep learning
and deep belief nets in 2006 the birth
of ImageNet the dataset on which
the possibilities of what deep learning
can bring to the world has been first
illustrated in the recent years in 2009
and AlexNet the network that on ImageNet
performed exactly that with a few
ideas like dropout and improved neural
networks over time every year by year
improving the performance of neural
networks in 2014 the idea of GANs that
Yann LeCun called the most exciting
idea of the last 20 years the generative
adversarial networks the ability to with
very little supervision generate data to
generate ideas after forming
representations of those from the
understanding from the high-level
abstractions of what is extracted in the
data be able to generate new samples
create the idea of being able to create
as opposed to memorize is really
exciting and on the applied side in
2014 with DeepFace the ability to do
face recognition there's been a lot of
breakthroughs on the computer vision
front that being one of them the world
was inspired captivated in 2016 with
AlphaGo in '17 with AlphaZero beating
with less and less and less effort
the best players in the world
at Go the problem that for most of the
history of artificial intelligence was
thought to be unsolvable and new ideas
with capsule networks and this year
the year 2018 was the
year of natural language processing a
lot of interesting breakthroughs of
Google's BERT and others that we'll talk
about breakthroughs on ability to
understand language understand speech
and everything including generation
that's built all around that and there's
a parallel history of tooling starting
in the 60s of the perceptron and the
wiring diagrams there ending with this
year with PyTorch 1.0 and TensorFlow
2.0 these really solidified exciting
powerful ecosystems of tools that enable
you to do very to do a lot with very
little effort the sky is the limit
thanks to the tooling so let's then from
the big picture take it to the smallest
everything should be made as simple as
possible so let's start simple
with a little piece of code before we
jump into the details and a big run
through everything that is possible in
deep learning at the very basic level
with just a few lines of code really six
here six little pieces of code you can
train a neural network that understands
what's going on in an image the classic
that I will always love the MNIST
dataset
the handwritten digits where the input
to a neural network or machine learning
system is a picture of a handwritten
digit and the output is the number
that's in that digit it's as simple as
in the first step import the library
tensorflow
the second step import the dataset
MNIST third step like Lego bricks stack
on top of each other the neural
network layer by layer with a hidden
layer an input layer and output layer
step four train the model as simple as a
single line model.fit evaluate the model
in Step five on the testing data
and that's it in step six you're ready
to deploy you're ready to predict what's
in the image it's as simple as that and
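
The six steps described above might look roughly like the following sketch. It assumes TensorFlow 2.x with its bundled Keras API. The layer sizes, the optimizer, the loss, the epoch count, and the helper names build_model and train_and_predict are illustrative assumptions, not taken from the lecture slide.

```python
# Step 1: import the library.
import tensorflow as tf


def build_model():
    """Step 3: stack layers like Lego bricks (layer sizes are assumptions)."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),                   # one 28x28 grayscale image
        tf.keras.layers.Flatten(),                        # input layer: 784 values
        tf.keras.layers.Dense(128, activation="relu"),    # hidden layer
        tf.keras.layers.Dense(10, activation="softmax"),  # output: one score per digit
    ])


def train_and_predict(epochs=5):
    # Step 2: import the dataset (downloaded on first use).
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

    model = build_model()
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Step 4: train the model.
    model.fit(x_train, y_train, epochs=epochs)

    # Step 5: evaluate the model on the testing data.
    test_loss, test_accuracy = model.evaluate(x_test, y_test)

    # Step 6: predict which digit is in one test image.
    probabilities = model.predict(x_test[:1])
    return test_accuracy, int(probabilities.argmax())
```

Calling train_and_predict() downloads MNIST and trains for a few epochs, so it needs a network connection the first time it runs.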
much of this code obviously much more
complicated or much more elaborate and
rich and interesting and complex we'll
be making available on github on our
repository that accompanies these
courses today we'll release the first
tutorial on driver scene segmentation I
encourage everybody to go through it and
then on the tooling side in one slide
before we dive into the neural networks
and deep learning the tooling side
amongst many other things tensorflow is
a deep learning library an open source
library from google the most popular one
to date the most active with a large
ecosystem it's not just something you
import in Python to solve some basic
problems there's an entire ecosystem of
tooling there's different levels of
APIs much of what we'll do in this
course will be the highest level API
with Keras but there's also the ability
to run in the browser with TensorFlow.js on
the phone with TensorFlow Lite in the
cloud without any need to have a
computer hardware or anything any of the
libraries set up on your own machine you
can run all the code that we're
providing in the cloud with Google
Colab Colaboratory and the optimized
ASIC hardware that Google has optimized
for TensorFlow with their TPU tensor
processing unit ability to visualize with
TensorBoard models that are provided in
TensorFlow Hub and there's just this
is an entire ecosystem including most
importantly I think documentation and
blogs that make it extremely accessible
to understand the fundamentals of the
tooling that allow you to solve the
problems from natural language
processing to computer vision to GANs
generative adversarial neural networks and
everything in between with deep
reinforcement learning and so on so
that's why we're excited to sort of
work both in
theory in this course in this series of
lectures and in the tooling and
the applied side in TensorFlow it
really makes these
ideas exceptionally accessible so deep
learning at the core is the ability to
form higher and higher level of
abstractions of representations in data
and raw patterns
higher and higher levels of
understanding of patterns and those
representations are extremely important
and effective for being able to
interpret data under certain
representations data is trivial to
understand cat versus dog blue dot
versus green triangle under others it's
much more difficult in this in this task
drawing a line under polar coordinates
is trivial under Cartesian coordinates
is very difficult to well impossible to
do accurately and that's a trivial
example of a representation so our task
with deep learning with machine learning
in general is forming representations
that map the topology this the whatever
the topology the rich space of the
problem that you're trying to deal with
of the raw inputs map it in such a way
that the final representation is trivial
to work with trivial to classify trivial
to perform regression trivial to
generate new samples of that data and
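
As a rough illustration of the polar versus Cartesian point made above, here is a small sketch in plain Python. The two-ring dataset, the radii, and the helper names are hypothetical. No straight line separates two concentric rings in (x, y); as a simple stand-in for that, the code only checks thresholds on the raw x coordinate and compares them with one threshold on the radius.

```python
import math


def make_rings(points_per_ring=8):
    """Return (x, y, label) points on two concentric circles (hypothetical data)."""
    data = []
    for label, radius in ((0, 1.0), (1, 3.0)):
        for k in range(points_per_ring):
            angle = 2 * math.pi * k / points_per_ring
            data.append((radius * math.cos(angle), radius * math.sin(angle), label))
    return data


def to_polar(x, y):
    """Change of representation: Cartesian (x, y) to polar (r, theta)."""
    return math.hypot(x, y), math.atan2(y, x)


def threshold_accuracy(values, labels, threshold):
    """Accuracy of the rule 'predict class 1 if value > threshold'."""
    correct = sum((v > threshold) == bool(l) for v, l in zip(values, labels))
    return correct / len(labels)


data = make_rings()
labels = [label for _, _, label in data]
xs = [x for x, _, _ in data]
radii = [to_polar(x, y)[0] for x, y, _ in data]

# Best single threshold on the raw x coordinate vs. one threshold on the radius.
best_x_accuracy = max(threshold_accuracy(xs, labels, t) for t in xs)
polar_accuracy = threshold_accuracy(radii, labels, 2.0)
print(f"best threshold on x: {best_x_accuracy:.2f}, threshold on r: {polar_accuracy:.2f}")
```

The same data that cannot be split by a threshold on x becomes trivially separable once it is expressed by its radius.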
that representation of higher and higher
levels of representation is really the
dream of artificial intelligence that is
what understanding is making the complex
simple like Einstein back a few
slides ago said and Jürgen
Schmidhuber and whoever else said it I
don't know that's been the dream of
all of science in general the history
of science is the history of compression
progress of forming simpler
and simpler representations of ideas the
the models of the universe of our solar
system with the earth at the center of
it is much more complex to perform to do
physics on than
a model where the Sun is at the center
those higher and higher levels of
simple representations enable us to do
extremely powerful things that has been
the dream of science and the dream of
artificial intelligence and why deep
learning what is so special about deep
learning in the grander world of machine
learning and artificial intelligence
it's the ability to more and more remove
the input of human experts remove the
human from the picture the human costly
inefficient effort of human beings in
the picture deep learning automates much
of the extraction from the raw
gets us closer and closer to the raw
data without the need of human
involvement human expert involvement
ability to form representations from the
raw data as opposed to having a human
being need to extract features as was
done in the 80s and 90s in the early
aughts to extract features with which
then the machine learning algorithms can
work with the automated extraction of
features enables us to work with large
and larger datasets removing the human
completely except from the supervision
labeling step at the very end it doesn't
require the human expert but at the same
time there is limits to our technologies
there's always a balance between
excitement and disillusionment the
Gartner hype cycle as much as we don't
like to think about it applies to almost
every single technology of course the
magnitude of the peaks and the troughs is
different but I would say we are at the
peak of inflated expectation with deep
learning and that's something we have to
think about as we talk about some of the
ideas and exciting possibilities of the
future and with self-driving cars that
we'll talk about in future lectures in
this course it's the same in fact we're a
little bit beyond the peak and so it's
up to us this is MIT and the engineers
and the people working on this in the
world to carry us through the trough to
carry us through the future as the ups
and downs of the excitement
progress forward into the plateau of
productivity why else not deep learning
if we look at real world applications
especially with humanoid robotics
robotics manipulation and even yes
autonomous vehicles the majority of the
aspects of autonomous vehicles do not
involve to an extensive amount machine
learning today the problems are not
formulated as data driven learning
instead they're model-based optimization
methods that don't learn from data over
time and from the speakers
in these couple of weeks
we'll get to see how much machine
learning is starting to creep in but the
examples shown here with the
amazing humanoid robotics of Boston
Dynamics to date almost no machine
learning has been used except for
trivial perception the same with
autonomous vehicles almost no machine
learning deep learning has been used
except with perception some aspect of
enhanced perception from the visual
texture information plus what's becoming
what's starting to be used a little bit
more is the use of recurrent neural
networks to predict the future to
predict the the intent of the different
players in the scene in order to
anticipate what the future is but these
are very early steps most of the success
that you see today the 10 million miles
Waymo has achieved has been attributed mostly to
non machine learning methods why
else not deep learning
here's a really clean example of
unintended consequences of ethical
issues we have to really think about
when an algorithm learns from data based
on an objective function a loss function
the power the consequences of an
algorithm that optimizes that function
is not always obvious here's an example
of a human player playing the game of
CoastRunners it's a boat
racing game where the task is to go
around the racetrack and try to win the
race and the objective is to get as many
points as possible there are three ways
to get points the finishing time how
long it took you to finish the finishing
position where you were in ranking and
picking up quote-unquote turbos those
little green things along the way they
give you points okay simple enough so we
design an agent in this case an RL agent
that optimizes for the rewards and what
we find on the right here
the agent discovers that the optimal
actually has nothing to do with
finishing the race or the ranking that
you can get much more points by just
focusing on the turbos and collecting
those those little green dots because
they regenerate so you go in circles
over and over and over slamming into the
wall collecting the green turbos now
that's a very clear example of a
well-reasoned formulated objective
function that has totally unexpected
consequences at least without sort of
considering those
consequences ahead of time and so that
shows the need for AI safety for a human
in the loop of machine learning that's
why not deep learning exclusively the
challenge of deep learning algorithms of
deep learning applied is to ask the
right question and understand what the
answers mean
you have to take a step back and look at
the difference the distinction the
levels degrees of what the algorithm is
accomplishing for example image
classification is not necessarily scene
understanding in fact it's very far from
scene understanding classification may
be very far from understanding and the
datasets can vary drastically across the
different benchmarks in the datasets
used the professionally done photographs
versus synthetically generated images
versus real world data and the real
world data is where the big impact is so
often times that one doesn't transfer to
the other that's the challenge of deep
learning solving all of these problems
of different lighting variations pose
variation inter-class variation all the
things that we take for granted as human
beings with our incredible perception
system all have to be solved in order to
gain greater and greater understanding
of a scene and all the other things we
have to close the gap on that we're not
even close to yet here's an image from
Andrej Karpathy's blog from a
few years ago of former President
Obama stepping on a scale we can
classify we can do semantic segmentation
of the scene we could do object
detection we can do a little bit of 3d
reconstruction from a video version of
the scene but what we can't do well is
all the things we take for granted we
can't tell the images in the mirrors
versus in reality as different we can't
deal with the sparsity of information
just a few pixels on President Obama's
face we can still identify him as the
president the 3d structure of the scene
that there's a foot on top of a scale
that there's human beings behind
from a single image things we can
trivially do using all the common-sense
semantic knowledge that we have cannot
do the physics of the scene that there's
gravity and the biggest thing the
hardest thing is what's on people's
minds and what's on people's minds about
what's on other people's minds
and so on mental models of the world
being able to infer what people are
thinking about be able to infer there's
been a lot of exciting work here at MIT
about what people are looking at but
we're not even close to solving that
problem either but what they're thinking
about we're not even we haven't even
begun to really think about that problem
and we do trivially as human beings and
I think at the core of that I think I'm
harping on the visual perception
problem because it's one we take really
for granted as human beings especially
when trying to solve real world problems
especially when trying to solve
autonomous driving is we have 540
million years of data for visual
perception so we take it for granted we
don't realize how difficult it is and we
kind of focus all our attention on this
recent development of a hundred thousand
years of abstract thought being able to
play chess being able to reason but the
visual perception is nevertheless
extremely difficult at every
single layer of what's required to
perceive interpret and understand the
fundamentals of a scene in a trivial way
to show that is just all the ways you
can mess with these image classification
systems by adding a little bit of noise
the last few years there's been a lot of
papers a lot of work to show that you
can mess with these systems by adding
noise here with 99% accuracy predicted
dog add a little bit of distortion you
immediately the system predicts with 99%
accuracy that's an ostrich and you can
do that kind of manipulation with just a
single pixel so the that's just a clean
way to show the gap between image
classification on an artificial
dataset like ImageNet and real world
perception that has to be solved
especially for life critical situations
like autonomous driving I really like
this Max Tegmark visualization of this
rising sea of the landscape of
human competence from Hans Moravec
and this is the
difference as we progress forward and we
discussed some of these machine learning
methods is there is the human
intelligence the general human
intelligence let's call it Einstein here
that's able to generalize over all kinds
of problems over all kinds of from the
common sense to the incredibly complex
and then there is the way we've been
doing especially data-driven machine
learning which is savants which is
specialized intelligence extremely smart
at a particular task but not being able
to transfer except in the very narrow
neighborhood on this little landscape of
different of art cinematography book
writing at the peaks and chess
arithmetic and theorem proving and
vision at the at the bottom in the lake
and there's this rising sea as we solve
problem after problem the question is can
the methodology and the approach of
deep learning of everything we're doing
now keep the sea rising or do
fundamental breakthroughs have to happen
in order to generalize and solve these
problems and so from the specialized
where the successes are the systems
essentially boil down to given the
dataset and given the ground truth for
that data set here's the apartment cost
in the Boston area be able to input
several parameters and based on those
parameters predict the apartment cost
that's the basic premise approach behind
the successes successful supervised deep
learning systems today if you have good
enough data there's good enough ground
truth and can be formalized we can solve
it
some of the recent promise that we will
do an entire series of lectures in the
third week on deep reinforcement
learning showed that from raw sensory
information with very little annotation
through self-play where systems learn
without human supervision they are able to
perform extremely well in these
constrained contexts the question
of a videogame here pong two pixels
being able to perceive the raw pixels of
this pong game as raw input and learn
the fundamental quote/unquote physics of
this game understand how it is this game
behaves and how to be able to win this
game that's kind of a step toward
general purpose artificial intelligence
but it is a very small step because it's
in a simulated very trivial situation
that's the challenge that's before us
with less and less human supervision be
able to solve huge real-world problems
from the top supervised learning where
majority of the teaching is done by
human beings throughout the annotation
process through labeling all the data by
showing different examples and further
and further down to semi-supervised
learning reinforcement learning and
unsupervised learning removing the teacher
from the picture and making that teacher
extremely efficient when it is needed of
course data augmentation is one way as
we'll talk about so taking a small
number of examples and messing with that
set of examples augmenting that set of
examples through trivial and through
complex methods of cropping stretching
shifting and so on including through
generative networks modifying those
images to grow a small data set into a
large one to minimize to decrease
further and further the
input of the human teacher
but still that's quite far away from the
incredibly efficient both teaching and
learning that humans do this is a video
and there's many of them online of the
first time a human baby walking
we learn to do this you know it's one
shot learning one day you're on all
fours and the next day you put your two
hands up
and then you figure out the rest one
shot well you can kind of ish you can
kind of play around with it but the
point is you're extremely efficient with
only a few examples are able to learn
the fundamental aspect of how to solve a
particular problem machines in most
cases need thousands millions and
sometimes more examples depending on the
life critical nature of the application
the data flow of supervised learning
systems is there's input data there's a
learning system and there is output now
in the training stage for the output we
have the ground truth and so we use that
ground truth to teach the system in the
testing stage when it goes out into the
wild there's new input data over which
we have to generalize with the learning
system we have to make our best guess
in the training stage the process
with neural networks is given the input
data for which we have the ground truth
pass it through the model you get the
prediction and given that we have the
ground truth we can compare the
prediction to the ground truth look at
the error and based on that error adjust
the weights the types of predictions we
can make is regression and
classification regression is
continuous and classification is
categorical here
if we look at weather the regression
problem says what is the temperature
going to be tomorrow and the
classification formulation of that
problem
says is it going to be hot or cold or
some threshold definition of what hot or
cold is that's regression and
classification now on the classification
front it can be multi-class which is
the standard formulation where a
particular entity can only be one
thing and then there's multi-label where a
particular entity can be multiple things
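
The distinctions above can be made concrete with a small sketch in plain Python. The weights, the 20 degree threshold, the class names, and the scores are all hypothetical values chosen for illustration.

```python
import math


def regress_temperature(features, weights, bias):
    """Regression: the output is a continuous number (degrees)."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def classify_hot_or_cold(temperature, threshold=20.0):
    """Classification: the output is a category, here defined by a threshold."""
    return "hot" if temperature >= threshold else "cold"


def softmax(scores):
    """Turn raw scores into probabilities that sum to one."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def multi_class(scores, class_names):
    """Exactly one label: pick the class with the highest softmax probability."""
    probabilities = softmax(scores)
    return class_names[probabilities.index(max(probabilities))]


def multi_label(scores, class_names, threshold=0.5):
    """Any number of labels: each class gets its own independent yes/no."""
    return [name for name, s in zip(class_names, scores) if sigmoid(s) > threshold]


temperature = regress_temperature([18.0, 0.5], weights=[1.0, 4.0], bias=1.0)
print(temperature, classify_hot_or_cold(temperature))

names = ["cat", "dog", "outdoor"]
scores = [2.0, -1.0, 0.5]
print(multi_class(scores, names))
print(multi_label(scores, names))
```

Multi-class forces the scores to compete for a single answer, while multi-label judges every class on its own.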
and overall the input to the system can
be not just a single sample of the
particular dataset and the output doesn't
have to be a particular sample of the
ground truth dataset it can be a
sequence sequence to sequence a single
sample to a sequence a sequence to a
sample and so on from video captioning
to translation
to natural language generation to of
course the one-to-one in
general computer vision okay that's the
bigger picture let's step back from the
big to the small to a single neuron
inspired by our own brain the biological
neural networks in our brain and the
computational block that is behind a lot
of the intelligence in our mind
the artificial neuron has inputs with
weights on them plus a bias an
activation function and an output it's
inspired by this thing as I showed
before here visualized is the thalamocortical
system with three million
neurons and 476 million synapses the
full brain has a hundred billion
neurons and a thousand trillion synapses
ResNet and some of the other
state-of-the-art networks have in the
tens hundreds of millions of edges of
synapses the human brain has ten million
times more synapses than artificial
neural networks and there's other
differences the the topology is
asynchronous and not constructed in
layers the learning algorithm for
artificial neural networks is back
propagation for our biological networks
we don't know that's one of the
mysteries of the human brain there's
ideas but we really don't know the power
consumption human brains are much more
efficient than neural networks that's
one of the problems that we're trying to
solve and Asics are starting to begin to
solve some of these problems and the
stages of learning in the biological
neural networks you really never stop
learning
you're always learning always changing
both on the hardware and the software in
artificial neural networks often times
there's a training stage there's a
distinct training stage and there's a
distinct testing stage when you release
the thing in the wild online learning is
an exceptionally difficult thing that
we're still in the very early
stages of this neuron takes a few inputs
the fundamental computational block
behind neural networks takes a few
inputs applies weights which are the
parameters that are learned sums them up
puts it into a nonlinear activation
function after adding the bias also
a learned parameter and gives an output
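
The computation just described fits in a few lines of plain Python. This is a minimal sketch: the sigmoid activation and the specific input, weight, and bias values are assumptions picked for illustration.

```python
import math


def sigmoid(z):
    """A common nonlinear activation that squashes any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs, plus the bias, through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


# Weights and bias are the learned parameters; these values are made up.
output = neuron(inputs=[0.5, -1.0, 2.0], weights=[0.8, 0.2, -0.5], bias=0.1)
print(f"neuron output: {output:.3f}")
```

An output close to 1 corresponds to the neuron getting excited by its inputs, and an output close to 0 to it staying quiet.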
and the task of this neuron is to get
excited based on certain aspects of the
layers features inputs that came
before and in that ability to
discriminate get excited by certain
things and get not excited about other
things hold a little piece of
information of whatever level of
abstraction it is so when you combine
many of them together you have knowledge
different levels of abstractions form a
knowledge base that's able to represent
understand or even act on a particular
set of raw inputs and you stack these
neurons together in layers both in width
and depth increasing further on and
there's a lot of different architectural
variants but they begin at this basic
fact that with just a single hidden
layer of a neural network the
possibilities are endless
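
One way to get a feel for the single hidden layer claim is the sketch below, in plain Python. It is not a proof, and the result it illustrates is usually stated for continuous functions on a bounded interval with enough hidden units. The weights here are set by hand rather than learned: each ReLU hidden unit adds one bend, so the network traces a piecewise linear approximation of the target. The target (sine on 0 to pi) and all helper names are assumptions.

```python
import math


def relu(z):
    return max(0.0, z)


def build_one_hidden_layer(f, lo, hi, hidden_units):
    """Hand-set weights so the network linearly interpolates f at evenly spaced knots."""
    step = (hi - lo) / hidden_units
    knots = [lo + k * step for k in range(hidden_units)]
    slopes = [(f(knot + step) - f(knot)) / step for knot in knots]
    # The output weight of unit k is the change in slope at knot k.
    out_weights = [slopes[0]] + [slopes[k] - slopes[k - 1]
                                 for k in range(1, hidden_units)]
    out_bias = f(lo)
    return knots, out_weights, out_bias


def forward(x, knots, out_weights, out_bias):
    # Hidden layer: each unit has input weight 1 and bias -knot.
    hidden = [relu(x - knot) for knot in knots]
    # Output layer: weighted sum of the hidden activations plus a bias.
    return sum(w * h for w, h in zip(out_weights, hidden)) + out_bias


def max_error(f, lo, hi, hidden_units, samples=1000):
    """Largest gap between the network and f on a grid of sample points."""
    network = build_one_hidden_layer(f, lo, hi, hidden_units)
    xs = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(forward(x, *network) - f(x)) for x in xs)


for units in (4, 16, 64):
    error = max_error(math.sin, 0.0, math.pi, units)
    print(f"{units:2d} hidden units: max error {error:.5f}")
```

Adding hidden units shrinks the error, which is the sense in which one wide enough layer can get arbitrarily close.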
it can approximate any arbitrary
function a neural network with a
single hidden layer can approximate any
function that means any other neural
network with multiple layers and so on
is just interesting optimizations of how
we can discover those functions the
possibilities are endless and the other
aspect here is the mathematical
underpinnings of neural networks with
the weights and the differentiable
activation
are such that in a few steps from the
inputs to the outputs they are deeply
parallelizable and that's why the other
aspect on the compute the
parallelizability of neural networks is what
enables some of the exciting
advancements on the graphics processing
unit the GPUs and with ASICs TPUs the
ability to run across machines
across GPU units in the very large
distributed scale to be able to train
and perform inference on neural networks
activation functions these activation
functions put together are tasked with
optimizing a loss function for
regression that loss function is mean
squared error usually there's a lot of
variants and for classification it's
cross-entropy loss in the cross entropy
loss the ground truth is 0 or 1 in the mean
squared error it's a real
number and so with the loss function and
the weights and the bias and the
activation functions propagating
forward through the network from the input to
the output using the loss function we
use the algorithm of backpropagation
which I did an entire lecture on last time
to adjust the weights to have the error
flow backwards through the network and adjust
the weights such that once again the
weights that were responsible for
producing the correct output are
increased and the weights that were
responsible for producing the incorrect
output are decreased the forward pass
gives you the error the backward pass
computes the gradients and based on the
gradients the optimization algorithm
combined with a learning rate adjusts
the weights the learning rate
is how fast the network learns and all
of this is
possible on the numerical computation
side with automatic differentiation the
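
The two loss functions and the forward pass, backward pass, and update cycle described above can be sketched in plain Python for the smallest possible model, a single linear neuron. The gradients here are derived by hand, whereas the libraries obtain them with automatic differentiation. The toy data, the learning rate of 0.05, and the 500 epochs are assumptions.

```python
import math


def mean_squared_error(predictions, targets):
    """Regression loss: the targets are real numbers."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def cross_entropy(probabilities, targets):
    """Binary classification loss: targets are 0 or 1, predictions are probabilities."""
    eps = 1e-12  # avoids log(0)
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(probabilities, targets)) / len(targets)


def train_linear_neuron(xs, ys, learning_rate=0.05, epochs=500):
    """Gradient descent on y = w * x + b with mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Forward pass: compute predictions and their errors.
        predictions = [w * x + b for x in xs]
        errors = [p - y for p, y in zip(predictions, ys)]
        # Backward pass: gradients of the loss with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Update: step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
w, b = train_linear_neuron(xs, ys)
final_loss = mean_squared_error([w * x + b for x in xs], ys)
print(f"w={w:.3f} b={b:.3f} mse={final_loss:.6f}")
print(f"cross-entropy example: {cross_entropy([0.9, 0.2], [1, 0]):.3f}")
```

A larger learning rate takes bigger steps and learns faster up to a point, beyond which the updates overshoot and the loss grows instead of shrinking.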
optimization problem given those
gradients that are computed in the
backward flow through the network of the
gradients is stochastic gradient
descent there's a lot of variants of
these optimization algorithms that solve
various problems from dying
ReLUs to vanishing gradients there's a lot
of different parameters like momentum and
so on that really just boil down to
all the different problems that are
solved with non linear optimization
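
The forward pass, backward pass, and weight update described above can be sketched in a few lines. This is a minimal illustration and not the lecture's code: it assumes a one-weight linear model, mean squared error, and plain full-batch gradient descent rather than the stochastic variant. The names `train`, `mse`, `xs`, `ys`, and `learning_rate` are made up for this example, and the gradients are written out by hand where a library such as TensorFlow would use automatic differentiation.

```python
def train(xs, ys, learning_rate=0.05, epochs=1000):
    """Fit y = w * x + b with full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Forward pass: prediction error for each sample.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Backward pass: gradients of the mean squared error w.r.t. w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Update: step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the model y = w * x + b on the given data."""
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data generated from y = 2x + 1 (an assumption made for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = train(xs, ys)
```

On this toy data `w` and `b` approach 2 and 1. A much larger learning rate would make the updates overshoot and diverge, which is one reason the learning rate is treated as a key parameter.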
mini-batch size what is the right size
of a batch or really it's called mini
batch when it's not the entire data set
based on which you compute the
gradients to adjust the learning do you do
it over a very large amount or do you do
it with stochastic gradient descent
for every single sample of the data if
you listen to Yann LeCun and a lot of
recent literature small mini batch
sizes are good
he says training with large mini batches
is bad for your health more importantly
it's bad for your test error friends don't
let friends use mini batches larger than
32 larger batch size means more
computational speed because you don't have
to update the weights as often but
smaller batch size empirically produces
better generalization the problem we're
often on the broader scale of learning
trying to solve is overfitting and the
way we solve it is through regularization
we want to train on a data set without
memorizing to an extent that you only do
well on that training dataset so you want
it to be generalizable into the future
to the things that you
haven't seen yet so obviously this is a
problem for small datasets and also for
sets of parameters that you choose here
shown is an example of a
curve trying to fit particular data
versus a ninth degree polynomial trying to
fit a particular set of data with the
blue dots the ninth degree polynomial is
overfitting it does very well for that
particular set of samples but does not
generalize well in the general case and
the trade-off here is as you train
further and further at a certain point
there's a deviation between the
error being decreased to zero on the
training set and going to one on the
test set and that's the balance we have
to strike that's done with the
validation set so you take a piece of
the training set for which you have the
ground truth and you call it the
validation set and you set it aside and
you evaluate the performance of your
system on that validation set and after
you notice that your trained network is
performing poorly on the validation set
for a prolonged period of time that's when
you stop that's early stopping basically
it's getting better and better and better
and then there's some period of time
there's always noise of course and after
some period of time it's definitely
getting worse and that's when we need to
stop so that provides an automated way
to discover when you need to stop and
there's a lot of other regularization
methodologies of course as I mentioned
dropout is a very interesting approach
along with its variants simply with a
certain probability you randomly
remove nodes in the network along with
their incoming and outgoing edges
randomly throughout the training process
and then there's normalization
normalization is obviously always
applied at the input so whenever you
have a data set with different lighting
conditions different variations
different sources and so on you have to
put it all on the same level
ground so that we're learning the
fundamental aspects of the input data as
opposed to
some less relevant semantic
information like lighting variations and
so on so we usually always normalize for
example if it's computer vision with
pixels from 0 to 255 you always
normalize to 0 to 1 or negative 1 to 1
or normalize based on the mean and the
standard deviation that's something you
should almost always do the thing that
enabled a lot of breakthrough
performances in the past few years is
batch normalization it's performing this
same kind of normalization later on in
the network looking at the inputs to the
hidden layers and normalizing based on
the batch of data on which you're
training normalizing based on the mean
and the standard deviation that's batch
normalization batch renormalization
fixes a few of the challenges which is
given that you're normalizing during the
training on the mini batches in the
training data set that doesn't directly
map to the inference stage the testing
and so by keeping a running
average across both training and
testing you're able to asymptotically
approach a global normalization so this
idea across all the weights not just the
inputs across all the weights you
normalize the world in all the
levels of abstraction that you're forming
and batch renormalization solves a lot of
these problems during inference and
there's a lot of other ideas from layer
to weight to instance normalization to
group normalization and you can play
with a lot of these ideas in the
tensorflow playground at
playground.tensorflow.org
that I highly recommend so now let's run
through a bunch of different ideas some
of which we'll cover in future lectures
of what is all of this in this world of
deep learning from computer vision to
deep reinforcement learning to the
different small level techniques to the
large natural language processing so
convolutional neural networks the thing
that enables image classification so
these convolutional filters slide over
the image and
are able to take advantage of the
spatial invariance of visual information
that a cat in the top left corner has the
same features as those associated with
cats in the top right corner and so on
images are just a set of numbers and our
task is to take that image and produce a
classification and use the
spatial invariance of visual
information to slide a
convolution filter across the image and
learn that filter as
opposed to assigning equal value to
features that are present in
various regions of the image and stacked
on top of each other these convolution
filters can form high-level abstractions
of visual information in images with
AlexNet as I've mentioned and the
ImageNet data set and challenge
captivating the world with what is
possible with neural networks these have
been further and further improved
surpassing human
performance with special note GoogLeNet
with the inception module there's
different ideas that came along ResNet
with the residual blocks and SENet most
recently so the object detection problem
is the next step in visual
recognition so image classification
is just taking the entire image and
saying what's in the image object
detection and localization is saying find
all the objects of interest in the scene
and classify them the region based
methods like shown here Faster R-CNN
take the image use a convolutional neural
network to extract features in that
image and generate region proposals
here's a bunch of candidates that you
should look at and within those
candidates it classifies what they are
and generates four parameters the
bounding box
that captures that thing so object
detection and localization ultimately
boils down to a bounding box a rectangle
with a class that's the most likely class
that's in that bounding box and you can
really summarize region based methods
as you generate the region proposals
here's a little pseudocode and you do a
for loop over the region proposals and
perform detection inside that for
loop the single shot methods remove the
for loop there's a single pass through
take for example
here shown SSD you take a pre-trained
neural network that's been trained to do
image classification stack a bunch of
convolutional layers on top and from each
layer extract features that are then
able to generate in a single pass
the bounding box
predictions and the classes associated
with those bounding boxes the trade-off
here and this is where the popular YOLO
v1 v2 and v3 come from the trade-off here
oftentimes is between performance and
accuracy so single-shot methods are
often less performant especially in
terms of accuracy on objects that are
really far away or rather objects that
are small in the image or really large
then
the next step up in visual perception
and visual understanding is semantic
segmentation that's what the tutorial
that we presented here on github is
covering semantic segmentation is the
task of now as opposed to
classifying the entire image or
detecting the object as a bounding box
assigning at a pixel level the
boundaries of what the object is
in full scene
segmentation classifying for every
single pixel which class that pixel
belongs to and the fundamental aspect
there which we'll cover a little bit or a
lot more on Wednesday is taking an image
classification network chopping it off
at some point which is
performing the encoding step of
compressing a representation of the
scene and taking that representation
with a decoder and upsampling in a dense
way
so taking that representation and
upsampling the pixel level classification
for that upsampling there's a lot of
tricks that we'll talk through that are
interesting but it ultimately boils down
to the encoding step of forming a
representation of what's going on in the
scene and then the decoding step that
upsamples to the pixel level annotation
the classification of all the individual
pixels and as I mentioned here
the underlying idea applied most
extensively most successfully in
computer vision is transfer learning
the most commonly applied way of transfer
learning is taking a pre-trained neural
network like ResNet and chopping it off
at some point
chopping off the fully connected
layers or some parts of
the layers and then taking
a new data set and retraining that
network so what is this useful for for
every single application of computer
vision in industry when you have a
specific application like you want to
build
a pedestrian detector and you have a
pedestrian data set it's useful to take
ResNet trained on ImageNet or COCO
trained on the general case of visual
perception and take that network
chop off some of the layers and then
retrain it on your specialized
pedestrian data set and depending on how
large that data set is some of the
previous layers from the
pre-trained network should
be fixed frozen and sometimes not
depending on how large the data is and
this is extremely effective in computer
vision but also in audio speech and NLP
and so as i mentioned with the pre
trained networks they are ultimately
forming representations of the data
based on which a classification or
regression prediction is made
but the cleanest example of this is the
autoencoder forming representations
in an unsupervised way
the input is an image and the output is
that exact same image so why do we do
that
well if you add a bottleneck in the
network where the network
is narrower in the middle than it
is on the inputs and the outputs it's
forced to compress the data down into a
meaningful representation that's what
the autoencoder does you're training it
to reproduce the input and reproduce it
with a latent representation that is
smaller than the original raw data and
that's a really powerful way to compress
the data it's used for removing noise
and so on but it's also just an effective
way to demonstrate a concept it can also
be used for embeddings where you have a
huge amount of data and you want to form
a compressed efficient representation of
that data now in practice this is
completely unsupervised in practice if
you want to form an efficient useful
representation of the data you want to
train it in a supervised way you want to
train it on a discriminative task where
you have labelled data and the network
is trained to identify cat versus dog
that network that's trained in the
discriminative way in an annotated
supervised learning way is able to form
better representations but nevertheless
the concept stands and one way to
visualize these concepts is a tool
that I really love
projector.tensorflow.org it's a way to
visualize these different
representations these different
embeddings that you
should definitely play with and you can
insert your own data okay going further
and further in this direction of
unsupervised learning and forming
representations is generative
adversarial networks from
these representations being able to
generate new data and the fundamental
methodology of GANs is to have two
networks one is the generator one is the
discriminator and they compete against
each other in order for the generator
to
get better and better and better at
generating realistic images the
generator's task is from noise to generate
images based on a certain representation
that are realistic and the discriminator
is the critic that has to
discriminate between real images and
those generated by the generator and
both get better together the generator
gets better and better at generating
real images to trick the discriminator
and the discriminator gets better and
better at telling the
difference between real and fake until
the generator is
able to generate some incredible things
so shown here in the work by NVIDIA
I mean the ability to generate
realistic faces has skyrocketed in the
past 3 years so these are
samples of celebrity photos that it has
been able to generate those are all
generated by a GAN there's the ability to
generate temporally consistent video
over time with GANs and then there's the
ability shown at the bottom right from
NVIDIA which I'm sure others
will talk about to go from pixel level
semantic segmentation so from
the semantic pixel segmentation on the
right being able to generate completely
the scene on the left all the
raw rich high-definition pixels on the
left in the natural language processing
world it's the same forming
representations forming embeddings with
word2vec the ability from words to form
representations that are efficiently able
to then be used to reason about the
words the whole idea of forming
representations of the data is taking
a huge you know vocabulary of over a
million words you want to be able to map
it into a space where words that are far
apart from each other
in the sense of Euclidean distance between
words are semantically far apart
from each other as well
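
To make the distance idea concrete, here is a minimal sketch that looks up the nearest word in a tiny embedding table by Euclidean distance. The 2-D vectors are hypothetical, hand-picked values (real embeddings are learned and have many more dimensions), and `euclidean` and `nearest` are illustrative helper names.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(word, embeddings):
    """Return the other word whose vector is closest to `word`."""
    others = (w for w in embeddings if w != word)
    return min(others, key=lambda w: euclidean(embeddings[word], embeddings[w]))


# Hypothetical 2-D embeddings, hand-picked for illustration (not learned).
embeddings = {
    "cat": [0.9, 0.8],
    "dog": [1.0, 0.7],
    "car": [-0.8, 0.1],
    "truck": [-0.9, 0.2],
}
closest_to_cat = nearest("cat", embeddings)
```

With these made-up vectors the nearest neighbor of "cat" is "dog", while "car" and "truck" sit together in a different region of the space.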
so things that are similar are together
in that space and one way of doing that
with skip grams for example is looking
at a source text and turning
a large body of text into a supervised
learning problem by learning to
predict from a particular
word all its neighbors so you train a
network on the connections that are
commonly seen in natural language and
based on those connections you're able to
know which words are related to each
other now the main thing here is and I
won't get into too many details but
the main thing here with the input
vector representing the words and the
output vector representing the
probability that those words are
connected to each other
is that both are thrown away in
the end the main thing is the middle the
hidden layer that
representation gives you the embedding
that represents these words in such a way
where in the Euclidean space the ones
that are close together are semantically
close together and the ones that are not
are semantically far apart and natural
language and other sequence data text
speech audio video relies on recurrent
neural networks recurrent neural networks
are able to learn temporal
dynamics in sequence data and
are able to generate sequence data the
challenge is that they're not able to
learn long-term context because
a recurrent neural network is trained by
unrolling and doing backpropagation
and without any tricks the backpropagated
gradient fades away very quickly
so you're not able to memorize the
context in a longer form of the
sentences unless there's extensions here
with LSTMs and GRUs long term
dependency is captured by allowing the
network to forget information and
allowing it to freely pass through
information in time so what to forget
what to remember and every time decide
what to output and all of those aspects
have gates that are all trainable with
sigmoid and tanh functions bidirectional
recurrent neural networks from the
90s are an extension often used for
providing context in both directions so
recurrent neural networks defined the
vanilla way are learning representations
for what happened in the past now in
many cases it's not a
real-time operation in that you're able
to also look into the future you look
into the data that follows later in the
sequence so it's beneficial to do a
forward pass through the network beyond
the current point and then back the
encoder
decoder architecture in recurrent neural
networks is used very much when the
sequence on the input and the sequence
on the output are not required to be of
the same length the task is to
first with the encoder network encode
everything in the
input sequence so this is useful for
machine translation for example so
encoding all the information in the input
sequence in English and then in the
language you're translating to given that
representation keep feeding it into the
decoder recurrent neural network to
generate the translation the input might
be much smaller or much larger than the
output that's the encoder decoder
architecture and then there's
improvements attention is the
improvement on this encoder decoder
architecture that allows you to as
opposed to taking the input sequence
forming a representation of it and
that's it it allows you to actually look
back at different parts of the input so
not just relying on the single
vector representation of the
entire input
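
The "look back at different parts of the input" idea can be sketched as a weighted average over the encoder states. This is a minimal illustration, assuming dot-product scoring followed by a softmax, which is one common choice and not necessarily the variant the lecture has in mind. The names `softmax` and `attend` and the toy vectors are assumptions made for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Return (weights, context) for one decoder step.

    Each encoder state is scored against the decoder query with a dot
    product, and the context is the weighted average of the encoder states.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Three hypothetical encoder states (one per input token) and one decoder query.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 2.0]
weights, context = attend(query, encoder_states)
```

Here the second and third states align with the query and receive equal, larger weights than the first, so the context vector is dominated by them instead of being a single fixed summary of the whole input.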
and a lot of excitement has been around
the idea as I mentioned some of the
dream of artificial intelligence and
machine learning in general has been to
remove the human more and more and more
from the picture to be able to
automate some of the difficult tasks so
AutoML from Google and just the general
concept of neural architecture search
NASNet the ability to automate the
discovery of parameters of a neural
network a

Summary


***

# Introduction to Deep Learning for Self-Driving Cars: History, Concepts, and the Future of AI (MIT 6.S094 Course)

### Executive Summary
This video is the introduction to the MIT 6.S094 course on applying *deep learning* to self-driving cars. It covers the historical evolution of artificial neural networks from the 1940s to the modern resurgence, the basic concept of extracting patterns from data, and a comparison between traditional and modern machine learning methods. The video also reviews the technical side of network architectures such as CNNs and RNNs, ethical challenges in AI, and the future of automation through AutoML.

### Key Takeaways
*   **Definition of Deep Learning:** A method for automatically extracting useful patterns from data with minimal human effort, through the optimization of neural networks.
*   **Drivers of Progress:** The surge in AI progress over the past decade has been driven by the availability of digital data, advances in hardware (GPU/TPU), community collaboration, and a mature *tooling* ecosystem.
*   **TensorFlow Ecosystem:** More than just a library, TensorFlow is a complete ecosystem that includes a high-level API (Keras), mobile support (Lite), web support (JS), and visualization (TensorBoard).
*   **Understanding Gap:** Although it can classify images, AI still has a large gap in understanding physical context and theory of mind, and in distinguishing reality from reflection (mirrors).
*   **Core Architectures:** CNNs are used for computer vision (spatial data), RNNs/LSTMs for sequential data (text/audio), and GANs for generating data.
*   **Ethical Challenges:** The *reward hacking* phenomenon shows how AI can exploit loopholes in the objective function to score points while ignoring the real goal.
*   **The Future:** The trend is toward *AutoML* and *Neural Architecture Search* (NAS), which automate model design and shift the researcher's role toward that of a data science engineer.

---

### Detailed Breakdown

#### 1. Introduction, History, and Philosophy of Deep Learning
*   **Course Context:** This lecture is part of the 2019 MIT course 6.S094, "Deep Learning for Self-Driving Cars". Resources such as videos, code, and a GitHub repository are available to students.
*   **Evolution of Neural Networks:**
    *   **1940s - 1950s:** Early ideas and the implementation of the *Perceptron*.
    *   **1970s - 1990s:** Development of *backpropagation*, RNNs, CNNs, and LSTMs.
    *   **2006 - 2012:** The modern "deep learning" era, the arrival of ImageNet, and the success of AlexNet.
    *   **2016 - 2019:** AlphaGo, AlphaZero, and the dominance of NLP (Google BERT).
*   **Philosophy:** AI is viewed as the fulfillment of humanity's ancient desire to "create gods". Deep learning sits at the core of understanding intelligence by imitating, in simplified form, how the brain's neurons work.

#### 2. The TensorFlow Ecosystem and the Concept of Abstraction
*   **Why TensorFlow?** It was chosen for its broad ecosystem, good documentation, and flexibility across applications (NLP, computer vision, RL).
*   **The Core of Deep Learning:** Forming high-level abstractions from raw data. This process is compared to "compression" in the history of science: turning a complex model into a simpler representation that is easier to understand (for example, the heliocentric versus the geocentric model of the solar system).
*   **Difference from Traditional ML:** Deep learning automates feature extraction from raw data, whereas traditional ML (1980s and 1990s) required human experts to extract features by hand.

#### 3. Limits of Visual Perception and Types of Learning
*   **Limitations of Vision AI:**
    *   AI can perform classification, semantic segmentation, and object detection.
    *   AI fails to understand physics (gravity), to tell a mirror from reality, and to understand intent (theory of mind).
    *   Classification systems are vulnerable to *noise* that can change the predicted result drastically.
*   **Spectrum of Supervision:**
    *   **Supervised Learning:** Requires labeled data (ground truth).
    *   **Reinforcement Learning (RL):** Learns from sensory input without annotation through *self-play*, but is currently limited to simple simulations.
    *   **Goal:** Reduce the dependence on human supervision (moving toward *unsupervised learning*) and imitate the efficiency of human learning (*one-shot learning*).

#### 4. How Neural Networks Work and the Training Process
*   **Brain vs. Artificial Networks:** The human brain has 100 billion neurons and 1,000 trillion synapses, which makes it far more complex and power-efficient than ANNs. The brain learns continuously (*online learning*), while ANNs have separate training and testing stages.
*   **Neuron Structure:** Input -> Weights + Bias -> Non-linear Activation Function -> Output.
*   **Training Process:**
    *   **Forward Pass:** Computes the prediction from the input.
    *   **Loss Function:** Measures the error (MSE for regression, cross-entropy for classification).
    *   **Backpropagation:** Propagates the error backward to compute the gradients.
    *   **Optimization:** Adjusts the weights using *gradient descent*.
*   **Optimization Tips:** Small *mini-batch* sizes (no more than 32 is recommended) are better for test error than large batches. Techniques such as *dropout*, *normalization*, and *early stopping* are used to prevent *overfitting*.
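
Early stopping, as listed above, can be sketched as a patience rule over validation losses. This is a minimal illustration under stated assumptions: the `patience` parameter is a hypothetical way to make "a prolonged period of time" concrete, the function name `early_stopping_epoch` is made up, and the loss values are invented for the example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose weights should be kept.

    Training is considered stopped once the validation loss has failed to
    improve on its best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stopped improving for a while
    return best_epoch


# Hypothetical validation losses: improving, a little noise, then clearly worse.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.49, 0.53, 0.56, 0.60, 0.40]
best = early_stopping_epoch(val_losses, patience=3)
```

The single noisy epoch at index 4 does not trigger a stop, but three worse epochs in a row after index 5 do, so later values are never examined. A larger `patience` tolerates more noise at the cost of more wasted training.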

#### 5. Computer Vision: CNNs, Object Detection, and GANs
*   **Convolutional Neural Networks (CNNs):** Use filters that slide over the image to take advantage of spatial invariance. Popular architectures include AlexNet, GoogLeNet (Inception), ResNet, and SENet.
*   **Object Detection:**
    *   *Region-based (Faster R-CNN):* Accurate but slower (uses a loop over region proposals).
    *   *Single-shot (YOLO, SSD):* Fast (a single pass), but sometimes less accurate on very small or very large objects.
*   **Semantic Segmentation:** Pixel-level classification using an encoder-decoder architecture.
*   **Transfer Learning:** Takes a pre-trained network and retrains selected layers on a new, specific dataset.
*   **GANs (Generative Adversarial Networks):** Two networks (a generator and a discriminator) that compete with each other. The generator tries to produce realistic fake data, while the discriminator tries to tell whether the data is real.
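
The sliding-filter idea behind CNNs can be sketched in plain Python. This is a minimal illustration, assuming a single-channel image, stride 1, and no padding; as in most deep learning libraries the filter is applied without flipping. The function name `conv2d`, the toy image, and the edge filter are hypothetical values chosen for this example, and in a real CNN the filter weights are learned during training and not set by hand.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # The same weights are applied at every position of the image.
            total = sum(image[r + i][c + j] * kernel[i][j]
                        for i in range(kh) for j in range(kw))
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 filter that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, kernel)
```

The feature map is nonzero only in the middle column, where the filter sits on the edge. Because the same filter is reused at every position, the edge would be detected wherever it appeared in the image.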

#### 6. NLP, RNNs, and the Future of AutoML
*   **NLP & Embeddings:** Words are mapped to vectors in which Euclidean distance represents semantic closeness (Word2Vec).
*   **RNNs & LSTMs:** Used for sequential data. LSTMs use *gates* to address the *vanishing gradient* problem and to remember long-term context. A *bidirectional RNN* provides context from both the future and the past.
*   **Attention Mechanism:** An improvement on the encoder-decoder architecture that lets the model look back at the relevant parts of the input instead of relying only on a single context vector.
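
For the word embeddings in this section, the lecture describes skip-grams: turning raw text into a supervised problem by predicting a word's neighbors. The pair generation step can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and a symmetric window; `skipgram_pairs` and `window` are illustrative names, and a full word2vec pipeline involves further steps (such as negative sampling and the network itself) that are not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) training pairs for every word and its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbor
                pairs.append((center, tokens[j]))
    return pairs


# Toy sentence used only for illustration.
tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, window=1)
```

Each pair is one supervised example: the first word is the input and the second is the target the network learns to predict. After training, the hidden layer's weights are kept as the embedding.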

## Conclusion & Closing Message
This course gives a broad overview of the evolution and application of deep learning, particularly for self-driving cars, from its history to modern architectures such as CNNs and RNNs. Although rapid progress has been made through the TensorFlow ecosystem and large datasets, ethical challenges and the understanding of physical context remain the main obstacles to overcome. The future of AI will be shaped by the development of AutoML, which automates model design and enables greater efficiency in building AI technology.
