Transcript

g-sndkf7mCs • Deep Learning for Speech Recognition (Adam Coates, Baidu)
Kind: captions
Language: en
so I want to tell you guys about speech
recognition and deep learning
I think deep learning has been playing
an increasingly large role in speech
recognition and one of the things I
think is most exciting about this field
is that speech recognition is at a place
right now where it's becoming good
enough to enable really exciting
applications that end up in the hands of
users so for example if we want to
caption video content and make it
accessible to everyone it used to be
that we would sort of try to do this but
you still need a human to get really
good captioning for something like a
lecture but it's possible that we can do
a lot of this with higher quality in the
future with deep learning we can do
things like hands-free interfaces in
cars make it safer to use technology
while we're on the go keep people's eyes
on the road of course and make mobile
devices home devices much easier much
more efficient and enjoyable to use but
another actually sort of fun recent
study that some folks at Baidu
participated in along with Stanford and
UW is to show that for even something
straightforward that we sort of take
for granted as an application of speech
which is just texting someone with voice
or writing a piece of text the study
showed you can actually go three times
faster with voice recognition systems
that are available today so it's not
just like a little bit faster now even
with the errors that a speech
recognition system can make it's
actually a lot faster and the reason I
wanted to highlight this result which is
pretty recent is that the speech engine
that was used for this study is actually
powered by a lot of the deep learning
methods that I'm going to tell you about
so hopefully when you walk away today
you have an appreciation or an
understanding of the sort of high-level
ideas that make a result like this
possible so there are a whole bunch of
different components that make up a
complete speech application so for
example there's speech transcription so
if I just talk I want to come up with
words that represent you know whatever I
just said there's also other tasks
though like word spotting or triggering
so for example if my phone is sitting
over there and I want to say
hey phone go do something for me
it actually has to be listening
continuously for me to say that word and
likewise there are things like speaker
identification or verification so that
if I want to authenticate myself or I
want to be able to tell apart different
users in a room I've got to be able to
recognize your voice even though I don't
know what you're saying
so these are different tasks I'm not
going to cover all of them today instead
I'm going to just focus on the bread and
butter of speech recognition we're going
to focus on building a speech engine
that can accurately transcribe audio
into words so that's our main goal this
is a very basic goal of artificial
intelligence right historically people
are very very good at listening to
someone talk just like you guys are
listening to me right now and you can
very quickly turn words turn audio into
words and into meaning on your own
almost effortlessly and for machines
this has historically been incredibly
hard so you can think of this as one of
those sort of consummate AI tasks so the
goal of building a speech pipeline is if
you just give me a raw audio wave like
you recorded on your laptop or your cell
phone I want to somehow build a speech
recognizer that can do this very simple
task of printing out hello world when I
actually say hello world so before I dig
into the deep learning part I want to
step back a little bit and spend maybe
ten minutes talking about how a
traditional speech recognition pipeline
is working for two reasons if you're out
in the wild you're doing an internship
you're trying to build a speech
recognition system with a lot of the
tools that are out there you're going to
bump into a lot of systems that are
built on technologies that look like
this so I want you to understand a
little bit of the vocabulary and how
those things are put together and also
this will sort of give you a story for
what deep learning is doing in speech
recognition today that is kind of
special and that I think paves the way
for much bigger results in the
future so traditional systems break the
problem
of converting an audio wave of taking
audio and turning it into a
transcription into a bunch of different
pieces so I'm going to start out with my
raw audio and I'm just going to
represent that by X and then usually we
have to decide on some kind of feature
representation we have to convert this
into some other form that's easier to
deal with than a raw audio wave and in a
traditional speech system I often have
something called an acoustic model and
the job of the acoustic model is to
learn the relationship between these
features that represent my audio and the
words that someone is trying to say and
then I'll often have a language model
which encapsulates all of my knowledge
about what kinds of words what spellings
and what combinations of words are most
likely in the language that I'm trying
to transcribe and once you have all of
these pieces so these might be these
different models might be driven by
machine learning themselves what you
would need to build in a traditional
system is something called a decoder and
the job of a decoder which itself might
involve some modeling efforts and
machine learning algorithms is to find
the sequence of words W that maximizes
this probability the probability of the
particular sequence W given your audio
that's straightforward but that's
equivalent to maximizing the product of
the contributions from your acoustic
model and from your language model so a
traditional speech system is broken down
into these pieces and a lot of the
effort in getting that system to work
is in developing this sort of portion
that combines them all so it turns out
that if you want to just directly
transcribe audio you can't just go
straight to characters and the reason is
and it's especially apparent in English
that the way something is spelled in
characters doesn't always correspond
well to the way that it sounds so if
I give you the word night for example
without context you don't really know
whether I'm talking about like a knight
in armor or whether I'm talking about
night like an evening
and so a way to get around this to
abstract this problem away from a
traditional system is to replace this
with a sort of intermediate
representation instead of trying to
predict characters I'll just try to
predict something called phonemes so as
an example if I want to represent the
word hello what I might try to do is
break it down into these units of sound
so the first one is like that H
sound in hello and then an a sound which
is actually only one possible
pronunciation of an e and then an L and
an O sound and that would be my string
that I try to come up with using all of
my different speech components so this
in one sense makes the modeling problem
easier my acoustic model and so on can
be simpler because I don't have to worry
about spelling but it does have this
problem that I have to think about where
these things come from
so these phonemes are intuitively
they're the perceptually distinct units
of sound that we can use to distinguish
words and they're very approximate this
might be our imagination that these
things actually exist it's not clear how
fundamental this is but they're sort of
standardized there are a bunch of
different conventions for how to define
these and if you end up
working on a system that uses phonemes
one popular data set is called TIMIT and
so this actually has a corpus of audio
frames with examples of each of these
phonemes so once you have this phoneme
representation unfortunately it adds
even more complexity to this traditional
pipeline because now my acoustic model
doesn't associate this audio feature
with words it actually associates them
with another kind of transcription with
the transcription into phonemes and so I
have to introduce yet another component
into my pipeline that tries to
understand how do I convert the
transcriptions in phonemes into actual
spellings and so I need some kind of
dictionary or a lexicon to tell me all of that so
this is a way of taking our knowledge
about a language and baking it into this
engineered pipeline and then once you've
got all that again all of your work now
goes into this decoder that has a
slightly more complicated task in order
to infer the most likely word
transcription given the audio so this is
a tried and true pipeline it's been
around for a long time you'll see a
whole bunch of these systems out there
and we're still using a lot of the
vocabulary from these systems but
traditionally the big advantage is that
it's very tweakable if you want to go
add a new pronunciation for a word
you've never heard before you can just
drop it right in that's great
but it's also really hard to get working
well if you start from scratch with this
system and you have no experience in
speech recognition it's actually quite
confusing and hard to debug it's very
difficult to know which of these various
models is the one that's behind your
error and especially once we start
dealing with things like accents heavy
noise different kinds of ambiguity that
makes the problem even harder to
engineer around because trying to think
ourselves about how do I tweak my
pronunciation model for example to
account for someone's accent that I
haven't heard that's a very hard
engineering judgment for us to make so
there are all kinds of design decisions
that go into this pipeline like choosing
the feature representation for example so
the first place that deep learning has
started to make an impact in speech
recognition starting a few years ago is
to just take one of the core machine
learning components of the system and
replace it with a deep learning
algorithm so I mentioned back in this
previous pipeline that we had this
little model here whose job is to learn
the relationship between a sequence of
phonemes and the audio that we're
hearing so this is called the acoustic
model and there are lots of different
methods for training this thing so take
your favorite machine learning algorithm
you can probably find someone who has
trained an acoustic model with that
algorithm whether it's a Gaussian
mixture model or a bunch of decision
trees and random forests anything for
estimating these kinds of densities
there's a lot of work in trying to make
better acoustic models so some work by
George Dahl and co-authors took what was
a state of the art deep learning system
back in 2011 which is a deep belief
network with some pre-training
strategies and dropped it into a state
of the art pipeline in place of this
acoustic model and the results are
actually pretty striking because even
though we had neural networks in these
pipelines for a while what ended up
happening is that when you replace the
Gaussian mixture model in an HMM system
that already existed with this deep
belief network as an acoustic model you
actually got something between like a
ten and twenty percent relative
improvement in accuracy which is a huge
jump this is highly noticeable to a
person and if you compare this to the
amount of progress that had been made in
preceding years this is a giant leap for
a single paper to make compared to the
progress we'd been able to make
previously so this is in some sense the
first generation of deep learning for
speech recognition which is I take one
of these components and I swap it out
for my favorite deep learning
algorithm so the picture looks sort of
like this
so with these traditional speech
recognition pipelines the problem that
we would always run into is that if you
gave me a lot more data if you gave me a
much bigger computer so that I could
train a huge model that actually didn't
help me because all the problems I had
were in the construction of this
pipeline and so eventually if you gave
me more data and a bigger computer the
performance of our speech recognition
system would just kind of peter out it
would just reach a ceiling that was very
hard to get over and so we just start
coming up with lots of different
strategies we start specializing for
each application we try to specialize
for each user
and try to make things a little bit
better around the edges and what these
deep learning acoustic models did was in
some sense moved that barrier a little
ways it made it possible for us to take
a bit more data much faster computers
that let us try a whole lot of models
and move that ceiling up quite a ways so
the question that many in the research
community including folks at Baidu have
been trying to answer is can we go to a
next-generation version of this insight
can we for instance build a speech
engine that is powered by deep learning
all the way from the audio input to the
transcription itself can we replace as
much of that traditional system with
deep learning as possible so that over
time as you give researchers more data
and bigger computers and the ability to
try more models their speech recognition
performance just keeps going up and we
can potentially solve speech for
everybody so the goal of this tutorial
is not to get you up here which
requires a whole bunch of things that
I'll tell you about near the end but
what we want to try to do is give you
enough to get a point on this curve and
then once you're on the curve the
idea is that what remains is now a
problem of scale it's about data and
about getting bigger computers and
coming up with ways to build bigger
models so that's my objective so that
when you walk away from here you have a
picture of what you would need to build
to get to this point and then after that
it's hopefully all about scale so thanks
to Vinay Rao who's been helping put this
tutorial together there is going to be
some starter code live for the basic
pipeline the deep learning part of the
pipeline that we're talking about so
there are some open source
implementations of things like CTC but
we wanted to make sure that there's a
system out there that's pretty
representative of the acoustic models
that I'm going to be talking about in
the first half of the presentation here
so this will be enough that you can get
a simple pipeline going with something
called max decoding
which I'll tell you about later and the
idea is that this is sort of a scale
model of the acoustic models that Baidu
and other places are powering real
production speech engines so this will
get you that point on the curve okay
so here's what we're going to talk about
the first part I'm just going to
introduce a few preliminaries talk about
pre-processing so we still have a little
bit of pre-processing around but it's
not really fundamental I think it's
probably going to go away in the long
run we'll talk about what is probably
the most mature piece of sequence
learning technologies for deep learning
right now so it turns out that one of
the fundamental problems of doing speech
recognition is how do I build a neural
network that can map this audio signal
to a transcription that can have a quite
variable length and so CTC is one highly
mature method for doing this and I think
you're actually going to hear about
maybe some other solutions later
today then I'll say a little bit about
training and just what that looks like
oops and then finally say a bit about
decoding and language models which is
sort of an addendum to the current
acoustic models that we can build that
make them perform a lot better and then
once you have this that's a picture of
what you need to get to this point on
the curve and then I'll talk a little
bit about what's remaining how do you
scale up from this little scale model up
to the full thing what does
that actually entail and then time
permitting we'll talk a little bit about
production how could you put something
like this into a cloud server and
actually serve real users with it great
so how is audio represented this should
be pretty straightforward I think unlike
a two dimensional image where we
normally have a 2d grid of pixels audio
is just a 1d signal and there are a
bunch of different formats for audio but
typically this one-dimensional wave that
is actually me saying something
like hello world is something like 8,000
samples per second or 16,000 samples per
second
and each sample is quantized into eight or
16 bits so when we represent this audio
signal that's going to go into our
pipeline you could just think of that as
a one dimensional vector so when I have
that box called X that represented my
audio signal you can figure this was
being broken down into
samples X 1 X 2 and so forth and if I
had a one-second audio clip this vector
would have a length of either say 8,000
or 16,000 samples and each element would
be say a floating-point number that I
had extracted from this eight or 16-bit
sample this is really simple now once I
have an audio clip we'll do a little bit
of pre-processing so there are a couple
of ways to start the first is to just do
some vanilla pre-processing like convert
to a simple spectrogram so if you look
at a traditional speech pipeline you're
going to see things like MFCCs which
are mel-frequency cepstral coefficients
you'll see a whole bunch of plays on
spectrograms where you take differences
in different kinds of features and try
to engineer complex representations but
for the stuff that we're going to do
today a simple spectrogram is just fine
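A minimal sketch of the kind of simple spectrogram meant here, assuming NumPy. It cuts the signal into short windows of about 20 milliseconds, takes an FFT of each window, and keeps the log of the power at each frequency. The function name `log_spectrogram`, the 10 ms step between windows, the Hann taper, and the small `eps` added before the log are illustrative assumptions, not details from the talk.

```python
import numpy as np

def log_spectrogram(signal, sample_rate=16000, window_ms=20, step_ms=10, eps=1e-10):
    """Return log power per frequency bin, shape (num_frames, num_bins)."""
    window = int(sample_rate * window_ms / 1000)  # samples per window
    step = int(sample_rate * step_ms / 1000)      # hop between windows (assumed)
    num_frames = 1 + (len(signal) - window) // step
    # Slice the 1-D signal into overlapping windows.
    frames = np.stack([signal[i * step: i * step + window]
                       for i in range(num_frames)])
    # Taper each window to reduce edge effects (an assumption, not from the talk).
    frames = frames * np.hanning(window)
    # FFT of each window, then power at each frequency.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # eps avoids log(0) on silent windows.
    return np.log(power + eps)

# Example input: one second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
x = np.sin(2 * np.pi * 440.0 * t)
spec = log_spectrogram(x)  # shape (99, 161)
```

Each row of `spec` is the frequency vector for one window, so the whole array is the input sequence a recurrent network would consume frame by frame.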
and it turns out as you'll see in a
second we lose a little bit of
information when we do this but it turns
out not to be a huge difference
now I said a moment ago that I think
probably this is going to go away in the
long run and that's because today you
can actually find recent research in
trying to do away with even this
pre-processing part and having your
neural network process the audio wave
directly and just train its own feature
transformation so there's some
references at the end that you can look
at for this so a quick straw poll
how many people have seen a spectrogram
or computed a spectrogram before pretty
good maybe 50% ok so the idea behind a
spectrogram is that it's sort of like a
frequency domain representation but
instead of representing this entire
signal in terms of frequencies I'm just
going to represent a small window
in terms of frequencies so to process
this audio clip the first thing I'm
going to do is cut out a little window
that's typically about 20 milliseconds
long and when you get down to that scale
it's usually very clear that these audio
signals are made up of sort of a
combination of different frequencies of
sine waves and then what we do is we
compute an FFT it basically converts
this little signal into the frequency
domain and then we just take the log of
the power at each frequency and so if
you look at what the result of this
is it basically tells us for every
frequency of sine wave what is the
magnitude what's the amount of power
represented by that sine wave that makes
up this original signal so over here in
this example we have a very strong low
frequency component in the signal and
then we have differing magnitudes at
differing frequencies so we
can just think of this as a vector so
now instead of representing this little
20 millisecond slice as sort of a
sequence of audio samples instead I'm
going to represent it as a vector here
where each element represents sort of
the strengths of each frequency in this
little window and the next step beyond
this is that if I just told you how to
process one little window you can of
course apply this to a whole bunch of
windows across the entire piece of audio
and that gives you what we call a
spectrogram and you can use either
disjoint windows that are just sort of
adjacent or you can apply them to
overlapping windows if you like so
there's a little bit of parameter tuning
there but this is an alternative
representation of this audio signal that
happens to be easier to use for a lot of
purposes okay so our goal starting from
this representation is to build what I'm
going to call an acoustic model but
which is really to the extent we can
make it happen is really going to be an
entire speech engine
that is represented by a neural network
so what we would like to do is build a
neural net that if we could train it
from a whole bunch of pairs X which is
my original audio that I turn into a
spectrogram and Y star that's the ground
truth transcription that some human is
given me if I were to train this big
neural network off of these pairs what
I'd like it to produce is some kind of
output that I'm representing by the
character C here so that I could later
extract the correct transcription which
I'm going to denote by Y so if I said
hello the first thing I'm going to do is
run pre-processing to get all these
spectrogram frames and then I'm going to
have a recurrent neural network that
consumes each frame and processes them
into some new representation called C
and hopefully I can engineer my network
in such a way that I can just read the
transcription off of these output
neurons so that's kind of the
intuitive picture of what we want to
accomplish so as I mentioned back in the
outline there's one obvious fundamental
problem here which is that the length of
the input is not the same as the length
of the transcription so if I say hello
very slowly then I can have a very long
audio signal even though I didn't change
the length of the transcription or if I
say hello very quickly then I can have a
very short piece of audio and so that means that
this output of my neural network is
changing length and I need to come up
with some way to map this variable-length neural
network output to this fixed length
transcription and also do it in a way
that we can actually train this pipeline
so the traditional way to deal with this
problem if you were building a speech
engine several years ago is to just try
to bootstrap the whole system so I'd
actually train a neural network to
correctly predict the sounds at every
frame using some kind of data set like
TIMIT where someone has lovingly
annotated all of the phonemes for me
and then I try to figure out the
alignment between my saying hello in a
phonetic transcription with the input
audio and then once I've lined up all of
the sounds with the input audio now I
don't care about length anymore because
I can just make a one-to-one mapping
between the audio input and the phoneme
outputs that I'm trying to target but
this alignment process is horribly
error-prone you have to do a lot of
extra work to make it work well and so
we really don't want to do this we
really want to have some kind of
solution that lets us solve this
straightaway so there are multiple ways
to do it
and as I mentioned there's some current
research on how to use things like
attentional models sequence to sequence
models that you'll hear about later in
order to solve this kind of problem but
as I said we'll focus on something
called connectionist temporal
classification or CTC that is sort of
current state of the art for how to do
this so here's the basic idea
so our recurrent neural network has
these output neurons that I'm calling C
and the job of these output neurons is
to encode a distribution over the
output symbols so because of the
structure of the recurrent Network the
length of this symbol sequence C is the
same as the length of my audio input so
if my audio input say was two seconds
long that might have a hundred audio
frames and that would mean that the
length of C is also a hundred a hundred
different values so if we were working
on a phoneme based model then C would be
some kind of phoneme representation and
we would also include a blank
symbol which is special for CTC but if
as we'll do in the rest of this talk
we're trying to just predict the
graphemes trying to predict the
characters in this language directly
from the audio then I would just let C
take on a value that's in my alphabet or
take on a blank or a space if my
language has spaces in it and then the
second thing I'm going to do
once my RNN gives me a distribution over
these symbols C is I'm going to
try to define some kind of mapping that
can convert this long transcription C
into the final transcription Y that's
like hello that's the actual string that
I want and now recognizing that C is
itself a probabilistic creature there's
a distribution over choices of C that
correspond to the audio once I apply
this function that also means that
there's a distribution over Y there's a
distribution over the possible
transcriptions that I could get and what
I'll want to do to train my network is
to maximize the probability of the
correct transcription given the audio so
those are the three steps that we have
to accomplish in order to make CTC work
so let's start with the first one so we
have these output neurons C and they
represent a distribution over the
different symbols that I could be
hearing in the audio so I've got some
audio signal down here you can see the
spectrogram frames poking up and this is
being processed by this recurrent neural
network and the output is a big bank of
softmax neurons so for the first
frame of audio I have a neuron that
corresponds to each of the symbols that
C could represent and this set
of softmax neurons here with the
output summing to 1 represents the
probability of say C 1 having the value
A B C and so on or this special blank
character so for example if I pick one
of the neurons over here then the first
row which represents the character B
and the 17th column which is the 17th
frame in time this represents the
probability that C 17 represents the
character B given the audio so once I
have this that also means that I can
just define a distribution not just over
the
individual characters but if I just assume
that all of the characters are
independent which is kind of a naive
assumption but if I bake this into the
system I can define a distribution over
all possible sequences of characters in
this alphabet so if I gave you a
specific instance a specific character
string using this alphabet for instance
I represent the string hello as H H H E
blank E blank blank L L blank L O and
then a bunch of blanks this is a string
in this alphabet for C and I can
just use this formula to compute the
probability of this specific sequence of
characters so that's how we compute
the probability for a sequence of
characters when they have the same
length as the audio input so the second
step and this is in some sense the kind
of neat trick in CTC is to define a
mapping from this long encoding of the
audio into symbols that crunches it down
to the actual transcription that we're
trying to predict and the rule is this
operator takes this character sequence
and it picks up all the duplicates all
of the adjacent characters that are
repeated and discards the duplicates and
just keeps one of them and then it drops
all of the blanks so in this example you
see you have three H's together so I
just keep one H and then I have a blank
I throw that away and I keep an E then I
have two L's so I keep one of the L's
over here and then another blank and an
L O and the one key thing to note is
that when I have two characters that are
different right next to each other I
just end up keeping those two characters
in my output but if I ever have a double
character like ll in hello then I'll
need to have a blank character that
gets put in between but if our neural
network gave me this
transcription told me that this was the
right answer we just have to apply this
operator and we get back the string
hello so now that we have a way to
define a distribution over these
sequences of symbols that are the same
length as the audio and we now have a
mapping from those strings into
transcriptions as I said this gives us a
probability distribution over the
possible final transcriptions so if I
look at the probability distribution
over all the different sequences of
symbols right
I might have hello written out like on
the last slide and maybe that has
probability 0.1 and then I might have
hello but written a different way
by say replacing this H with a
blank that has a smaller probability and
I have a whole bunch of different
possible symbol sequences below that and
what you'll notice is that if I go
through every possible combination of
symbols here
there are several combinations that all
map to the same transcription so here's
one version of hello there's a second
version of hello there's a third version
of hello and so if I now ask what's the
probability of the transcription hello
the way that I compute that is I go
through all of the possible character
sequences that correspond to the
transcription hello and I add up all of
their probabilities so I have to sum
over all possible choices of C that
could give me that transcription in the
end so you can kind of think of this as
searching through all the possible
alignments right
I could shift these characters around a
little bit I can move them forward
backward I could expand them by adding
duplicates or squish them up depending
on how fast someone is talking and that
corresponds to every possible alignment
between the audio and the characters
that I want to transcribe it sort of
solves the problem of the variable
length and the way that I get the
probability of a specific transcription
is to sum up to
marginalize over all the different
alignments that could be feasible and
then if we have a whole bunch of other
possibilities in here like the word
yellow I'd compute them in the same way
and so this equation just says to sum
over all the character sequences C so
that when I apply this little mapping
operator I end up with the transcription
Y oh I'm missing an E you're
talking about this one so when we apply
this sort of squeezing operator here we
drop this double E to get a single E in
hello so we remove all the duplicates so
the same way we did for an H right so
whenever you see two characters together
like this where they're adjacent
duplicates you sort of squeeze all those
duplicates out and you just keep one of
them but here we have a blank in between
so if we drop all the duplicates first
then we still have two L's left and then
we remove all the blanks so this gives
the algorithm a way to represent
repeated characters in the transcription
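The squeezing operator described above can be sketched in a few lines of Python. This is an illustration rather than code from the talk: the name `collapse`, the use of `_` for the blank symbol, and the example strings are all assumptions made for the sketch.

```python
from itertools import groupby

BLANK = "_"  # stand-in for the CTC blank symbol (assumed notation)

def collapse(symbols, blank=BLANK):
    """Merge adjacent duplicates first, then drop all blanks."""
    merged = [s for s, _ in groupby(symbols)]  # "HHH" -> "H", "__" -> "_"
    return "".join(s for s in merged if s != blank)

# A blank between the two L's keeps them from being merged into one.
with_blank = collapse("HHH_EE__LL_LO___")
# Without that blank, the run of L's collapses to a single L.
without_blank = collapse("HHHEELLLO")
```

Here `with_blank` is `"HELLO"` and `without_blank` is `"HELO"`, which is why a double letter in the transcription needs a blank between its two copies.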
there's another one in the back
oh I see yeah this is maybe I put a
space in here really I'd have put a
space character in here instead of a
blank really this could be h-e-l-l-o H
yeah so this space here is erroneous
okay very good
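To make the sum over alignments concrete, here is a brute-force sketch on a tiny example, assuming NumPy. It enumerates every symbol sequence C with the same length as the audio, multiplies the per-frame probabilities (the independence assumption), and adds up the sequences that collapse to the target. The names `collapse` and `transcription_prob`, the three-symbol alphabet, and the uniform softmax outputs are assumptions for illustration; the enumeration is exponential in the number of frames, so this is only usable for toy inputs.

```python
import itertools
import numpy as np

BLANK = 0  # index of the blank symbol (assumed convention)

def collapse(indices, blank=BLANK):
    """Merge adjacent duplicates, then drop blanks."""
    out, prev = [], None
    for i in indices:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return tuple(out)

def transcription_prob(probs, target):
    """Sum P(C | audio) over every C that collapses to target.

    probs has shape (T, K): one softmax distribution per audio frame.
    """
    T, K = probs.shape
    target = tuple(target)
    total = 0.0
    for c in itertools.product(range(K), repeat=T):
        if collapse(c) == target:
            # Independence assumption: multiply the per-frame probabilities.
            total += float(np.prod(probs[np.arange(T), list(c)]))
    return total

# Toy case: 3 frames, alphabet {blank, a, b}, uniform softmax outputs.
uniform = np.full((3, 3), 1.0 / 3.0)
p_a = transcription_prob(uniform, [1])      # 6 of 27 sequences give "a"
p_ab = transcription_prob(uniform, [1, 2])  # 5 of 27 sequences give "ab"
```

With uniform outputs every sequence has probability 1/27, so the result simply counts alignments: six sequences collapse to "a" and five collapse to "ab".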
okay so once I've defined this right I
just gave you a formula to compute the
probability of a string given the audio
so as with every good start to a
machine learning algorithm we go and we
try to apply maximum likelihood I now
give you the correct transcription and
your job is to tune the neural network
to maximize the probability of that
transcription using this model that I
just defined so in equations what I'm
going to do is I want to maximize the
log probability of Y star for a given
example I want to maximize the
probability of the correct transcription
given the audio X and then I'm just
going to sum over all the examples and
then what I want to do is just replace
this with the equation that I had on the
last page that says in order to compute
the probability of a given transcription
I have to sum over all of the possible
symbol sequences that could have given
me that transcription sum over all the
possible alignments that would map that
transcription to my audio so Alex Graves
and co-authors in 2006 actually showed
that because of this independence
assumption there is a clever way there
is a dynamic programming algorithm that
can efficiently compute this summation
for you and not only compute
this summation so that you can compute
the objective function but actually
compute its gradient with respect to
the output neurons of your neural
network so if you look at the paper the
algorithm details are in there
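As a rough illustration of that dynamic program, here is a sketch of the forward pass only, checked against the brute-force sum over every alignment. It assumes the blank sits at index 0 of each softmax frame and works with plain probabilities; real implementations work in log space and also return the gradient, which this sketch does not attempt.

```python
from itertools import groupby, product
from math import prod

BLANK = 0  # assumed index of the blank symbol in each softmax frame


def squeeze(path):
    """Collapse adjacent duplicates, then drop blanks."""
    return [k for k, _ in groupby(path) if k != BLANK]


def ctc_probability(probs, target):
    """P(target | audio); probs[t][k] is the softmax output for symbol k at frame t."""
    ext = [BLANK]  # target with a blank before, between and after every symbol
    for symbol in target:
        ext += [symbol, BLANK]
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for frame in probs[1:]:
        prev, alpha = alpha, [0.0] * len(ext)
        for s, symbol in enumerate(ext):
            total = prev[s]                      # stay on the same symbol
            if s >= 1:
                total += prev[s - 1]             # advance by one position
            if s >= 2 and symbol != BLANK and symbol != ext[s - 2]:
                total += prev[s - 2]             # skip the blank between different symbols
            alpha[s] = total * frame[symbol]
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)


def brute_force(probs, target):
    """Same quantity by enumerating every symbol sequence (exponential, for checking only)."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        if squeeze(path) == list(target):
            total += prod(frame[k] for frame, k in zip(probs, path))
    return total


# Three frames over a made-up alphabet: 0 = blank, 1 = "a", 2 = "b".
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.5, 0.2, 0.3]]
print(ctc_probability(probs, [1]), brute_force(probs, [1]))
```

The recursion does work proportional to frames times target length, while the brute-force check grows exponentially with the number of frames.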
what's cool right now in the history of
speech and deep learning is that this is
at the level of a technology this is
something that's now implemented in a
bunch of places so that you can download
a software package that efficiently will
calculate this CTC loss function for you
that can calculate this likelihood and
can also just give you back the gradient
so I won't go into the equations here
instead I'll tell you that there are a
whole bunch of implementations on the
web that you can now use as part of deep
learning packages so one of them from
Baidu implements CTC on the GPU it's
called warp-ctc at Stanford a group
there actually one of Andrew's students
has a CTC implementation and there's
also now CTC losses implemented in
packages like TensorFlow so this is
something that's sufficiently widely
distributed that you can use these
algorithms off the shelf so the way that
these work the way that we go about
training is we start from our audio
spectrogram we have our neural network
structure where you get to choose how
it's put together and then it outputs
this bank of softmax neurons and then
there are pieces of off-the-shelf
software that will compute for you the
CTC cost function they'll compute this
log likelihood given a transcription and
the output neurons from your recurrent
network and then the software will also
be able to tell you the gradient with
respect to the output neurons and once
you've got that you're set you can feed
them back into the rest of your code and
get the gradient with respect to all of
these parameters so as I said this is
all available now in sort of efficient
off-the-shelf software so you don't have
to do this work yourself so that's
pretty much all there is to the high
level algorithm with this it's actually
enough to get a sort of a working
Drosophila of speech recognition going
there are a few little tricks
though that you might need along the way
on easy problems you might not need
these but as you get to more
difficult datasets with a lot of noise
they can become more and more important
so the first one that we've been calling
SortaGrad in the vein of all of the
grad algorithms out there is basically a
trick to help with recurrent neural
networks so it turns out that when you
try to train one of these big RNN models
on some off-the-shelf speech data one of
the things that can really get you is
seeing very long utterances early in the
process because if you have a really
long utterance then if your neural
network is badly initialized you'll
often end up with things like underflow
and overflow as you try to go and
compute the probabilities and you end up
with gradients exploding as you try to
do back propagation and it can make your
optimization a real mess and it's coming
from the fact that these utterances are
really long and really hard and the
neural network just isn't ready to deal
with those transcriptions and so one of
the fixes that you can use is during the
early parts of training usually in the
first epoch is you just sort all of your
audio by length and now when you process
a mini batch you just take the short
utterances first so that you're working
with really short RNNs that are quite
easy to train and don't blow up and
don't have a lot of catastrophic
numerical problems and then as time goes
by you start operating on longer and
longer utterances that get more and more
difficult so we call this SortaGrad
it's basically a curriculum learning
method and so you can see some work from
Yoshua Bengio and his team on a whole
bunch of strategies for this but you can
think of the short utterances as being
the easy ones and if you start out with
the easy utterances and move to the
longer ones your optimization algorithm
can do better so here's an example
from one of the models that we've
trained where your CTC cost starts up
here and you know after a while you
optimize and you sort of bottom out
around you know a log likelihood of
maybe 30 and then if you add this
SortaGrad strategy after the first epoch
you're actually doing better and you can
reach a better optimum than you could
without it and in addition another
strategy that's extremely helpful for
recurrent networks and very deep neural
networks is batch normalization so
this is becoming very popular and it's also
available as sort of an off-the-shelf
package inside of a lot of the different
frameworks that are available today so
if you start having trouble you can
consider putting batch normalization
into your network okay so our neural
network now spits out this big bank of
softmax neurons we've got a training
algorithm we're just doing gradient
descent how do we actually get a
transcription this process as I said is
meant to be as close to characters as
possible but we still sort of need to
decode these outputs and you might think
that one simple solution which turns out
to be approximate to get the correct
transcription is just go through here
and pick the most likely sequence of
symbols for C and then apply our little
squeeze operator to get back the
transcription the way that we defined it
so this turns out not to be the optimal
thing this actually doesn't give you the
most likely transcription because it's
not accounting for the fact that every
transcription might have multiple
sequences of C's multiple alignments in
this representation but you can actually
do this and this is called the max
decoding and so for this sort of
contrived example here
I put little red dots on the most likely
C and if you see there's a couple of
blanks a couple of C's there's another blank
an A more blanks B's more blanks and if
you apply our little squeeze operator
you just get the word cab if you do this
it is often terrible it'll often give
you a very strange transcription that
doesn't look like English necessarily
but the reason I mention it is that this
is a really handy diagnostic that if
you're kind of wondering what's going on
in the network glancing at a few of
these will often tell you if the
network's starting to pick up any signal
or if it's just outputting gobbledygook
so I'll give you a more detailed
example in a second of how that happens
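A small sketch of max decoding as described: take the most likely symbol in every frame, then apply the squeeze operator. The four-symbol alphabet and the helper that builds frames are made up for illustration, with index 0 assumed to be the blank.

```python
from itertools import groupby

ALPHABET = "_abc"  # hypothetical tiny alphabet; index 0 is the blank


def max_decode(probs, alphabet=ALPHABET, blank=0):
    """Pick the argmax symbol per frame, collapse duplicates, drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    collapsed = [k for k, _ in groupby(best)]
    return "".join(alphabet[k] for k in collapsed if k != blank)


def peaked_frame(symbol, alphabet=ALPHABET):
    """A made-up softmax frame that puts most of its mass on one symbol."""
    return [0.85 if ch == symbol else 0.05 for ch in alphabet]


# Blanks, a couple of c's, a blank, an a, blanks, b's, blanks.
frames = [peaked_frame(s) for s in "__cc_a__bb__"]
print(max_decode(frames))

# Two frames where blank is the per-frame winner with 0.6 each time.
print(repr(max_decode([[0.6, 0.4], [0.6, 0.4]], alphabet="_a")))
```

The second call shows why this is only approximate: the empty transcription has probability 0.6 times 0.6, which is 0.36, while the alignments "a blank", "blank a" and "a a" all collapse to "a" and together carry the remaining 0.64.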
all right so these are all the concepts
of our very simple pipeline and
the demo code that we're going to put up
on the web will basically let you work
on all of these pieces so once we try to
train these I want to give you an
example of the sort of data that we're
training on a tanker is a ship designed
to carry large volumes of oil okay so
this is just a person sitting there
reading The Wall Street Journal to us so
this is a sort of simple data set it's
really popular in the speech research
community it's published by the
linguistic data consortium there's also
a free alternative called LibriSpeech
that's very similar but instead of
people reading The Wall Street Journal
it's people reading Creative Commons
audiobooks so in the demo code we
have a really simple network that works
reasonably well it looks like this so
there's a sort of family of models that
we've been working with where you start
from your spectrogram you have maybe one
layer or several of convolutional
filters at the bottom and then on top of
that you have some kind of recurrent
neural network it might just be a
vanilla RNN but you can also use
like LSTM or GRU cells any of your
favorite RNN creatures from the
literature and then on top of that we
have some fully connected layers that
produce these softmax outputs and those
are the things that go into CTC for
training so this is pretty
straightforward the implementation on
the web uses the warp-ctc code and
then we would just train this big neural
network with stochastic gradient descent
Nesterov momentum all the stuff that
you've probably seen in a whole bunch of
other talks so far all right so if you
actually run this what is going on
inside so I mentioned that looking at
the max decoding is kind of a handy way
to see what's going on inside
this creature so I wanted to show you an
example so this is a picture this is a
visualization of
those softmax neurons at the top of one
of these big neural networks so this is
the representation of C from all the
previous slides so on the horizontal
axis this is basically time this is the
frame number or which chunk of the
spectrogram we're seeing and then on the
vertical axis here you see these are all
the characters in the English alphabet
or a space or a blank so after three
hundred iterations of training which is
not very much the system has learned
something amazing which is that it
should just output blanks and spaces all
the time because these are by far
because of all the silence and things in
your data set these are the most common
characters right I just want to fill up
the whole space with blanks but you can
see it's kind of randomly poking out a
few characters here and if you run your
little max decoding strategy to see
what does the system think the
transcription is it thinks the
transcription is h and so but after
three hundred iterations that's okay but
this is a sign that the neural network's
not going crazy your gradient isn't
busted it's at least learned what are the
most likely characters then after maybe
1500 or so you start to get a little bit
of structure and if you try to like
mouth these words you might be able to
sort of see that there's some English
like sounds in here like they are just
in frightened something kind of odd but
it's actually looking much better than
just h it's actually starting to output
something go a little bit farther it's a
little bit more organized you could
start to see that we have sort of
fragments of possibly words starting to
form and then after you're getting close
to convergence it's still not a real
sentence but does this make sense to
people can you guess like what the correct
transcription might be yeah so you might
have a couple of candidates the
correct one is actually there just in
front and so you can see that sort of
it's sort of sounding it out with
English characters like I have a young
son and I kind of figure I'm eventually
going to see him producing max decoded
outputs of English and you're just going to
like sound these things that we like if
they're just in front there but this
is why this max decoding strategy is
really handy because you can kind of
look at this output and say yeah it's
starting to get some actual signal out
of the data it's not just gobbledygook
so because this is like my favorite
speech recognition party game I wanted
to show you a few more of these so
here's the max decoded output the poor
little things cried Cynthia think of
them having been turned to the wall all
these years so you can hear like the
sound of the breath at the end turns
into a little bit of a word
Cynthia is sort of in this transcription
and you'll find that things like proper
names and so on tend to get sounded out
but if those names are not in your audio
data there's no way the network could
have learned how to say the name Cynthia
and we'll come back to how to solve that
later did you see the true label
the poor little things cried Cynthia and
that the last word is actually all these
years and there isn't a word hanging off
at the end so here's another one that is
true bad grade how many people figured
out what this is this is the max decoded
transcription sounds sounds good to you
it sounds good to me
if you told me that this was the ground
truth like oh that's weird I have to go
look up what this is here's the
actual true label turns out this is a
French word that means something like
rubbernecking I had no idea what this
word was so this is again one of the cool
examples of what these neural networks
are able to figure out with no knowledge
of the language itself okay so let's go
back to decoding we just talked about
max decoding which is sort of an
approximate way of going from these
probability vectors to a transcription Y
and if you want to find the actual most
likely transcription Y there's actually
no algorithm in general that can give
you the
perfect solution efficiently so the
reason for that remember is that for a
single transcription Y I have an
efficient algorithm to compute its
probability but if I want to search over
every possible transcription I don't
know how to do that because there are
combinatorially or exponentially many
possible transcriptions and I'd have to
run this algorithm to compute the
probability of all of them so we have to
resort to some kind of generic search
strategy and so one proposed in the
original paper briefly is a sort of
prefix decoding strategy so I don't want
to spend a ton of time on this instead I
want to step to sort of the next piece
of the picture so there were a bunch of
examples in there right like proper
names like Cynthia and things like
badauderie where unless you had heard
this word before you have no hope of
getting it right with your neural
network and so there are lots of
examples like this in the literature of
things that are sort of spelled out
phonetically but aren't legitimate
English transcriptions and so what we'd
like to do is come up with a way to fold
in just a little bit of that knowledge
about the language to take a small
step backward from a perfect end-to-end
system and make these
transcriptions better so as I said the
real problem here is that you don't have
enough audio available to learn all
these things if we had millions and
millions of hours of audio sitting
around you could probably learn all
these transcriptions because you just
hear enough words that you know how to
spell them all maybe the way a human
does but unfortunately we just don't
have enough audio for that so we have to
find a way to get around that data
problem there's also an example of
something that in the AI lab we've
dubbed the Tchaikovsky problem which is
that there are certain names in the
world right like proper names that if
you've never heard of it before you have
no idea how it's spelled and the only
way to know it is to have seen this word
in text before and to see it in context
so part of the purpose of these language
models is to get examples like this
correct
so there are a couple of solutions one
would be to just step back to a more
traditional pipeline right use phonemes
because then we can bake new words in
along with their phonetic pronunciation
and the system will just get it right
but in this case I want to focus on
just fusing in a traditional language
model that gives us the probability a
priori of any sequence of words so the
reason that this is helpful is that
using a language model we can train
these things from massive text corpora
we have way way more text in the world
than we have transcribed audio and so
that makes it possible to train these
giant language models with huge
vocabulary and they can also pick up the
sort of contextual things that will tip
you off to the fact that Tchaikovsky
concerto is a reasonable thing for a
person to ask and that this particular
transcription which we have seen in the
past trike off ski concerto even though
composed of legitimate English words is
is nonsense
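To make the idea of an a priori word-sequence probability concrete, here is a toy bigram model with add-one smoothing. The three-sentence corpus and the function names are invented for this sketch; a real system would train on a massive text corpus with a dedicated toolkit rather than anything like this.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts; returns (contexts, bigrams, vocab_size)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        vocab.update(words[1:])
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams, len(vocab) + 1  # +1 reserves mass for unseen words


def log_prob(sentence, contexts, bigrams, vocab_size):
    """Add-one smoothed log probability of a word sequence."""
    words = ["<s>"] + sentence.split()
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(words, words[1:])
    )


# Hypothetical text-only training data: no audio is needed for this part.
corpus = [
    "play the tchaikovsky concerto",
    "the tchaikovsky concerto is long",
    "take the trike off the ski rack",
]
model = train_bigram(corpus)
print(log_prob("tchaikovsky concerto", *model))
print(log_prob("trike off ski concerto", *model))
```

Both candidates are built from words the model has seen, but only the first appears as a sequence in the text, so it gets the higher score.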
so there's actually not much to see on
the language modeling front for this
except that the reasons for sticking
with traditional n-gram models are
kind of interesting if you're excited
about speech applications so if you go
use a package like KenLM on the web to
go build yourself a giant n-gram
language model these are really simple
and well supported and so that makes
them easy to get working and they'll let
you train from lots of corpora but for
speech recognition in practice one of
the nice things about n-gram models as
opposed to trying to say use like an RNN
model is that we can update these things
very quickly if you have a big
distributed cluster you can update that
n-gram model very rapidly in parallel
from new data to keep track of whatever
the trending words are today that your
speech engine might need to deal with
and we also have the need to query this
thing very rapidly inside our decoding
loop that you'll see in just a second
and so being able to just look up the
probabilities in a table the way an
n-gram model is structured is very
valuable so I hope someday all of this
will go away and be replaced with an
amazing neural network but this is the
really the best practice today so in order
to fuse this into the system since to
get the most likely transcription right
probability of Y given X to maximize that
thing we need to use a generic search
algorithm anyway this opens up a door
once we're using a generic search scheme
to do our decoding and find the most
likely transcription we can add some
extra cost terms so in a previous piece
of work from Awni Hannun and several
co-authors what you do is you take the
probability of a given word sequence
from your audio so this is what you
would get from your giant RNN and you
can just multiply it by some extra terms
the probability of the word sequence
according to your language model raised
to some power and then multiplied by the
length raised to another power you
see that if you just take the log of
this objective function right then you
get the log probability that was your
original objective you get alpha times
the log probability of the language
model and beta times the log of the
length and these alpha and beta
parameters let you sort of trade off the
importance of getting a transcription
that makes sense to your language model
versus getting a transcription that
makes sense to your acoustic model and
actually sounds like the thing that you
heard and the reason for this extra term
over here is that as you're multiplying
in all of these terms you tend to
penalize long transcriptions a bit too
much and so having a little bonus or
penalty at the end to tweak to get the
transcription length right is very
helpful so the basic idea behind this is
just to use beam search so beam search is a
really popular search algorithm there's a whole
bunch of instances of it and the rough
strategy is this so starting from time
zero starting from T equals one at the
very beginning of your audio input I
start out with an empty list that I'm
going to populate
with prefixes and these prefixes
are just partial transcriptions that
represent what I think I've heard so far
in the audio up to the current time and
the way that this proceeds is I'm going
to take at the current time step
each candidate prefix out of this list
and then I'm going to try all of the
possible characters in my softmax
neurons that can possibly follow it so
for example I can try adding a blank I
say if the next element of C is actually
supposed to be a blank then what that
would mean is that I don't change my
prefix right because the blanks are just
going to get dropped later but I need to
incorporate the probability of that
blank character into the probability of
this prefix right it represents one of
the ways that I could reach that prefix
and so I need to sum that probability
into that candidate and likewise
whenever I add a space to the end of a
prefix that signals that this prefix
represents the end of a word and so in
addition to adding the probability of
the space into my current estimate this
gives me the chance to go look up that
word in my language model and fold that
into my current score and then if I try
adding a new character onto this prefix
it's just straightforward I just go and
update the probabilities based on the
probability of that character and then
at the end of this I'm going to have a
huge list of possible prefixes that
could be generated and this is where you
would normally get the exponential
blow-up of trying all possible prefixes
to find the best one and what beam
search does is it just says take the K
most probable prefixes after I remove
all the duplicates in here and then go
and do this again and so if you have a
really large K then your algorithm
will be a bit more accurate in finding
the best possible solution to this
maximization problem but it'll be slower
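A sketch of that prefix beam search over the softmax outputs alone. It leaves out the language model lookup that the talk performs when a space is appended, and the bookkeeping of two probabilities per prefix (alignments ending in blank versus non-blank) is one common way to organize it, assumed here rather than taken from the talk.

```python
from collections import defaultdict


def prefix_beam_search(probs, alphabet, beam_width=8, blank=0):
    """Return (prefix, probability) pairs, most probable first."""
    # prefix -> (p_blank, p_non_blank): mass of alignments collapsing to this
    # prefix that end in a blank / in a non-blank symbol.
    beams = {"": (1.0, 0.0)}
    for frame in probs:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for k, p in enumerate(frame):
                if k == blank:
                    # A blank leaves the prefix unchanged but adds to its mass.
                    candidates[prefix][0] += (p_b + p_nb) * p
                    continue
                ch = alphabet[k]
                if prefix and prefix[-1] == ch:
                    # A repeat only extends the prefix if a blank came before it,
                    candidates[prefix + ch][1] += p_b * p
                    # otherwise it is squeezed into the same prefix.
                    candidates[prefix][1] += p_nb * p
                else:
                    candidates[prefix + ch][1] += (p_b + p_nb) * p
        ranked = sorted(candidates.items(), key=lambda kv: sum(kv[1]), reverse=True)
        # Keep only the K most probable prefixes; duplicates were merged above.
        beams = {prefix: tuple(mass) for prefix, mass in ranked[:beam_width]}
    return [(prefix, p_b + p_nb) for prefix, (p_b, p_nb) in beams.items()]


# Same two frames where max decoding returns the empty string.
result = prefix_beam_search([[0.6, 0.4], [0.6, 0.4]], "_a", beam_width=4)
print(result[0])
```

On this two-frame example the search ranks "a" first, because it sums the probability of every alignment that reaches that prefix instead of following the single best symbol per frame.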
so here's what ends up happening if you
run this decoding algorithm if you just
run it on the RNN outputs you'll see
that it's actually better than straight max
decoding you find slightly better
solutions but you still make things like
spelling errors like Boston with an I
but once you add in a language model
that can actually tell you that the word
Boston with an O is much more probable
than Boston with an I you can fix this so one
place you can also drop in deep
learning that I wanted to mention very
rapidly is just if you're not happy with
your Engram model because it doesn't
have enough context or you've seen a
really amazing neural language modeling
paper that you'd like to fold in one
really easy way to do this and link it
to your current pipeline is to do
rescoring so when this decoding
strategy finishes it can give you the
most probable transcription but it also
gives you this big list of the top K
transcriptions in terms of probability
and what you can do is take your
recurrent network
and just rescore all of these and
basically reorder them according to this
new model so in the instance of a neural
language model let's say that this is my
N-best list right I have five candidates
that were output by my decoding strategy
and the first one is I'm a connoisseur
looking for wine and pork chops sounds
good to me I'm a connoisseur looking for
wine and port shots so this is actually
quite subtle and depending on what kind
of connoisseur you are sort of up to
interpretation what you're looking for
but perhaps a neural language model is
going to be a little bit better at
figuring out that wine and port are
closely related and if you're a
connoisseur you might be looking for
wine and port shots and so what you would
hope to happen is that a neural language
model trained on a bunch of text is
going to correctly reorder these things
and figure out that the second beam
candidate is actually the correct one even
though your n-gram model didn't help you
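A sketch of reordering an N-best list with a new language model. It reuses the weighted objective from the decoding discussion (acoustic log probability plus alpha times the language model log probability plus beta times the log of the length), which is one plausible way to combine scores and not necessarily what the speaker's system does. Length is taken as word count, and all the numbers and the lookup-table "language model" are made up.

```python
import math


def combined_score(log_p_acoustic, log_p_lm, num_words, alpha, beta):
    """log P(y|x) + alpha * log P_lm(y) + beta * log(length)."""
    return log_p_acoustic + alpha * log_p_lm + beta * math.log(num_words)


def rescore(n_best, lm_log_prob, alpha=1.0, beta=1.0):
    """n_best is a list of (text, acoustic log prob); returns texts, best first.

    Assumes every candidate contains at least one word.
    """
    scored = []
    for text, log_p_acoustic in n_best:
        score = combined_score(
            log_p_acoustic, lm_log_prob(text), len(text.split()), alpha, beta
        )
        scored.append((score, text))
    return [text for score, text in sorted(scored, reverse=True)]


# Hypothetical beam output: the acoustic model slightly prefers the first candidate.
n_best = [
    ("i'm a connoisseur looking for wine and pork chops", -10.0),
    ("i'm a connoisseur looking for wine and port shots", -10.5),
]
# Stand-in for a neural language model: a lookup table of invented log probs.
toy_lm = {n_best[0][0]: -22.0, n_best[1][0]: -19.0}
reranked = rescore(n_best, toy_lm.get)
print(reranked[0])
```

Setting alpha to zero falls back to the acoustic ordering, which is a quick way to see how much the language model term is contributing.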
okay so that is really the scale model
that is the set of concepts that you
need to get a working speech recognition
engine based on deep learning and so the
thing that's left to go to
state-of-the-art performance and start
serving users is scale so I'm going to
kind of run through quickly a bunch of
the different tactics that you can use
to try to get there so the two pieces of
scale that I want to cover of course are
data and computing power where do you
get them so the first thing to know this
is just a number you can keep in the
back of your head for all purposes which
is that transcribing speech data is not
cheap but it's also not prohibitive it's
about 50 cents to a dollar a minute
depending on the quality you want and
who's transcribing it and the difficulty
of the data so typical speech benchmarks
you'll see out there maybe hundreds to
thousands of hours like the LibriSpeech
data set is maybe hundreds of
hours there's another data set called
VoxForge and you can kind of cobble
these together and get maybe hundreds to
thousands of hours but the real
challenge is that the application
matters a lot so all the utterances I
was playing for you are examples of read
speech people are sitting in a nice
quiet room they're reading something
wonderful to me and so I'm going to end
up with a speech engine that's really
awesome at listening to The Wall Street
Journal but maybe not so good at
listening to someone in a crowded cafe
so the application that you want to
target really needs to match your data
set and so it's worth at the outset if
you're thinking about going and buying a
bunch of speech data to think of what is
the style of speech you're actually
targeting are you worried about read
speech like the ones we're hearing or do
you care about conversational speech it
turns out that when people talk in a
conversation when they're spontaneous
they're just coming up with what to say
on the fly versus if they have something
that they're just dictating and they
already know what to say they behave
differently and they can exhibit all of
these effects like disfluency and
stuttering
and then in addition to that we have all
kinds of environmental factors that
might matter for an application like
reverb and echo we start to care about
the quality of microphones and whether
they have noise canceling there's
something called Lombard effect that
I'll mention again in a second and of
course things like speaker accents where
you really have to think carefully about
how you collect your data to make sure
that you actually represent the
kinds of cases you want to test on so
the reason that read speech is really
popular is because we can get a lot of
it and even if it doesn't perfectly
match your application it's cheap and
getting a lot of it can still help you
so I wanted to say a few things about
read speech because for less than ten
bucks an hour often a lot less you can
get a whole bunch of data and it has the
disadvantage that you lose a lot of
things like inflection and
conversationality but it can still be helpful
so one of the things that we've tried
doing and I'm always interested to hear
more clever schemes for this is you can
kind of engineer the way that people
read to try to get the effects that you
want so here's one which is that if
you want a little bit more
conversationality you want to get people out of that
kind of humdrum dictation you can start
giving them reading material that's a
little more exciting you can give them
like movie scripts and books and people
will actually start voice acting for you
creep in said the witch and see if it is
properly heated so that we can put the
bread in so these are really wonderful
workers right they're like kind of really
getting into it to give you better data
the wolf is dead
the wolf is dead and danced for joy
around about the well with their mother
so yeah people reading poetry they get
this sort of lyrical quality into it
that you don't get from just
reading The Wall Street Journal and
finally there's something called the
Lombard effect that happens when people
are in noisy environments so if you're
in like a noisy party and you're trying
to talk to your
friend who's a couple of chairs away
you'll catch yourself involuntarily
going hey over there what are you doing
you raise your inflection and you kind
of you try to use different tactics to
get your signal-to-noise ratio up you'll
sort of work around the channel
problem and so this is very
problematic when you're trying to do
transcription in a noisy environment
because people will talk to their phones
using all these effects even though the
noise canceling and everything could
actually help them so one strategy we've
tried with varying levels of success
then they fell asleep and evening passed
but no one came to the poor children is
to actually play loud noise in people's
headphones to try to get them to elicit
this behavior again here this person is
kind of raising their voice a little bit
in a way that they wouldn't if they were
just reading and similarly as I
mentioned there are a whole bunch of
different augmentation strategies so
there are all these effects of
environment like reverberation echo
background noise that we would like our
speech engine to be robust to and one
way you could go about trying to solve
this is to go collect a bunch of audio
from those cases and then transcribe it
but getting that raw audio is really
expensive so instead an alternative is
to take the really cheap read speech
that's very clean and use some like
off-the-shelf
open source audio toolkit to synthesize
all the things you want to be robust to
so for example if we want to simulate
noise in a cafe here's just me
talking to my laptop in a quiet room
hello how are you so if I'm just asking
how are you and then here's the sound of
a cafe
so I can obviously collect these
independently very cheaply then I can
synthesize this by just adding these
signals together hello how are you which
actually sounds I don't know sounds to
me like me talking to my laptop at a
Starbucks or something
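A sketch of that overlay step on raw sample lists. The talk only says the two signals are added together; scaling the noise to hit a chosen signal-to-noise ratio, looping short noise clips, and the synthetic sine and random "cafe" signals are all assumptions added for illustration.

```python
import math
import random


def mean_power(signal):
    """Average squared amplitude of a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaling the noise to the requested SNR in dB."""
    # Loop the noise clip if it is shorter than the utterance, then trim it.
    repeats = len(speech) // len(noise) + 1
    noise = (noise * repeats)[: len(speech)]
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Stand-ins for real recordings: one second of a 220 Hz tone at 16 kHz
# and a quarter second of uniform random "cafe" noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
cafe = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
noisy = mix_at_snr(speech, cafe, snr_db=10)
print(len(noisy))
```

Because the clean utterance and its transcription are already paired, every mixed copy comes with a label for free, which is what makes this kind of synthesis so cheap.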
and so for our work on Deep Speech we
actually take something like 10,000
hours of raw audio that sounds kind of
like this and then we pile on lots and
lots of audio tracks from Creative
Commons videos it turns out there's a
strange thing people upload like noise
tracks to the web that last for hours
it's like really soothing to listen to the
highway or something and so you can
download all of this free found
data and you can just overlay it on this
voice and you can synthesize perhaps
hundreds of thousands of hours of unique
audio and so the idea here is that it's
just much easier to engineer your data
pipeline to be robust than it is to
engineer the speech engine itself to be
robust so whenever you encounter an
environment that you've never seen
before and your speech engine is
breaking down you should shift your
instinct away from trying to engineer
the engine to fix it and toward this
idea of how do I reproduce it really
cheaply in my data so here's that Wall
Street Journal example again is it
designed to carry large volumes of oil
or other liquid cargo and so if I wanted
to for instance deal with a person
reading Wall Street Journal on a tanker
maybe taking a ship designed to carry
large volumes of oil or other liquid
cargo there's lots of reverb in this
room so you can't hear the reverb on the
audio but basically you know you can
synthesize these things with one line of
SoX on the command line so from some
of our own work with building a large
scale speech engine with these
technologies this helps a ton and you
can actually see that when we run on
clean and noisy test utterances as we
add more and more data all the way up to
about 10,000 hours and using a lot of
these synthesis strategies we can just
steadily improve the
performance of the engine and in fact on
things like clean speech you can get
down well below 10% word error rate
which is a pretty strong engine
okay let's talk about computation
because the caveat on that last slide is
yes more data will help if you have a
big enough model and big models usually
mean lots of computation so what I
haven't talked about is how big are
these neural networks and how big is one
experiment so if you actually want to
train one of these things at scale what
are you in for so here's the the back of
the envelope it's going to take at least
the number of connections in your neural
network so take one slice of that RNN
the number of unique connections
multiplied by the number of frames once
you unroll the recurrent network once
you unfold it multiplied by the number
of utterances you've got to process in
your data set
times the number of training epochs the
number of times you loop through the
data set times three because you have to
do forward and back propagation times two flops for
every connection because there's a
multiply and an add so if you multiply
this out for some parameters from
the Deep Speech engine at Baidu you get
something like 1.2 times 10 to the 19
flops so about 10 exaflops and if you
run this on a Titan X card this will
take about a month now if you already
know what the model is that might be
tolerable if you're on your epic
run to get your best performance so far
then this is okay but if you don't know
what model is going to work you're
targeting some new scenario then you
want it done now so you can try lots and
lots of models quickly so the easy fix
is just to try using a bunch more GPUs
with data parallelism and the good news
is that so far it looks like speech
recognition allows us to use large mini batch
sizes we can process enough utterances
in parallel that this is actually
efficient so you'd like to keep you know
maybe a bit more than 64 utterances on
each GPU
and up to a total mini batch size of
like a thousand or maybe two thousand
it's still useful and so
if you're putting together your
infrastructure you can go out and you
can buy a server that'll fit eight of
these Titan GPUs in it and that'll
actually get you to less than a week
training time which is pretty
respectable so there are a whole bunch
of ways to use GPUs at Baidu we've been
using synchronous SGD it turns out that
you've got to optimize things like
all-reduce code once you leave one node you
have to start worrying about your
network and if you want to keep scaling
then thinking about things like network
traffic and the right strategy for
moving all of your data becomes
important but we've had success scaling
really well all the way out to things
like 64 GPUs and just getting linear
speedups all of the way so if you've
got a big cluster available these things
scale really well and there are a bunch
of other solutions for instance
asynchronous SGD is now kind of a
mainstay of distributed deep learning
there's also been some work recently of
trying to go back to synchronous SGD
that has a lot of nice properties but
using things like backup workers so
that's sort of the easy thing just throw
more GPUs at it and go faster one word
of warning as you're trying to build
these systems is to watch for code that
isn't as optimized as you expected it to
be and so this back of the envelope
calculation that we did of figuring out
how many flops are involved in our
network and then calculating how long it
would take to run if our GPU were running
at full efficiency you should actually
do this for your network we call
this the speed of light this is the
fastest your code could ever run on one
GPU and if you find that you're just
drastically underperforming that number
what could be happening to you is that
you've hit a little edge case in one of
the libraries that you're using and
you're actually suffering a huge setback
that you don't need to be feeling right
now so one of the things we found back
in November is that in libraries like
cuBLAS you can actually use mini batch
sizes that
hit these weird catastrophic cases in
the library where you could be suffering
like a factor of two or three
performance reduction so that might take
your wonderful one-week training time
and blow it up to say a three week
training time so that's why I wanted to
go through this and ask you to keep in
mind while you're training these things
try to figure out how long it ought to
be taking and if it's going a lot slower
be suspicious that there's some code you
could be optimizing another good trick
that's particular to speech though you can also
use this for other recurrent networks is
to try to keep similar length utterances
together so if you look at your data set
like a lot of things you have this sort
of distribution over possible utterance
lengths and so you see there's a whole
bunch that are you know maybe within
about 50% of each other but there's also
a large number of utterances that are
very short and so what happens is when
we want to process a whole bunch of
these utterances in parallel if we
just randomly select say a thousand
utterances to go into a mini batch
there's a high probability that we're
going to get a whole bunch of these
little short utterances along with some
really long utterances and in
order to make all the ctc libraries work
and all of our recurrent Network
computations easy what we have to do is
pad these audio signals with zeros and
that winds up meaning that we're wasting
huge amounts of computation maybe a
factor of two or more and so one way to
get around it is just sort all of your
utterances by length and then try to
keep the mini-batches to be similar
lengths so that you just don't end up
with quite as much waste in each mini
batch and this kind of modifies
your algorithm a little bit but in the
end is worthwhile all right this is kind
of all I want to say about computation
if you've got a few GPUs keep
an eye on your running time so that you
know what to optimize and pay attention
to the easy wins like keeping your
utterances together you can actually
scale really well and I think for a lot
of the jobs we see you can have
your GPU running at something like 50%
of the peak and that's all in with
network time with all the bandwidth
bound stuff you can actually run at two
to three teraflops on a GPU that can
only do five teraflops in the perfect
case so what can you actually do with
this one of my favorite results from
one of our largest models is actually in
Mandarin so we have a whole bunch of
labeled Mandarin data at Baidu and so one
of the things that we did was we scaled
up this model trained it on a huge
amount of Mandarin data and then as we
always do we sit down and we do error
analysis and what we would do is have a
whole bunch of humans sitting around try
to debate the transcriptions and figure
out the ground truth labels that tend to be
very high quality and then we go and
we'd run now a sort of holdout test on
some new people and on the speech engine
itself and so if you benchmark a single
human being against this deep speech
engine in Mandarin that's powered by all
the technologies we were just talking
about it turns out that the speech
engine can get an error rate that's down
below six percent character error rate
so only about six percent of the
characters are wrong and a single human
sitting there listening to these
transcriptions actually does quite a bit
worse it's almost ten percent if you
give people a bit of an advantage which
is you now assemble a
committee of people and you give them a
fresh test set so that no one has seen
it before and we run this test again it
turns out
that the two cases are actually really
similar and you can end up with a
committee of native Mandarin speakers
sitting around debating no no I think
this person said this or no they have an
accent it's from the north I think
they're actually saying that and then
when you show them the deep speech
transcription they actually go ah that
that's what it was and so you can
actually get this technology up to a
point where it's highly competitive with
human beings even human beings working
together and this is sort of where I
think all the speech recognition systems
are heading thanks to deep learning and
the
technologies that we're talking about
here any questions so far
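The trick of keeping similar-length utterances together, described a little earlier in the computation part of the talk, can be sketched as follows. This is only an illustration under stated assumptions: utterances are represented by their frame counts alone, every utterance in a mini-batch is zero-padded to the longest one in that batch, and the function names are hypothetical rather than taken from any Baidu code.

```python
def make_batches(lengths, batch_size, sort_by_length):
    """Split utterance lengths (in frames) into consecutive mini-batches.

    With sort_by_length=True the utterances are first ordered by length,
    so each mini-batch holds utterances of similar duration.
    """
    order = sorted(lengths) if sort_by_length else list(lengths)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def waste_fraction(batches):
    """Fraction of processed frames that are zero padding.

    Assumption: every utterance is padded to the longest one in its batch.
    """
    useful = sum(sum(batch) for batch in batches)
    padded = sum(max(batch) * len(batch) for batch in batches)
    return 1.0 - useful / padded


# Hypothetical toy data set: a mix of very short and long utterances.
lengths = [100, 10, 100, 10]

unsorted_batches = make_batches(lengths, batch_size=2, sort_by_length=False)
sorted_batches = make_batches(lengths, batch_size=2, sort_by_length=True)

print("padding waste, original order:", waste_fraction(unsorted_batches))
print("padding waste, sorted by length:", waste_fraction(sorted_batches))
```

In this toy case the mixed batches spend a large share of their frames on padding while the sorted ones spend none. A real pipeline would probably also shuffle the order of the resulting batches between epochs, which is one way the training algorithm ends up slightly modified.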
yeah go ahead yep sorry yeah so the
question is if humans have such a hard
time coming up with the correct
transcription how do you know what the
truth is and the real answer is you
don't really sometimes you might have a
little bit of user feedback but in this
instance we have very high quality
transcriptions that are coming from many
labelers teamed up with a speech engine
and so that could be wrong we do
occasionally find errors where we just
think that's a label error but when you
have a committee of humans around
the really astonishing thing is that you
can look at the output of the speech
engine and the humans will suddenly
jump ship and say oh no no no no the
speech engine is actually correct because
it'll often come up with an obscure word
or place that they weren't aware of yeah
so this is you know an inherently
ambiguous result but let's say that a
committee of human beings tends to
disagree with another committee of human
beings about the same amount as a
speech engine does yeah yeah so this is
a so this is using the CTC cost right
that's really the core component of this
system it's how you deal with mapping
one variable length sequence to another
and the CTC cost is not perfect it has
this assumption of Independence baked
into the probabilistic model and because
of that assumption we're introducing
some bias into the system and for
languages like English where the
characters are obviously not independent
of each other this might be a limitation
in practice the thing that we see is
that as you add a lot of data and your
model gets much more powerful you can
still find your way around it but it
might take more data and a bigger model
than necessary
and of course we hope that all the new
state-of-the-art methods coming out of
the deep learning community are going to
give us an even better solution okay
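The independence assumption mentioned in this answer can be written down explicitly. In the standard CTC formulation (the notation here is assumed, it does not appear in the talk), x is the input audio with T frames, y is the transcription, pi is a frame-level alignment over characters plus blank, and B is the mapping that collapses repeats and removes blanks:

```latex
p(y \mid x) \;=\; \sum_{\pi \in \mathcal{B}^{-1}(y)} \; \prod_{t=1}^{T} p(\pi_t \mid x)
```

The product over t is where the assumption sits: given the audio, the symbol emitted at each frame is treated as independent of the symbols emitted at every other frame, so any dependence between characters has to be captured through the network's view of x.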
right
empirically determined yeah so the
question is for a spectrogram we
talked about these little spectrogram
frames being computed from 20
milliseconds of audio and is that number
special is there a reason for it so this
is really determined from years and
years of experience this is captured
from the traditional speech community we
know this works pretty well there's
actually some fun things you can do you
can take a spectrogram go back and find
the best audio that corresponds to that
spectrogram to listen to it and see if
you lost anything
and with spectrograms of about this level of
quantization you can kind of tell what
people are saying it's a little bit
garbled but it's still actually pretty
good so amongst all the hyper parameters
you could choose this one's kind of a
good trade-off in keeping the
information but also saving a little bit
of the phase by doing it frequently yeah
I think in a lot of the models in
the demo for example we don't use
overlapping windows they're just
adjacent yeah
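The framing described in this answer, adjacent 20 millisecond windows each turned into one spectrogram frame, can be sketched in plain Python. Several details are assumptions made for illustration and are not stated in the talk: the log power scaling, the absence of a window function, and the naive DFT, which stands in for the FFT routine a real system would use.

```python
import cmath
import math


def frame_signal(samples, sample_rate, window_ms=20):
    """Cut the signal into adjacent, non-overlapping frames of window_ms."""
    frame_len = sample_rate * window_ms // 1000
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def log_power_spectrum(frame):
    """Log power of the non-negative frequency bins of one frame (naive DFT)."""
    n = len(frame)
    bins = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        bins.append(math.log(abs(acc) ** 2 + 1e-10))  # small floor avoids log(0)
    return bins


def spectrogram(samples, sample_rate, window_ms=20):
    """One row per 20 ms frame, one column per frequency bin."""
    return [log_power_spectrum(f)
            for f in frame_signal(samples, sample_rate, window_ms)]


# Hypothetical test signal: one second of a 100 Hz tone sampled at 1 kHz.
sample_rate = 1000
tone = [math.sin(2 * math.pi * 100 * t / sample_rate) for t in range(sample_rate)]
spec = spectrogram(tone, sample_rate)
print(len(spec), "frames of", len(spec[0]), "frequency bins")
```

Each frequency bin in a frame spans sample_rate / frame_len hertz, so shorter windows give finer time resolution at the cost of coarser frequency resolution, which is the trade-off the 20 millisecond choice balances.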
yeah so those results are from
in-house software at Baidu if you use
something like Open MPI for example on a
cluster of GPUs it actually works pretty
well on a bunch of machines but I think
some of the algorithms like all-reduce
once you start moving huge amounts of
data they're not optimal you'll suffer a
hit once you start going to that many
GPUs within a single box if you use the
CUDA libraries to move data back and
forth just on a local box that stuff is
pretty well optimized and you can often
do it yourself okay
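The back-of-the-envelope estimate and the "speed of light" check from the computation part of the talk can be sketched in a few lines. The total of 1.2e19 flops and the 5 teraflops peak are the figures quoted in the talk; the per-model parameters in the second half are hypothetical placeholders, since the actual Deep Speech values are not given here, and the function names are made up for this sketch.

```python
def training_flops(connections, frames_per_utterance, utterances, epochs,
                   passes=3, flops_per_connection=2):
    """Lower bound on the total flops for one training run.

    passes=3 is the factor of three from the talk (forward plus backward
    propagation); flops_per_connection=2 counts one multiply and one add.
    """
    return (connections * frames_per_utterance * utterances * epochs
            * passes * flops_per_connection)


def ideal_seconds(total_flops, peak_flops_per_second):
    """The 'speed of light': run time if the GPU ran at peak the whole time."""
    return total_flops / peak_flops_per_second


def efficiency(ideal, measured):
    """Fraction of peak actually achieved, given a measured run time."""
    return ideal / measured


SECONDS_PER_DAY = 86400.0

# Figures quoted in the talk: about 1.2e19 flops on a roughly 5 teraflops GPU.
ideal = ideal_seconds(1.2e19, 5e12)
print("speed-of-light training time in days:", ideal / SECONDS_PER_DAY)

# Hypothetical model and data set sizes, for illustration only.
example = training_flops(connections=50_000_000, frames_per_utterance=300,
                         utterances=3_000_000, epochs=20)
print("example run needs", example, "flops")

# Hypothetical measurement: a run that took twice the ideal time.
print("fraction of peak achieved:", efficiency(ideal, 2 * ideal))
```

With the talk's numbers the ideal time comes out at just under 28 days, which matches the "about a month" figure. If a measured run falls far below the roughly 50 percent of peak mentioned in the talk, that is the cue to look for an unoptimized code path.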
so I want to take a few more questions
at the end and maybe we can run into the
break a little bit I wanted to just dive
right through a few comments about
production here so of course the
ultimate goal of solving speech
recognition is to improve people's lives
and enable exciting products and so that
means even though so far we've trained a
bunch of acoustic and language models we
also want to get these things in
production and users tend to care about
more than just accuracy accuracy of
course matters a lot but we also care
about things like latency users want to
see the engine send them some feedback
very quickly so that they know that it's
responding and that it's understanding
what they're saying and we also need
this to be economical so that we can
serve lots of users without breaking the
bank
so in practice a lot of the neural
networks that we use in research papers
because they're awesome for beating
benchmark results turn out not to work
that well on a production engine so one
in particular that I think is worth
keeping an eye on is that it's really
common to use bi-directional recurrent
neural networks and so throughout the
talk I've been drawing my RNN with
connections that just go forward in time
but you'll see a lot of research results
that also have a pass that goes backward
in time and this works fine if you just
want to process data offline but the
problem is that if I want to compute
this neuron's output up at the top of my
network
I have to wait until I see the entire
audio segment so that I can compute this
backward recurrence and get this
response so this sort of anticausal
part of my neural network that gets to
see the future means that I can't
respond to a user on the fly because I
need to wait for the end of their signal
so if you start out with these
bi-directional RNNs that are actually
much easier to get working and then you
jump to using a recurrent network that
is forward only it'll turn out that
you're going to lose some accuracy and
you might kind of hope that CTC because
it doesn't care about the alignment
would somehow magically learn to shift
the output over to get better accuracy
and just artificially delay the response
so that it could get more context on its
own but it kind of turns out to only do
that a little bit in practice it's
really tough to control it and so if you
find that you're doing much worse
sometimes you have to sort of engage in
model engineering so even though I've
been talking about these recurrent
networks I want you to bear in mind that
there's this dual optimization going on
you want to find a model structure that
gives you really good accuracy but you
also have to think carefully about how
you set up the structure so that this
little neuron at the top can actually
see enough context to get an accurate
answer and not depend too much on
the future so for example what we could
do is tweak this model so that this
neuron at the top that's trying to
output the character L in hello can see
some future frames but it doesn't have
this backward recurrence so it only gets
to see a little bit of context that lets
us kind of contain the amount of latency
in the model I'll skip over this so in
terms of other online aspects of course
we want this to be efficient right we
want to serve lots of users on a small
number of machines if possible and one
of the things you might find
if you have a really big deep neural
network or recurrent neural network is
that it's really hard to deploy them on
conventional CPUs CPUs are awesome for
serial jobs you just want to go as
fast as you can for this one string of
instructions but as we've discovered
with so much of deep learning GPUs are
really fantastic because when we work
with neural networks we love processing
lots and lots of arithmetic in parallel
but it's really only efficient if the
batch that we're working on the hunks of
audio that we're working on are in a
big enough batch so if we just process
one stream of audio so that my GPU is
multiplying matrices times vectors then
my GPU is going to be really inefficient
so for example on something like a K1200 GPU this
is something you could put in a server
in the cloud what you'll find is that
you get really poor throughput
considering the dollar value of this
Hardware if you're only processing one
piece of audio at a time whereas if you
could somehow batch up audio to have say
10 or 32 streams going at once then you
can actually squeeze out a lot more
performance from that piece of hardware
so one of the things that we've been
working on that works really well and is not
too bad to implement is to just
batch all of the packets as data comes
in so if I have a whole bunch of users
talking to my server and they're sending
me little hundred millisecond packets of
audio what I can do is I can sit and I
can listen to all these users and when I
catch a whole batch of utterances coming
in or a whole bunch of audio packets
coming in from different people that
start around the same time I plug those
all into my GPU and I process those
matrix multiplications together so
instead of multiplying a matrix times
only one little audio piece I get to
multiply it by a batch of say four audio
pieces and it's much more efficient and
if you actually do this on a live server
and you plow a whole bunch of audio
streams through it you could support
maybe 10 20 30 users in parallel and as
the load on that server goes up I have
more and more users piling on what
happens is that the GPU will naturally
start batching up more and more packets
into single matrix multiplications so as
you get more users you actually get much
more efficient as well and so in
practice when you have a whole bunch of
users on one machine you usually don't
see matrix multiplications happening
with fewer than maybe a batch size of
four so the summary of all of this is
that deep learning is really making
the first steps to building a
state-of-the-art speech engine easier
than they've ever been so if you want to
build a new state-of-the-art speech
engine for some new language all the
components that you need are things that
we've covered so far and the performance
now is really significantly driven by
data and models and I think as we were
discussing earlier I think future models
from deep learning are going to make
that influence of data and computing
power even stronger and of course data
and compute is important so that we can
try lots and lots of models and keep
making progress and I think this
technology is now at a stage where it's
not just a research system anymore we're
seeing that the end-to-end deep learning
technologies are now mature enough that
we can get them into production I think
you guys are going to be seeing deep
learning play a bigger and bigger role in
the speech engines that are powering all
the devices that we use so thank you
very much
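The server-side batching described in the production part of the talk can be sketched as below. This is a minimal sketch under stated assumptions, not the production implementation: packets are already feature vectors, the network is reduced to a single weight matrix, at most one packet per user goes into a batch so per-user ordering is preserved, and all names are hypothetical.

```python
from collections import deque


def drain_batch(pending, max_batch=32):
    """Take whatever packets are waiting, at most one per user.

    Later packets from a user already in the batch go back to the front
    of the queue in their original order.
    """
    batch, seen, deferred = [], set(), deque()
    while pending and len(batch) < max_batch:
        user, packet = pending.popleft()
        if user in seen:
            deferred.append((user, packet))
        else:
            seen.add(user)
            batch.append((user, packet))
    pending.extendleft(reversed(deferred))
    return batch


def matmul_batch(weights, vectors):
    """One matrix-times-batch product instead of many matrix-vector products."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weights]
            for vec in vectors]


def serve_once(pending, weights, max_batch=32):
    """Process one batch of waiting packets and return outputs keyed by user."""
    batch = drain_batch(pending, max_batch)
    if not batch:
        return {}
    users = [user for user, _ in batch]
    outputs = matmul_batch(weights, [packet for _, packet in batch])
    return dict(zip(users, outputs))


# Hypothetical traffic: two users, one of whom has sent two packets.
pending = deque([("a", [1, 0]), ("b", [0, 1]), ("a", [1, 1])])
weights = [[1, 2], [3, 4]]
result = serve_once(pending, weights)
print(result)
print("still waiting:", list(pending))
```

Because the batch is simply whatever is waiting when the GPU becomes free, the batch size grows on its own as load increases, which is the effect described in the talk.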
I think we're right at the end of time
sounds good
alright we had one in the back who's
waiting patiently go ahead more than one
voice simultaneously so the question is
how does the engine handle more than one
voice simultaneously so right now
there's nothing in this formalism that
allows you to account for multiple
speakers and so usually when you listen
to an audio clip in practice it's clear
that there's one dominant speaker and so
the speech engine of course learns
whatever it was taught from the labels
and it will try to filter out background
speakers and just transcribe the
dominant one but if it's really
ambiguous then the results are undefined
you customize the transcription to the
specific characteristics of a particular
speaker so we're not doing that in these
pipelines right now but of course a lot
of different strategies have been
developed in the traditional speech
literature there are things like
i-vectors that try to quantify someone's
voice and those make useful features for
improving speech engines you could also
imagine taking a lot of the concepts
like embeddings for example and tossing
them in here so I think a lot of that is
left open to future work I do a question
button I think we have to break for time
but I'll step off stage here and you
guys can come to me with your questions
thank you so much
so we'll reconvene at 2:45 for
presentation by Alex