Transcript
g-sndkf7mCs • Deep Learning for Speech Recognition (Adam Coates, Baidu)
So I want to tell you about speech recognition and deep learning. Deep learning has been playing an increasingly large role in speech recognition, and one of the things I think is most exciting about this field is that speech recognition is at a place right now where it's becoming good enough to enable really exciting applications that end up in the hands of users. For example, if we want to caption video content and make it accessible to everyone: it used to be that you still needed a human to get really good captioning for something like a lecture, but it's possible that we can do a lot of this with higher quality in the future with deep learning. We can do things like hands-free interfaces in cars, making it safer to use technology while we're on the go, keeping people's eyes on the road, and making mobile devices and home devices much easier, more efficient, and more enjoyable to use. Another fun recent study that some folks at Baidu participated in, along with Stanford and UW, showed that even for something straightforward that we take for granted as an application of speech, which is just texting someone with voice or dictating a piece of text, you can actually go three times faster with the voice recognition systems that are available today. So it's not just a little bit faster; even with the errors that a speech recognition system can make, it's actually a lot faster. The reason I wanted to highlight this result, which is pretty recent, is that the speech engine used for this study is powered by a lot of the deep learning methods I'm going to tell you about. Hopefully when you walk away today you'll have an appreciation and an understanding of the high-level ideas that make a result like this possible.

There are a whole bunch of different components that make up a complete speech application. For example, there's
speech transcription: if I just talk, I want to come up with words that represent whatever I just said. There are also other tasks, though, like word spotting or triggering: if my phone is sitting over there and I want to say "hey phone, go do something for me," it actually has to be listening continuously for me to say that trigger word. Likewise there are things like speaker identification or verification, so that if I want to authenticate myself, or I want to be able to tell apart different users in a room, I've got to be able to recognize your voice even though I don't know what you're saying. These are different tasks, and I'm not going to cover all of them today. Instead I'm going to focus on the bread and butter of speech recognition: building a speech engine that can accurately transcribe audio into words. That's our main goal, and it's a very basic goal of artificial intelligence. Historically, people are very good at listening to someone talk, just like you're listening to me right now; you can very quickly turn audio into words and into meaning on your own, almost effortlessly. For machines this has historically been incredibly hard, so you can think of this as one of those consummate AI tasks. The goal of building a speech pipeline is this: if you give me a raw audio wave, like one you recorded on your laptop or your cell phone, I want to somehow build a speech recognizer that can do the very simple task of printing out "hello world" when I actually say "hello world." Before I dig into the deep learning part, I want to step back and spend maybe ten minutes talking about how a traditional speech recognition pipeline works, for two reasons. First, if you're out in the wild, doing an internship, trying to build a speech recognition system with a lot of the tools that are out there, you're going to bump into a lot of systems built on technologies that look like this, so I
want you to understand a little bit of the vocabulary and how those things are put together. Second, this will give you a story for what deep learning is doing in speech recognition today that is kind of special, and that I think paves the way for much bigger results in the future.

Traditional systems break the problem of turning an audio wave into a transcription into a bunch of different pieces. I'm going to start with my raw audio, which I'll just represent by X. Usually we then have to decide on some kind of feature representation: we convert the audio into some form that's easier to deal with than a raw wave. In a traditional speech system I often have something called an acoustic model, whose job is to learn the relationship between these features that represent my audio and the words someone is trying to say. Then I'll often have a language model, which encapsulates all of my knowledge about what kinds of words, what spellings, and what combinations of words are most likely in the language I'm trying to transcribe. Once you have all of these pieces (and these different models might be driven by machine learning themselves), what you need to build in a traditional system is something called a decoder. The job of the decoder, which itself might involve some modeling effort and machine learning algorithms, is to find the sequence of words W that maximizes the probability of that sequence given your audio. That's straightforward, and it's equivalent to maximizing the product of the contributions from your acoustic model and your language model. So a traditional speech system is broken down into these pieces, and a lot of the effort in getting the system to work goes into developing the portion that combines them all.

It turns out that if you want to just directly transcribe
audio, you can't just go straight to characters. The reason, which is especially apparent in English, is that the way something is spelled in characters doesn't always correspond well to the way it sounds. If I give you the word "night" without context, you don't really know whether I'm talking about a knight in armor or night as in evening. A way to abstract this problem away in a traditional system is to use an intermediate representation: instead of trying to predict characters, I'll try to predict something called phonemes. As an example, if I want to represent the word "hello," I might break it down into units of sound: first the "h" sound in hello, then an "e" sound (which is only one possible pronunciation of an "e"), then an "l" and an "o" sound. That would be the string I try to come up with using all of my different speech components. In one sense this makes the modeling problem easier; my acoustic model and so on can be simpler, because I don't have to worry about spelling. But it does have the problem that I have to think about where these phonemes come from. Intuitively, phonemes are the perceptually distinct units of sound that we use to distinguish words, and they're very approximate; it might be our imagination that these things actually exist, and it's not clear how fundamental they are. But they're sort of standardized: there are a bunch of different conventions for how to define them, and if you end up working on a system that uses phonemes, one popular data set is called TIMIT, which has a corpus of audio frames with examples of each of these phonemes. Once you have this phoneme representation, it unfortunately adds even more complexity to the traditional pipeline, because now my acoustic model doesn't associate the audio features with words; it
actually associates them with another kind of transcription, a transcription into phonemes. So I have to introduce yet another component into my pipeline that understands how to convert transcriptions in phonemes into actual spellings, and I need some kind of dictionary or lexicon to tell me all of that. This is a way of taking our knowledge about a language and baking it into an engineered pipeline. And once you've got all that, again, all of your work goes into the decoder, which now has a slightly more complicated task: inferring the most likely word transcription given the audio.

This is a tried and true pipeline. It's been around for a long time, you'll see a whole bunch of these systems out there, and we're still using a lot of the vocabulary from these systems. Traditionally, the big advantage is that it's very tweakable: if you want to add a new pronunciation for a word you've never heard before, you can just drop it right in. That's great, but it's also really hard to get working well. If you start from scratch with this system and you have no experience in speech recognition, it's actually quite confusing and hard to debug; it's very difficult to know which of these various models is the one behind your error. And once we start dealing with things like accents, heavy noise, and different kinds of ambiguity, the problem becomes even harder to engineer around, because trying to reason ourselves about how to tweak, say, my pronunciation model to account for an accent I haven't heard is a very hard engineering judgment to make. There are all kinds of design decisions that go into this pipeline, like choosing the feature representation.

So the first place that deep learning started to make an impact in speech recognition, starting a few years ago, was to take one of the core machine learning components of the system and replace it with a deep learning algorithm. I
mentioned back in this previous pipeline that we had this little model whose job is to learn the relationship between a sequence of phonemes and the audio we're hearing. This is called the acoustic model, and there are lots of different methods for training it; take your favorite machine learning algorithm and you can probably find someone who has trained an acoustic model with it, whether it's a Gaussian mixture model or a bunch of decision trees and random forests, anything for estimating these kinds of densities. There's a lot of work on trying to make better acoustic models. Some work by George Dahl and co-authors took what was a state-of-the-art deep learning system back in 2011, a deep belief network with some pre-training strategies, and dropped it into a state-of-the-art pipeline in place of the acoustic model. The results were actually pretty striking: even though we had had neural networks in these pipelines for a while, when you replaced the Gaussian mixture model in the existing GMM-HMM system with this deep belief network as the acoustic model, you got something between a ten and twenty percent relative improvement in accuracy, which is a huge jump. That's highly noticeable to a person, and if you compare it to the amount of progress that had been made in the preceding years, it's a giant leap for a single paper to make. So this is in some sense the first generation of deep learning for speech recognition: take one of these components and swap it out for your favorite deep learning algorithm.

With these traditional speech recognition pipelines, the problem we would always run into is that if you gave me a lot more data, or gave me a much bigger computer so that I could train a huge model, that actually didn't help me, because all the problems I had were in
the construction of the pipeline itself. Eventually, given more data and a bigger computer, the performance of our speech recognition system would just kind of peter out; it would reach a ceiling that was very hard to get over. So we'd start coming up with lots of different strategies: specializing for each application, specializing for each user, trying to make things a little bit better around the edges. What these deep learning acoustic models did was, in some sense, move that barrier a little ways: they made it possible for us to use a bit more data and much faster computers that let us try a whole lot of models, and that moved the ceiling up quite a ways. The question that many in the research community, including folks at Baidu, have been trying to answer is: can we go to a next-generation version of this insight? Can we, for instance, build a speech engine that is powered by deep learning all the way from the audio input to the transcription itself? Can we replace as much of that traditional system with deep learning as possible, so that over time, as you give researchers more data, bigger computers, and the ability to try more models, speech recognition performance just keeps going up, and we can potentially solve speech for everybody? The goal of this tutorial is not to get you all the way up there, which requires a whole bunch of things that I'll tell you about near the end. What we want to do is give you enough to get a point on this curve; once you're on the curve, the idea is that what remains is a problem of scale. It's about data, about getting bigger computers, and about coming up with ways to build bigger models. That's my objective: when you walk away from here, you'll have a picture of what you would need to build to get that point, and after that it's hopefully all about scale. Thanks to Vinay Rao, who's been helping put this tutorial together, there is going to be some starter code live for the basic
pipeline, the deep learning part of the pipeline that we're talking about. There are some open source implementations of things like CTC out there, but we wanted to make sure there's a system that's pretty representative of the acoustic models I'm going to be talking about in the first half of the presentation. This will be enough that you can get a simple pipeline going with something called max decoding, which I'll tell you about later, and the idea is that this is a scale model of the acoustic models that are powering real production speech engines at Baidu and other places. So this will get you that point on the curve.

Okay, so here's what we're going to talk about. First I'll introduce a few preliminaries and talk about pre-processing; we still have a little bit of pre-processing around, but it's not really fundamental, and I think it's probably going to go away in the long run. Then we'll talk about what is probably the most mature piece of sequence learning technology for deep learning right now: it turns out that one of the fundamental problems of speech recognition is how to build a neural network that can map an audio signal to a transcription of quite variable length, and CTC is one highly mature method for doing this (you're actually going to hear about some other solutions later today). Then I'll say a little bit about training and what that looks like, and finally a bit about decoding and language models, which is sort of an addendum to the current acoustic models we can build that makes them perform a lot better. Once you have all this, that's a picture of what you need to get that point on the curve. Then I'll talk a little about what's remaining: how do you scale up from this little scale model to the full thing, and what does that actually entail? And then, time permitting, we'll talk a little bit about production: how could you put something
like this into a cloud server and actually serve real users with it?

Great. So how is audio represented? This should be pretty straightforward. Unlike a two-dimensional image, where we normally have a 2D grid of pixels, audio is just a 1D signal. There are a bunch of different formats for audio, but typically this one-dimensional wave, which is actually me saying something like "hello world," is sampled at 8,000 or 16,000 samples per second, and each sample is quantized into 8 or 16 bits. So when we represent the audio signal that's going to go into our pipeline, you can just think of it as a one-dimensional vector: when I have that box called X that represents my audio signal, you can picture it being broken down into samples x1, x2, and so forth. If I had a one-second audio clip, this vector would have a length of, say, 8,000 or 16,000 samples, and each element would be, say, a floating-point number extracted from the 8- or 16-bit sample. This is really simple.

Now, once I have an audio clip, we'll do a little bit of pre-processing. There are a couple of ways to start. The first is to do some vanilla pre-processing, like converting to a simple spectrogram. If you look at a traditional speech pipeline, you're going to see things like MFCCs, which are mel-frequency cepstral coefficients, and a whole bunch of plays on spectrograms where you take differences of different kinds of features and try to engineer complex representations. But for the stuff we're going to do today, a simple spectrogram is just fine; as you'll see in a second, we lose a little bit of information when we do this, but it turns out not to make a huge difference. I said a moment ago that I think this pre-processing is probably going to go away in the long run, and that's because today you can find recent research on doing away with even this step and having your neural network
process the audio wave directly, training its own feature transformation; there are some references at the end that you can look at for this.

As a quick straw poll: how many people have seen or computed a spectrogram before? Pretty good, maybe 50%. Okay, so the idea behind a spectrogram is that it's sort of a frequency-domain representation, but instead of representing the entire signal in terms of frequencies, I'm just going to represent a small window in terms of frequencies. To process an audio clip, the first thing I do is cut out a little window, typically about 20 milliseconds long. When you get down to that scale, it's usually very clear that these audio signals are made up of a combination of sine waves at different frequencies. Then we compute an FFT, which basically converts this little signal into the frequency domain, and we take the log of the power at each frequency. The result tells us, for every frequency of sine wave, the magnitude, the amount of power, contributed by that sine wave to the original signal. In this example we have a very strong low-frequency component in the signal, and then differing magnitudes at differing frequencies. We can just think of this as a vector: instead of representing this little 20-millisecond slice as a sequence of audio samples, I represent it as a vector where each element gives the strength of one frequency in the little window. The next step is that, having processed one little window this way, you can of course apply the same thing to a whole bunch of windows across the entire piece of audio, and that gives you what we call a spectrogram. You can use disjoint windows that are just adjacent, or you can apply them to overlapping windows if you like; there's a little bit of parameter tuning there. This is an alternative representation of the audio signal that happens to be easier to use for a lot of purposes.

Okay, so our goal, starting from this representation, is to build what I'm going to call an acoustic model, but which, to the extent we can make it happen, is really going to be an entire speech engine represented by a neural network. What we would like to do is build a neural net that we train from a whole bunch of pairs: X, my original audio that I turn into a spectrogram, and Y*, the ground-truth transcription that some human has given me. If I train this big neural network off of these pairs, I'd like it to produce some kind of output, which I'm representing by the characters C, from which I can later extract the correct transcription, which I'll denote by Y. So if I say "hello," the first thing I'm going to do is run pre-processing to get all these spectrogram frames, and then I'm going to have a recurrent neural network that consumes each frame and processes it into some new representation C, and hopefully I can engineer my network in such a way that I can just read the transcription off of these output neurons. That's the intuitive picture of what we want to accomplish.

As I mentioned back in the outline, there's one obvious fundamental problem here: the length of the input is not the same as the length of the transcription. If I say "hello" very slowly, I can have a very long audio signal even though I didn't change the length of the transcription, and if I say "hello" very quickly, I get a very short piece of audio. That means the output of my neural network changes length, and I need to come up with some way to map this variable-length neural network output to the fixed-length transcription, and to do it in a way that lets us actually train the pipeline.
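The window-then-FFT procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not production DSP: it uses a naive O(n²) DFT from the standard library instead of a real FFT, a plain rectangular window rather than the tapered windows most toolkits apply, and made-up parameter choices (16 kHz audio, 20 ms windows, 10 ms hop) that are merely typical, not prescribed by the talk.

```python
import cmath
import math

def spectrogram(samples, sample_rate=16000, window_ms=20, hop_ms=10):
    """Turn a 1-D list of audio samples into a list of log-power spectra.

    Each output vector describes one ~20 ms window: element k is the log
    of the power in the k-th frequency bin of that window.
    """
    window = int(sample_rate * window_ms / 1000)   # samples per window
    hop = int(sample_rate * hop_ms / 1000)         # stride between windows
    frames = []
    for start in range(0, len(samples) - window + 1, hop):
        chunk = samples[start:start + window]
        # Naive DFT; a real FFT computes the same thing in O(n log n).
        spectrum = []
        for k in range(window // 2 + 1):           # non-negative frequencies
            s = sum(x * cmath.exp(-2j * math.pi * k * n / window)
                    for n, x in enumerate(chunk))
            power = abs(s) ** 2
            spectrum.append(math.log(power + 1e-10))  # log power, floored
        frames.append(spectrum)
    return frames

# A 100 ms clip of a 50 Hz sine wave at 16 kHz: the strong low-frequency
# component shows up as a peak in the lowest bins of every frame.
tone = [math.sin(2 * math.pi * 50 * t / 16000) for t in range(1600)]
frames = spectrogram(tone)
```

With a 10 ms hop, the 100 ms clip yields nine overlapping frames of 161 frequency bins each; stacking them column by column gives the spectrogram picture from the slides.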
The traditional way to deal with this problem, if you were building a speech engine several years ago, was to bootstrap the whole system. I would first train a neural network to correctly predict the sounds at every frame, using a data set like TIMIT where someone has lovingly annotated all of the phonemes for me. Then I'd figure out the alignment between the phonetic transcription of my saying "hello" and the input audio, and once I've lined up all of the sounds with the input audio, I don't care about length anymore, because I can make a one-to-one mapping between the audio input and the phoneme outputs I'm trying to target. But this alignment process is horribly error-prone, and you have to do a lot of extra work to make it work well, so we really don't want to do this; we want a solution that handles the variable length directly. There are multiple ways to do it, and as I mentioned, there's current research on using things like attentional sequence-to-sequence models, which you'll hear about later, to solve this kind of problem. But as I said, we'll focus on something called connectionist temporal classification, or CTC, which is sort of the current state of the art for how to do this.

Here's the basic idea. Our recurrent neural network has output neurons that I'm calling C, and the job of these output neurons is to encode a distribution over the output symbols. Because of the structure of the recurrent network, the length of this symbol sequence C is the same as the length of my audio input: if my audio input was two seconds long, it might have a hundred audio frames, and that would mean the length of C is also a hundred values. If we were working on a phoneme-based model, then C would be some kind of phoneme representation (we would also include a blank symbol, which is special to CTC). But if, as we'll do in the rest of this talk,
we're trying to predict the graphemes, the characters of the language, directly from the audio, then I'll just let C take on a value in my alphabet, or a blank, or a space if my language has spaces in it. The second thing I'm going to do, since my RNN gives me a distribution over these symbols C, is define some kind of mapping that can convert this long transcription C into the final transcription Y, like "hello," the actual string that I want. And recognizing that C is itself a probabilistic creature, there's a distribution over choices of C that correspond to the audio, so once I apply this function, there's also a distribution over Y, over the possible transcriptions I could get. What I'll want to do to train my network is maximize the probability of the correct transcription given the audio. Those are the three steps we have to accomplish to make CTC work.

So let's start with the first one. We have these output neurons C, and they represent a distribution over the different symbols I could be hearing in the audio. I've got some audio signal down here, you can see the spectrogram frames poking up, and it's being processed by the recurrent neural network; the output is a big bank of softmax neurons. For the first frame of audio, I have a neuron corresponding to each of the symbols that C could represent, and this set of softmax neurons, with outputs summing to 1, represents the probability of, say, c1 having the value "a," "b," "c," and so on, or the special blank character. For example, if I pick the neuron in the row that represents the character "b" and the 17th column, the 17th frame in time, it represents the probability that c17 is the character "b" given the audio.
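The bank of softmax neurons for one frame can be sketched directly. Everything here is illustrative: the alphabet (blank, space, a-z) matches the talk's setup, but the raw scores are invented numbers standing in for the output of a real recurrent network.

```python
import math

# Output alphabet: a blank symbol ('_'), a space, and the letters a-z.
SYMBOLS = ['_', ' '] + [chr(c) for c in range(ord('a'), ord('z') + 1)]

def softmax(logits):
    """Turn one frame's raw network scores into probabilities summing to 1."""
    m = max(logits)                           # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for a single audio frame (not from a real
# network): the neuron for 'h' gets the largest score.
logits = [0.1] * len(SYMBOLS)
logits[SYMBOLS.index('h')] = 3.0

probs = softmax(logits)
p = dict(zip(SYMBOLS, probs))
# p['h'] is the model's probability that this frame's symbol is 'h';
# one such distribution exists for every frame (every column).
```

A full utterance is then a matrix of these distributions, one column per spectrogram frame, which is exactly the grid of neurons being described above.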
Once I have these per-frame distributions, I can define a distribution not just over individual characters: if I assume all of the characters are independent (which is a naive assumption, but we bake it into the system), I can define a distribution over all possible sequences of characters in this alphabet. So if I give you a specific character string in this alphabet, for instance representing the string "hello" as "hhh," blank, "ee," blank, blank, "ll," blank, "lo," followed by a bunch of blanks, I can just use this formula, the product of the per-frame probabilities, to compute the probability of that specific sequence of characters. That's how we compute the probability of a sequence of characters that has the same length as the audio input.

The second step, and this is in some sense the neat trick in CTC, is to define a mapping from this long encoding of the audio into symbols down to the actual transcription we're trying to predict. The rule is this: the operator takes the character sequence, finds all of the adjacent characters that are repeated, discards the duplicates keeping just one of them, and then drops all of the blanks. In this example, you see three h's together, so I keep just one h; then I have a blank, which I throw away; I keep one "e"; I have two l's, so I keep one of them; then another blank; and the "lo." One key thing to note is that when I have two different characters right next to each other, I end up keeping both of them in my output, but if I ever want a double character, like the "ll" in "hello," then I need a blank character in between in the encoding. So if our neural network gave me this symbol sequence as the right answer, we just apply this operator and get back the string "hello." Now we have a way to define a distribution over sequences of symbols that are the same length as the audio, and a mapping from those symbol strings down to transcriptions.
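The two pieces just described, the squeezing operator and the independence-based product, can be written out directly. The collapse rule follows the talk exactly; the per-frame probability numbers at the end are made up purely to show the product formula in action.

```python
def collapse(symbols, blank='_'):
    """CTC's squeezing operator: merge adjacent duplicates, then drop blanks."""
    out, prev = [], None
    for s in symbols:
        if s != prev:              # keep only the first of each run
            out.append(s)
        prev = s
    return ''.join(s for s in out if s != blank)

# The encoding from the example: the blank between the two l's is what
# lets the double letter survive, while the adjacent e's merge to one.
assert collapse('hhh_ee__ll_lo') == 'hello'

def sequence_probability(symbols, frame_probs):
    """P(c | x) under the per-frame independence assumption:
    the product over frames t of P(c_t = symbols[t] | x)."""
    p = 1.0
    for t, s in enumerate(symbols):
        p *= frame_probs[t][s]
    return p

# Toy per-frame distributions over a tiny alphabet plus blank
# (made-up numbers, not from a real network):
frame_probs = [{'a': 0.6, 'b': 0.1, '_': 0.3}] * 3
p = sequence_probability('aa_', frame_probs)   # 0.6 * 0.6 * 0.3
```

Note the asymmetry the collapse rule creates: "aa" collapses to "a," but "a_a" collapses to "aa," which is exactly why a blank must sit between genuine double letters.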
As I said, this gives us a probability distribution over the possible final transcriptions. If I look at the probability distribution over all the different sequences of symbols, I might have "hello" written out as on the last slide, with probability 0.1, and then "hello" written a different way, say replacing the first "h" with a blank, with a smaller probability, and a whole bunch of other possible symbol sequences below that. What you'll notice is that if I go through every possible combination of symbols, there are several combinations that all map to the same transcription: here's one version of "hello," a second version, a third version. So if I now ask for the probability of the transcription "hello," the way I compute it is to go through all of the possible character sequences that correspond to "hello" and add up all of their probabilities; I sum over all possible choices of C that could give me that transcription in the end. You can think of this as searching through all the possible alignments: I could shift the characters around a little, move them forward or backward, expand them by adding duplicates or squish them up depending on how fast someone is talking, and that corresponds to every possible alignment between the audio and the characters I want to transcribe. This solves the problem of the variable length, and the way I get the probability of a specific transcription is to sum, to marginalize, over all the feasible alignments. If we have a whole bunch of other possibilities in here, like the word "yellow," I compute them the same way. So this equation just says to sum over all the character sequences C such that, when I apply this little mapping operator, I end up with the transcription Y.

(Audience:) "Is... oh,
I'm missing an 'e'?" You're talking about this one. So when we apply this squeezing operator here, we drop the double "e" to get a single "e" in "hello"; we remove the adjacent duplicates, the same way we did for the "h." Whenever you see two adjacent duplicate characters like this, you squeeze the duplicates down to one. But here we have a blank in between the two l's, so if we drop all the duplicates first, we still have two l's left, and then we remove all the blanks. This gives the algorithm a way to represent repeated characters in the transcription. There's another one in the back? Oh, I see, yes: really I should have put a space character in here instead of a blank, so this could be h-e-l-l-o with a space; the space shown here is erroneous. Okay, very good.

So once I've defined this, I've given you a formula to compute the probability of a string given the audio, and as with every good start to a machine learning algorithm, we go and apply maximum likelihood. I now give you the correct transcription, and your job is to tune the neural network to maximize the probability of that transcription under the model I just defined. In equations: I want to maximize the log probability of Y* for each example, the probability of the correct transcription given the audio X, summed over all the examples. Then I just replace this with the equation from the last page, which says that to compute the probability of a given transcription, I have to sum over all of the possible symbol sequences, all the possible alignments, that could have given me that transcription. Alex Graves and co-authors showed in 2006 that, because of this independence assumption, there is a clever way to do this.
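On toy inputs, this marginalized likelihood can be computed by brute force, which makes the objective concrete before any clever algorithm enters the picture: enumerate every possible symbol sequence of the right length, keep the ones that collapse to the target transcription, and add up their probabilities. The collapse operator is re-defined here so the sketch is self-contained, and the per-frame probabilities are invented. This enumeration is exponential in the number of frames and is only for illustration; the dynamic program discussed next computes the same number efficiently.

```python
import itertools
import math

def collapse(symbols, blank='_'):
    """Merge adjacent duplicates, then drop blanks."""
    out, prev = [], None
    for s in symbols:
        if s != prev:
            out.append(s)
        prev = s
    return ''.join(s for s in out if s != blank)

def ctc_log_likelihood(target, frame_probs, alphabet):
    """log P(target | x), marginalized over every alignment by brute force.

    frame_probs[t][s] = P(c_t = s | x). Exponential in the number of
    frames, so usable only on toy inputs.
    """
    total = 0.0
    for seq in itertools.product(alphabet, repeat=len(frame_probs)):
        if collapse(seq) != target:
            continue                     # this alignment spells something else
        p = 1.0
        for t, s in enumerate(seq):      # independence: product over frames
            p *= frame_probs[t][s]
        total += p                       # marginalize: sum over alignments
    return math.log(total)

# Toy example: 3 frames, alphabet {a, blank}, uniform per-frame probabilities.
frame_probs = [{'a': 0.5, '_': 0.5}] * 3
ll = ctc_log_likelihood('a', frame_probs, ['a', '_'])
# Six of the eight length-3 sequences collapse to "a", so P("a"|x) = 0.75.
```

Training then means adjusting the network's weights so that this log likelihood, summed over all (audio, transcription) pairs, goes up.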
that can efficiently compute this summation for you and not only commute compute this summation so that you can compute the objective function but actually compute its gradient with respect to to the output neurons of your neural network so if you look at the paper the algorithm details are in there what school right now in the history of speech and deep learning is that this is at the level of a technology this is something that's now implemented in a bunch of places so that you can download a software package that efficiently will calculate this ctc loss function for you that can calculate this likelihood and can also just give you back the gradient so I won't go into the equations here instead I'll tell you that there are a whole bunch of implementations on the web that you can now use as part of deep learning packages so one of them from Baidu implements CTC on the GPU is called warp CTC Stanford and group they're actually one of Andrews students has a CTC implementation and there's also now CTC losses implemented in packages like tensor flow so this is something that's sufficiently widely distributed that you can use use these algorithms off the shelf so the way that these work the way that we go about training is we start from our audio spectrogram we have our neural network structure where you get to choose how it's put together and then it outputs this Bank of softmax neurons and then there are pieces of off-the-shelf software that will compute for you the CTC cost function they'll compute this log likelihood given a transcription and the output neurons from your recurrent Network and then the software will also be able to tell you the gradient with respect to the output neurons and once you've got that you're set you can feed them back into the rest of your code and get the gradient with respect to all of these parameters so as I said this is all available now in sort of efficient off-the-shelf software so you don't have to do this work yourself so that's 
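As an aside, the collapse ("squeeze") operator described above takes only a few lines. Here's a minimal sketch, assuming `_` stands for the CTC blank; packages like warp-CTC or TensorFlow's CTC loss handle the summation over alignments and the gradient, so in practice you mainly need the collapse step for decoding:

```python
def ctc_collapse(symbols, blank="_"):
    """Map a CTC symbol sequence to a transcription:
    first squeeze adjacent duplicates, then drop blanks."""
    squeezed = []
    for s in symbols:
        if squeezed and squeezed[-1] == s:
            continue  # adjacent duplicate: keep only one
        squeezed.append(s)
    return "".join(s for s in squeezed if s != blank)

# A blank between two l's is what lets the model emit a repeated letter:
assert ctc_collapse("hh_e_ll_llo") == "hello"
assert ctc_collapse("he_llo") == "helo"  # no blank between l's: they merge
```

The order matters: duplicates are merged first, then blanks are removed, which is exactly why "hello" needs a blank between the two L's.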
That's pretty much all there is to the high-level algorithm. With this, it's actually enough to get a sort of working Drosophila of speech recognition going. There are a few little tricks, though, that you might need along the way — on easy problems you might not need these, but as you get to more difficult datasets with a lot of noise, they become more and more important. The first one, which we've been calling SortaGrad in the vein of all of the *grad algorithms out there, is basically a trick to help with recurrent neural networks. It turns out that when you try to train one of these big RNN models on some off-the-shelf speech data, one of the things that can really get you is seeing very long utterances early in the process, because if you have a really long utterance and your neural network is badly initialized, you'll often end up with things like underflow and overflow as you try to compute the probabilities, and you end up with gradients exploding as you try to do backpropagation, and it can make your optimization a real mess. It's coming from the fact that these utterances are really long and really hard, and the neural network just isn't ready to deal with those transcriptions. So one of the fixes you can use is, during the early part of training — usually in the first epoch — you just sort all of your audio by length, and when you process a mini-batch you take the short utterances first, so that you're working with really short RNNs that are quite easy to train, that don't blow up and don't have a lot of catastrophic numerical problems. Then as time goes by, you start operating on longer and longer utterances that get more and more difficult. We call this SortaGrad; it's basically a curriculum learning method — you can see some work from Yoshua Bengio and his team on a whole bunch of strategies for this — but you can think of the short utterances as being the easy ones, and if you start out with the easy utterances and move to the longer ones, your optimization algorithm can do better.

Here's an example from one of the models that we've trained, where your CTC cost starts up here, and after a while you optimize and you sort of bottom out around a log likelihood of maybe 30; and then if you add this SortaGrad strategy, after the first epoch you're actually doing better, and you can reach a better optimum than you would without it. In addition, another strategy that's extremely helpful for recurrent networks and very deep neural networks is batch normalization. This is becoming very popular, and it's also available as an off-the-shelf package inside a lot of the different frameworks that are available today, so if you start having trouble, you can consider putting batch normalization into your network.

OK, so our neural network now spits out this big bank of softmax neurons, and we've got a training algorithm — we're just doing gradient descent. How do we actually get a transcription? This process, as I said, is meant to be as close to characters as possible, but we still need to decode these outputs. You might think that one simple solution — which turns out to be approximate — to get the correct transcription is to just go through here, pick the most likely sequence of symbols c, and then apply our little squeeze operator to get back the transcription the way that we defined it. This turns out not to be the optimal thing — it actually doesn't give you the most likely transcription, because it's not accounting for the fact that every transcription might have multiple sequences of c's, multiple alignments, in this representation. But you can actually do this, and it's called max decoding. For this sort of contrived example here, I put little red dots on the most likely c at each frame, and you see there's a couple of blanks, a couple of c's, another blank, an a, more blanks, a b, more blanks — and if you apply our little squeeze operator, you just get the word "cab".
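A minimal sketch of the max decoding just described, assuming a `probs` array of per-frame symbol probabilities with the blank at index 0 (the alphabet and values here are made up):

```python
import numpy as np

def max_decode(probs, alphabet, blank=0):
    """Greedy/max decoding: take the argmax symbol at each frame,
    squeeze adjacent duplicates, drop blanks. Approximate, because it
    ignores that many alignments can map to the same transcription."""
    best = probs.argmax(axis=1)   # most likely symbol per frame
    out, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)

alphabet = "_abc"  # index 0 is the blank
# Frames whose most likely symbols are: blank, c, c, blank, a, blank, b
probs = np.eye(4)[[0, 3, 3, 0, 1, 0, 2]]
assert max_decode(probs, alphabet) == "cab"
```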
If you do this, it is often terrible — it'll often give you a very strange transcription that doesn't necessarily look like English. But the reason I mention it is that it's a really handy diagnostic: if you're wondering what's going on in the network, glancing at a few of these will often tell you whether the network's starting to pick up any signal or whether it's just outputting gobbledygook. I'll give you a more detailed example in a second of how that happens. All right, so these are all the concepts of our very simple pipeline, and the demo code that we're going to put up on the web will basically let you work on all of these pieces. Once we try to train these — I want to give you an example of the sort of data that we're training on: "a tanker is a ship designed to carry large volumes of oil." OK, so this is just a person sitting there reading The Wall Street Journal to us. This is a simple dataset that's really popular in the speech research community; it's published by the Linguistic Data Consortium. There's also a free alternative called LibriSpeech that's very similar, except that instead of people reading The Wall Street Journal, it's people reading Creative Commons audiobooks. In the demo code we have a really simple network that works reasonably well, and it looks like this: there's a family of models that we've been working with where you start from your spectrogram, you have maybe one layer or several of convolutional filters at the bottom, and then on top of that you have some kind of recurrent neural network — it might just be a vanilla RNN, but you can also use LSTM or GRU cells, any of your favorite RNN creatures from the literature — and then on top of that we have some fully connected layers that produce these softmax outputs, and those are the things that go into CTC for training. So this is pretty straightforward; the implementation on the web uses the warp-CTC code, and then we just train this big neural network with stochastic gradient descent, Nesterov momentum — all the stuff that you've probably seen in a whole bunch of other talks so far.

All right, so if you actually run this, what is going on inside? I mentioned that looking at the max decoding is a handy way to see what's going on inside this creature, so I wanted to show you an example. This is a visualization of those softmax neurons at the top of one of these big neural networks — this is the representation of c from all the previous slides. On the horizontal axis is basically time — the frame number, or which chunk of the spectrogram we're seeing — and on the vertical axis you see all the characters in the English alphabet, plus a space and a blank. After three hundred iterations of training, which is not very much, the system has learned something amazing, which is that it should just output blanks and spaces all the time — because of all the silence and such in your dataset, these are by far the most common characters, so it just wants to fill up the whole space with blanks. But you can see it's kind of randomly poking out a few characters here, and if you run your little max decoding strategy to see what the system thinks the transcription is, it thinks the transcription is "at". After three hundred iterations, that's OK — this is a sign that the neural network's not going crazy, your gradient isn't busted; it's at least learned what the most likely characters are. Then after maybe 1500 or so iterations, you start to get a little bit of structure, and if you try to mouth these words you might be able to see that there are some English-like sounds in here, like "they are just in frightened" — something kind of odd, but it's actually looking much better than just "h"; it's actually starting to output something. Go a little bit farther and it's a little more organized — you can start to see fragments of possible words starting to form.
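As a rough illustration of the family of models just described (spectrogram → convolution over time → recurrent layer → per-frame softmax), here is a toy numpy forward pass. All the layer sizes and weights are made up for illustration; a real system would use a framework, deeper stacks, and LSTM/GRU cells:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def forward(spectrogram, Wc, Wx, Wh, Wo):
    """Toy version of the pipeline: a 1-D convolution over time,
    a vanilla RNN, then one softmax distribution per output frame."""
    F, T = spectrogram.shape
    k = Wc.shape[1] // F  # conv filter width, in frames
    conv = np.stack([Wc @ spectrogram[:, t:t + k].reshape(-1)
                     for t in range(T - k + 1)])      # (T', H)
    h, outs = np.zeros(Wh.shape[0]), []
    for x in np.tanh(conv):                           # recur over time
        h = np.tanh(Wx @ x + Wh @ h)
        outs.append(softmax(Wo @ h))                  # symbol distribution
    return np.stack(outs)                             # (T', num_symbols)

F, T, H, S, k = 16, 20, 32, 29, 5  # features, frames, hidden, symbols, width
Wc = rng.normal(size=(H, F * k)) * 0.1
Wx = rng.normal(size=(H, H)) * 0.1
Wh = rng.normal(size=(H, H)) * 0.1
Wo = rng.normal(size=(S, H)) * 0.1
probs = forward(rng.normal(size=(F, T)), Wc, Wx, Wh, Wo)
# probs holds one softmax distribution (29 symbols) per output frame
```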
Then, as you get close to convergence, it's still not a real sentence, but does this make sense to people — can you guess what the correct transcription might be? Yeah, so you might have a couple of candidates; the correct one is actually "there just in front," and you can see that it's sort of sounding it out with English characters. I have a young son, and I kind of figure I'm eventually going to see him producing max-decoded outputs of English, just sounding these things out like "there just in front there." But this is why this max decoding strategy is really handy: you can look at this output and say, yeah, it's starting to get some actual signal out of the data; it's not just gobbledygook. So, because this is like my favorite speech recognition party game, I wanted to show you a few more of these. Here's a max-decoded output: "the poor little things cried Cynthia think of them having been turned to the wall all these years." You can hear the sound of the breath at the end turning into a little bit of a word, and "Cynthia" is sort of in this transcription. You'll find that things like proper names tend to get sounded out — if those names are not in your audio data, there's no way the network could have learned how to spell the name Cynthia. We'll come back to how to solve that later. Did you see the true label? "The poor little things, cried Cynthia" — and the last phrase is actually "all these years," and there isn't a word hanging off at the end. Here's another one — "true bad grade." How many people figured out what this is? This is the max-decoded transcription. Sounds good to you? It sounds good to me — if you told me this was the ground truth, I'd think, oh, that's weird, I'd have to go look up what this is. Here's the actual true label: it turns out this is a French word, "badauderie," that means something like rubbernecking. I had no idea what this word was. So this is again one of the cool examples of what these neural networks are able to figure out with no knowledge of the language itself.

OK, so let's go back to decoding. We just talked about max decoding, which is an approximate way of going from these probability vectors to a transcription y. If you want to find the actual most likely transcription y, there's actually no algorithm in general that can give you the perfect solution efficiently. The reason, remember, is that for a single transcription y, I have an efficient algorithm to compute its probability; but if I want to search over every possible transcription, I don't know how to do that, because there are combinatorially — exponentially — many possible transcriptions, and I'd have to run this algorithm to compute the probability of all of them. So we have to resort to some kind of generic search strategy, and one proposed briefly in the original paper is a sort of prefix decoding strategy. I don't want to spend a ton of time on this; instead, I want to step to the next piece of the picture. There were a bunch of examples in there — proper names like Cynthia, and things like "badauderie" — where unless you had heard this word before, you have no hope of getting it right with your neural network. There are lots of examples like this in the literature of things that are spelled out phonetically but aren't legitimate English transcriptions. So what we'd like to do is come up with a way to fold in just a little bit of knowledge about the language — take a small step backward from a perfect end-to-end system — and make these transcriptions better. As I said, the real problem here is that you don't have enough audio available to learn all these things: if we had millions and millions of hours of audio sitting around, you could probably learn all these transcriptions, because you'd hear enough words to know how to spell them all, maybe the way a human does. But unfortunately, we just don't have enough audio for that.
So we have to find a way to get around that data problem. There's also an example of something that in the AI lab we've dubbed the Tchaikovsky problem: there are certain names in the world — proper names — where if you've never heard of them before, you have no idea how they're spelled, and the only way to know is to have seen the word in text before, in context. Part of the purpose of these language models is to get examples like this correct. So there are a couple of solutions. One would be to step back to a more traditional pipeline — use phonemes — because then we can bake new words in along with their phonetic pronunciations and the system will just get them right. But in this case, I want to focus on fusing in a traditional language model that gives us the a priori probability of any sequence of words. The reason this is helpful is that we can train these language models from massive text corpora — we have way, way more text in the world than we have transcribed audio — and that makes it possible to train giant language models with huge vocabularies. They can also pick up the contextual cues that tip you off to the fact that "Tchaikovsky concerto" is a reasonable thing for a person to ask, and that a transcription like "trike offski concerto," which we have seen in the past — even though it's composed of legitimate English words — is nonsense.

There's actually not much to say on the language modeling front here, except that the reasons for sticking with traditional n-gram models are kind of interesting if you're excited about speech applications. You can go use a package like KenLM on the web to build yourself a giant n-gram language model — these are really simple and well supported, which makes them easy to get working, and they'll let you train from lots of corpora. But for speech recognition in practice, one of the nice things about n-gram models, as opposed to trying to use, say, an RNN model, is that we can update them very quickly: if you have a big distributed cluster, you can update that n-gram model very rapidly, in parallel, from new data, to keep track of whatever the trending words are today that your speech engine might need to deal with. We also need to query this thing very rapidly inside our decoding loop, which you'll see in just a second, and being able to just look up probabilities in a table, the way an n-gram model is structured, is very valuable. I hope someday all of this will go away and be replaced with an amazing neural network, but this is really the best practice today.

So, in order to fuse this into the system: to get the most likely transcription — to maximize the probability of y given x — we need a generic search algorithm anyway, and this opens up a door. Once we're using a generic search scheme to do our decoding and find the most likely transcription, we can add some extra cost terms. In a previous piece of work from Awni Hannun and several co-authors, what you do is take the probability of a given word sequence from your audio — this is what you would get from your giant RNN — and multiply it by some extra terms: the probability of the word sequence according to your language model, raised to some power, and then the length, raised to another power. If you take the log of this objective function, you get the log probability that was your original objective, plus alpha times the log probability under the language model, plus beta times the log of the length. These alpha and beta parameters let you trade off the importance of getting a transcription that makes sense to your language model against getting a transcription that makes sense to your acoustic model and actually sounds like the thing that you heard.
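The combined objective just described is score(y) = log p(y|x) + α·log p_lm(y) + β·log|y|. A sketch, where `lm_logprob` is a stand-in for an n-gram model query, and the α, β values are made up (in practice you'd tune them on held-out data):

```python
import math

def combined_score(acoustic_logprob, words, lm_logprob, alpha=1.5, beta=1.0):
    """log p(y|x) + alpha * log p_lm(y) + beta * log(word count).
    alpha trades the language model off against the acoustic model;
    beta is a length bonus/penalty that counteracts the tendency to
    over-penalize long transcriptions."""
    return (acoustic_logprob
            + alpha * lm_logprob(words)
            + beta * math.log(len(words)))

# Toy stand-in LM: a fixed log-probability cost per word.
toy_lm = lambda words: -2.0 * len(words)
score = combined_score(-10.0, ["wine", "and", "port"], toy_lm)
```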
The reason for that extra length term is that as you multiply in all of these terms, you tend to penalize long transcriptions a bit too much, so having a little bonus or penalty at the end that you can tweak to get the transcription length right is very helpful. The basic idea behind this is just to use beam search. Beam search is a really popular search algorithm — there are a whole bunch of instances of it — and the rough strategy is this. Starting from time zero, from t = 1 at the very beginning of your audio input, I start out with an empty list that I'm going to populate with prefixes; these prefixes are just partial transcriptions that represent what I think I've heard so far in the audio up to the current time. The way this proceeds is: at the current time step, I take each candidate prefix out of this list, and then I try all of the possible characters in my softmax neurons that could follow it. For example, I can try adding a blank. If the next element of c is actually supposed to be a blank, then that means I don't change my prefix — the blanks are just going to get dropped later — but I need to incorporate the probability of that blank character into the probability of this prefix, because it represents one of the ways I could reach that prefix, so I need to sum that probability into that candidate. Likewise, whenever I add a space to the end of a prefix, that signals that this prefix represents the end of a word; so in addition to adding the probability of the space into my current estimate, this gives me the chance to look up that word in my language model and fold that into my current score. And then if I try adding a new character onto this prefix, it's straightforward: I just update the probabilities based on the probability of that character. At the end of this, I'm going to have a huge list of possible prefixes that could be generated, and this is where you would normally get the exponential blow-up of trying all possible prefixes to find the best one. What beam search does is just say: take the k most probable prefixes, after removing all the duplicates in here, and then go and do this again. If you have a really large k, your algorithm will be more accurate at finding the best possible solution to this maximization problem, but it'll be slower.

Here's what ends up happening if you run this decoding algorithm. If you just run it on the RNN outputs, you'll see it's actually better than straight max decoding — you find slightly better solutions — but you still make things like spelling errors, like "Boston" spelled with an "i". But once you add in a language model, it can actually tell you that "Boston" with an "o" is much more probable than "Boston" with an "i". One place you can also drop in deep learning, which I wanted to mention very rapidly: if you're not happy with your n-gram model because it doesn't have enough context, or you've seen a really amazing neural language modeling paper that you'd like to fold in, one really easy way to link it into your current pipeline is to do rescoring. When this decoding strategy finishes, it can give you the most probable transcription, but it also gives you this big list of the top k transcriptions by probability, and what you can do is take your recurrent network and just rescore all of these — basically reorder them according to this new model. In the case of a neural language model, let's say this is my n-best list: I have five candidates that were output by my decoding strategy, and the first one is "I'm a connoisseur looking for wine and pork chops" — sounds good to me — and another is "I'm a connoisseur looking for wine and pork shots." This is actually quite subtle, and depending on what kind of connoisseur you are, it's sort of up to interpretation what you're looking for.
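A simplified sketch of the prefix beam search just described, without the language-model and length terms (those would be folded in when a space ends a word). The key bookkeeping is tracking, for each prefix, the probability of alignments ending in a blank versus a non-blank, which is what makes repeated characters work. This follows the general shape of published CTC decoders, but the arrays and alphabet here are toy values:

```python
import collections
import numpy as np

def prefix_beam_search(probs, alphabet, k=8, blank=0):
    """Keep the k most probable prefixes at each frame. For each prefix
    we track (p_blank, p_nonblank): total probability of the alignments
    that end in a blank vs. in the prefix's last character."""
    beams = {(): (1.0, 0.0)}
    for t in range(probs.shape[0]):
        nxt = collections.defaultdict(lambda: [0.0, 0.0])
        for prefix, (pb, pnb) in beams.items():
            for s in range(probs.shape[1]):
                p = probs[t, s]
                if s == blank:
                    nxt[prefix][0] += (pb + pnb) * p   # prefix unchanged
                elif prefix and s == prefix[-1]:
                    nxt[prefix + (s,)][1] += pb * p    # repeat needs a blank
                    nxt[prefix][1] += pnb * p          # else extends last char
                else:
                    nxt[prefix + (s,)][1] += (pb + pnb) * p
        ranked = sorted(nxt.items(), key=lambda kv: -(kv[1][0] + kv[1][1]))
        beams = {pre: tuple(pq) for pre, pq in ranked[:k]}
    best = max(beams, key=lambda pre: sum(beams[pre]))
    return "".join(alphabet[i] for i in best)

alphabet = "_ab"  # blank at index 0; toy 3-frame posteriors below
probs = np.array([[0.1, 0.8, 0.1],
                  [0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8]])
assert prefix_beam_search(probs, alphabet) == "ab"
```

Unlike max decoding, this sums probability over the multiple alignments that collapse to the same prefix, which is exactly what the greedy approach ignores.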
But perhaps a neural language model is going to be a little better at figuring out that wine and port are closely related, and that if you're a connoisseur, you might be looking for wine and port. So what you would hope is that a neural language model trained on a bunch of text will correctly reorder these things and figure out that the second beam candidate is actually the correct one, even though your n-gram model didn't help you. OK, so that is really the whole model at small scale — that is the set of concepts you need to get a working speech recognition engine based on deep learning. The thing that's left, to get to state-of-the-art performance and start serving users, is scale, so I'm going to run quickly through a bunch of the different tactics that you can use to try to get there. The two pieces of scale I want to cover, of course, are data and computing power — where do you get them? The first thing to know — this is just a number you can keep in the back of your head — is that transcribing speech data is not cheap, but it's also not prohibitive: it's about 50 cents to a dollar a minute, depending on the quality you want, who's transcribing it, and the difficulty of the data. Typical speech benchmarks you'll see out there are maybe hundreds to thousands of hours — the LibriSpeech dataset is maybe hundreds of hours, there's another dataset called VoxForge — and you can cobble these together and get maybe hundreds to thousands of hours. But the real challenge is that the application matters a lot. All the utterances I was playing for you are examples of read speech: people are sitting in a nice quiet room, reading something wonderful to me, and so I'm going to end up with a speech engine that's really awesome at listening to The Wall Street Journal, but maybe not so good at listening to someone in a crowded cafe. The application you want to target really needs to match your dataset, so if you're thinking about going and buying a bunch of speech data, it's worth thinking at the outset about what style of speech you're actually targeting. Are you worried about read speech like the examples we're hearing, or do you care about conversational speech? It turns out that when people talk in a conversation — when they're spontaneous, coming up with what to say on the fly, versus dictating something they already know how to say — they behave differently, and they exhibit all of these effects like disfluency and stuttering. Then in addition to that, there are all kinds of environmental factors that might matter for an application, like reverb and echo; we start to care about the quality of microphones and whether they have noise canceling; there's something called the Lombard effect, which I'll mention again in a second; and of course things like speaker accents. You really have to think carefully about how you collect your data to make sure you actually represent the kinds of cases you want to test on. The reason that read speech is really popular is that we can get a lot of it, and even if it doesn't perfectly match your application, it's cheap, and getting a lot of it can still help you. So I wanted to say a few things about read speech, because for less than ten bucks an hour — often a lot less — you can get a whole bunch of data. It has the disadvantage that you lose a lot of things like inflection and conversationality, but it can still be helpful. One of the things we've tried — and I'm always interested to hear more clever schemes for this — is that you can kind of engineer the way people read to try to get the effects that you want. Here's one: if you want a little more conversationality, to get people out of that humdrum dictation mode, you can start giving them reading material that's a little more exciting — movie scripts and books — and people will actually start voice acting for you: "Creep in," said the witch, "and see if it is
properly heated, so that we can put the bread in." So these are really wonderful workers — they're really getting into it to give you better data. "The wolf is dead! The wolf is dead!" — "and danced for joy around about the well with their mother." So yeah, when people read poetry, they get this sort of lyrical quality into it that you don't get from just reading The Wall Street Journal. And finally, there's something called the Lombard effect, which happens when people are in noisy environments. If you're at a noisy party trying to talk to your friend a couple of chairs away, you'll catch yourself involuntarily going, "HEY, OVER THERE, WHAT ARE YOU DOING?" — you raise your inflection, and you use different tactics to get your signal-to-noise ratio up; you work around the channel problem. This is very problematic when you're trying to do transcription in a noisy environment, because people will talk to their phones using all these effects, even though the noise canceling and everything could actually help them. So one strategy we've tried, with varying levels of success — "then they fell asleep and evening passed, but no one came to the poor children" — is to play loud noise in people's headphones to try to elicit this behavior. Again, here this person is raising their voice a little in a way they wouldn't if they were just reading.

Similarly, as I mentioned, there are a whole bunch of different augmentation strategies. There are all these environmental effects — reverberation, echo, background noise — that we would like our speech engine to be robust to, and one way to try to solve this is to collect a bunch of audio from those cases and transcribe it. But getting that raw audio is really expensive, so instead, an alternative is to take the really cheap read speech that's very clean and use some off-the-shelf, open-source audio toolkit to synthesize all the things you want to be robust to. For example, if we want to simulate noise in a cafe: here's just me talking to my laptop in a quiet room — "hello, how are you?" — and here's the sound of a cafe, which I can obviously collect independently, very cheaply. Then I can synthesize the combination by just adding these signals together — "hello, how are you?" — which actually sounds, I don't know, sounds to me like me talking to my laptop at a Starbucks or something. For our work on Deep Speech, we actually take something like 10,000 hours of raw audio that sounds kind of like this, and then we pile on lots and lots of audio tracks from Creative Commons videos. It turns out there's this strange thing where people upload noise tracks to the web that last for hours — it's like really soothing to listen to the highway or something — and you can download all of this free found data, overlay it on the voice, and synthesize perhaps hundreds of thousands of hours of unique audio. The idea is that it's just much easier to engineer your data pipeline to be robust than to engineer the speech engine itself to be robust. So whenever you encounter an environment you've never seen before and your speech engine is breaking down, you should shift your instinct away from trying to engineer the engine to fix it, and toward this idea of: how do I reproduce this really cheaply in my data? Here's that Wall Street Journal example again — "is it designed to carry large volumes of oil or other liquid cargo" — and if I wanted to, for instance, handle a person reading The Wall Street Journal on a tanker: "a ship designed to carry large volumes of oil or other liquid cargo," now with reverb. There's lots of reverb in this room, so you may not be able to hear it on the audio, but basically you can synthesize these things with one line of sox on the command line.
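A sketch of the additive-noise augmentation just described — mix a separately collected noise track into clean speech at a chosen signal-to-noise ratio. The sample arrays below are synthetic stand-ins for real recordings:

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Overlay a noise track on clean read speech at a target SNR (dB).
    The noise is looped/trimmed to the speech length, scaled, and simply
    added -- the same idea as mixing two tracks with a command-line
    audio tool."""
    noise = np.resize(noise, speech.shape)       # loop or trim to length
    speech_rms = np.sqrt(np.mean(speech ** 2))
    noise_rms = np.sqrt(np.mean(noise ** 2))
    gain = speech_rms / (noise_rms * 10 ** (snr_db / 20))
    return speech + gain * noise

rng = np.random.default_rng(0)
clean = rng.normal(size=16000)   # stand-in for 1 s of speech at 16 kHz
cafe = rng.normal(size=4000)     # shorter noise clip; it will be looped
noisy = add_noise(clean, cafe, snr_db=10.0)
```

Because the clean transcription is unchanged by mixing, every noise track multiplies your labeled data essentially for free.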
From some of our own work building a large-scale speech engine with these technologies: this helps a ton, and you can see that when we run on clean and noisy test utterances, as we add more and more data — all the way up to about 10,000 hours — and use a lot of these synthesis strategies, we can steadily improve the performance of the engine. In fact, on things like clean speech, you can get down well below 10% word error rate, which is a pretty strong engine.

OK, let's talk about computation, because the caveat on that last slide is: yes, more data will help, if you have a big enough model — and big models usually mean lots of computation. What I haven't talked about is how big these neural networks are and how big one experiment is. If you actually want to train one of these things at scale, what are you in for? Here's the back of the envelope: it's going to take at least the number of connections in your neural network — take one slice of that RNN, the number of unique connections — multiplied by the number of frames once you unroll the recurrent network, multiplied by the number of utterances you've got to process in your dataset, times the number of training epochs (the number of times you loop through the dataset), times three because you have to do forward and backward propagation, times two FLOPs for every connection because there's a multiply and an add. If you multiply this out for some parameters from the Deep Speech engine at Baidu, you get something like 1.2 × 10^19 FLOPs — about 10 exaFLOPs — and if you run this on a Titan X card, it will take about a month. Now, if you already know what the model is, that might be tolerable — if you're on your epic run to get your best performance so far, this is OK — but if you don't know what model is going to work, if you're targeting some new scenario, then you want it done now, so you can try lots and lots of models quickly. The easy fix is just to use a bunch more GPUs with data parallelism, and the good news is that, so far, it looks like speech recognition allows us to use large mini-batch sizes: we can process enough utterances in parallel that this is actually efficient. You'd like to keep maybe a bit more than 64 utterances on each GPU, and up to a total mini-batch size of like a thousand or maybe two thousand it's still useful. So if you're putting together your infrastructure, you can go out and buy a server that will fit eight of these Titan GPUs in it, and that will actually get you to less than a week of training time, which is pretty respectable. There are a whole bunch of ways to use GPUs; at Baidu, we've been using synchronous SGD. It turns out you've got to optimize things like your all-reduce code; once you leave one node, you have to start worrying about your network, and if you want to keep scaling, then thinking about network traffic and the right strategy for moving all of your data becomes important. But we've had success scaling really well, all the way out to things like 64 GPUs, getting linear speedups all along the way — if you've got a big cluster available, these things scale really well. And there are a bunch of other solutions: asynchronous SGD, for instance, is now kind of a mainstay of distributed deep learning, and there's also been some recent work on going back to synchronous SGD — which has a lot of nice properties — but using things like backup workers. So that's the easy thing: just throw more GPUs at it and go faster. One word of warning as you're building these systems is to watch for code that isn't as optimized as you expected it to be. This back-of-the-envelope calculation that we did — figuring out how many FLOPs are involved in the network and then how long it would take to run if our GPUs were running at full efficiency — you should actually do this for your own network. We call this the speed of light: it's the fastest your code could ever run on one GPU.
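The "speed of light" estimate just described is plain arithmetic. Every number below is a made-up placeholder — they happen to reproduce the ~1.2 × 10^19 FLOPs figure from the talk, but the real Deep Speech parameters aren't given here:

```python
connections = 100e6   # weights in one time-slice of the RNN (placeholder)
frames      = 1000    # unrolled time steps per utterance (placeholder)
utterances  = 1e6     # utterances in the training set (placeholder)
epochs      = 20      # passes over the data (placeholder)

# x3 for forward plus backward propagation, x2 FLOPs per connection
# (one multiply and one add).
total_flops = 3 * 2 * connections * frames * utterances * epochs

peak = 6e12                        # rough Titan X single-precision FLOP/s
days = total_flops / peak / 86400  # ~23 days at 100% efficiency: a month
```

If your measured training time is far worse than this bound, suspect an inefficiency in your code or libraries rather than accepting it.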
that you're just drastically underperforming that number, what could be happening is that you've hit a little edge case in one of the libraries you're using, and you're suffering a huge setback that you don't need to be feeling. So one of the things we found back in November is that in libraries like cuBLAS, certain mini-batch sizes can hit these weird catastrophic cases in the library where you could be suffering like a factor of two or three performance reduction. So that might take your wonderful one-week training time and blow it up to, say, a three-week training time. That's why I wanted to go through this and ask you to keep it in mind: while you're training these things, try to figure out how long it ought to be taking, and if it's going a lot slower, be suspicious that there's some code you could be optimizing. Another good trick, this one is particular to speech but you can also use it for other recurrent networks, is to try to keep similar-length utterances together. If you look at your data set, like a lot of things, you have this sort of distribution over possible utterance lengths, and you see there's a whole bunch that are maybe within about 50% of each other, but there's also a large number of utterances that are very short. And so what happens is, when we want to process a whole bunch of these utterances in parallel, if we just randomly select say a thousand utterances to go into a mini-batch, there's a high probability that we're going to get a whole bunch of these little short utterances along with some really long utterances. And in order to make all the CTC libraries work and all of our recurrent network computations easy, what we have to do is pad these audio signals with zeros so that everything lines up, meaning that we're wasting huge amounts of computation, maybe a factor of two or more. So one way to get around it is to just sort all of your utterances by length and then try to keep the mini-batches to be similar
lengths, so that you just don't end up with quite as much waste in each mini-batch. This kind of modifies your algorithm a little bit, but in the end it's worthwhile. All right, this is kind of all I want to say about computation. If you've got a few GPUs, keep an eye on your running time so that you know what to optimize, and pay attention to the easy wins like keeping your utterances together. You can actually scale really well, and I think for a lot of the jobs we see, you can have your GPU running at something like 50% of the peak, and that's all-in, with network time and all the bandwidth-bound stuff; you can actually run at two to three teraflops on a GPU that can only do five teraflops in the perfect case. So what can you actually do with this? One of my favorite results from one of our largest models is actually in Mandarin. We have a whole bunch of labeled Mandarin data at Baidu, and so one of the things that we did was scale up this model, train it on a huge amount of Mandarin data, and then, as we always do, sit down and do error analysis. What we would do is have a whole bunch of humans sitting around debating the transcriptions to figure out the ground truth, which tends to be very high quality, and then we'd go and run a sort of holdout test on some new people and on the speech engine itself. And so if you benchmark a single human being against this Deep Speech engine in Mandarin, powered by all the technologies we were just talking about, it turns out that the speech engine can get an error rate that's down below six percent character error rate, so only about six percent of the characters are wrong, and a single human sitting there listening to these transcriptions actually does quite a bit worse; it's almost ten percent. If you give people a bit of an advantage, which is you now assemble a committee of people and you give them a fresh test set, so that no one has seen it before, and we run this test again, it
turns out that the two cases are actually really similar, and you can end up with a committee of native Mandarin speakers sitting around debating, "no, no, I think this person said this," or "no, they have an accent, it's from the north, I think they're actually saying that," and then when you show them the Deep Speech transcription they go, "ah, that's what it was." So you can actually get this technology up to a point where it's highly competitive with human beings, even human beings working together, and this is sort of where I think all the speech recognition systems are heading, thanks to deep learning and the technologies that we're talking about here. Any questions so far? Yeah, go ahead. Yep, sorry. Yeah, so the question is: if humans have such a hard time coming up with the correct transcription, how do you know what the truth is? And the real answer is you don't, really. Sometimes you might have a little bit of user feedback, but in this instance we have very high quality transcriptions that are coming from many labelers teamed up with a speech engine. And so that could be wrong; we do occasionally find errors where we just think that's a label error. But when you have a committee of humans around, the really astonishing thing is that you can look at the output of the speech engines and the humans will suddenly jump ship and say, "oh no, no, no, the speech engine is actually correct," because it'll often come up with an obscure word or place that they weren't aware of. Yeah, so this is an inherently ambiguous result, but let's say that a committee of human beings tends to disagree with another committee of human beings about the same amount as a speech engine does. Yeah. Yeah, so this is using the CTC cost, right; that's really the core component of this system, it's how you deal with mapping one variable-length sequence to another. And the CTC cost is not perfect; it has this assumption of independence baked into the probabilistic
model, and because of that assumption we're introducing some bias into the system. For languages like English, where the characters are obviously not independent of each other, this might be a limitation. In practice, the thing that we see is that as you add a lot of data and your model gets much more powerful, you can still find your way around it, but it might take more data and a bigger model than necessary. And of course we hope that all the new state-of-the-art methods coming out of the deep learning community are going to give us an even better solution. Okay, right, empirically determined. Yeah, so the question is, for a spectrogram, we talked about these little spectrogram frames being computed from 20 milliseconds of audio; is that number special, is there a reason for it? This is really determined from years and years of experience; it's captured from the traditional speech community, and we know it works pretty well. There's actually some fun things you can do: you can take a spectrogram, go back and find the best audio that corresponds to that spectrogram, listen to it, and see if you lost anything, and with spectrograms at about this level of quantization you can kind of tell what people are saying; it's a little bit garbled, but it's still actually pretty good. So amongst all the hyperparameters you could choose, this one's kind of a good trade-off: keeping the information but also saving a little bit of the phase by doing it frequently. Yeah, I think in a lot of the models, in the demo for example, we don't use overlapping windows; they're just adjacent. Yeah. Yeah, so those results are from in-house software at Baidu. If you use something like Open MPI, for example, on a cluster of GPUs, it actually works pretty well on a bunch of machines, but I think some of the algorithms like all-reduce, once you start moving huge amounts of data, they're not optimal, and you'll suffer a hit once you start going to that many GPUs. Within a single box, if you use the CUDA libraries to move
data back and forth just on a local box, that stuff is pretty well optimized, and you can often do it yourself. Okay, so I want to take a few more questions at the end, and maybe we can run into the break a little bit; I wanted to just dive right through a few comments about production here. Of course the ultimate goal of solving speech recognition is to improve people's lives and enable exciting products, and so that means even though so far we've trained a bunch of acoustic and language models, we also want to get these things into production, and users tend to care about more than just accuracy. Accuracy of course matters a lot, but we also care about things like latency: users want to see the engine send them some feedback very quickly, so that they know that it's responding and that it's understanding what they're saying. And we also need this to be economical, so that we can serve lots of users without breaking the bank. So in practice, a lot of the neural networks that we use in research papers, because they're awesome for beating benchmark results, turn out not to work that well in a production engine. One in particular that I think is worth keeping an eye on: it's really common to use bidirectional recurrent neural networks, and so throughout the talk I've been drawing my RNN with connections that just go forward in time, but you'll see a lot of research results that also have a pass that goes backward in time. This works fine if you just want to process data offline, but the problem is that if I want to compute this neuron's output up at the top of my network, I have to wait until I see the entire audio segment so that I can compute this backward recurrence and get this response. So this sort of anti-causal part of my neural network that gets to see the future means that I can't respond to a user on the fly, because I need to wait for the end of their signal. So if you start out with these bidirectional RNNs, which are actually much easier to get working, and then you
jump to using a recurrent network that is forward-only, it'll turn out that you're going to lose some accuracy. And you might kind of hope that CTC, because it doesn't care about the alignment, would somehow magically learn to shift the output over to get better accuracy, artificially delaying the response so that it could get more context on its own, but it turns out to only do that a little bit in practice, and it's really tough to control. And so if you find that you're doing much worse, sometimes you have to sort of engage in model engineering. So even though I've been talking about these recurrent networks, I want you to bear in mind that there's this dual optimization going on: you want to find a model structure that gives you really good accuracy, but you also have to think carefully about how you set up the structure so that this little neuron at the top can actually see enough context to get an accurate answer, and not depend too much on the future. For example, what we could do is tweak this model so that this neuron at the top, the one that's trying to output the character "l" in "hello", can see some future frames but doesn't have this backward recurrence, so it only gets to see a little bit of context; that lets us kind of contain the amount of latency in the model. I'll skip over this. In terms of other online aspects, of course we want this to be efficient, right; we want to serve lots of users on a small number of machines if possible. And one of the things you might find, if you have a really big deep neural network or recurrent neural network, is that it's really hard to deploy them on conventional CPUs. CPUs are awesome for serial jobs where you just want to go as fast as you can for one string of instructions, but as we've discovered with so much of deep learning, GPUs are really fantastic, because when we work with neural networks we love processing lots and lots of arithmetic in parallel. But it's really only efficient if the batch that we're working
on, the chunks of audio that we're working on, are in a big enough batch. If we just process one stream of audio, so that my GPU is multiplying matrices times vectors, then my GPU is going to be really inefficient. So for example, on a K1200 GPU, this is something you could put in a server in the cloud, what you'll find is that you get really poor throughput considering the dollar value of this hardware if you're only processing one piece of audio at a time, whereas if you could somehow batch up audio to have, say, 10 or 32 streams going at once, then you can actually squeeze a lot more performance out of that piece of hardware. So one of the things that we've been working on, which works really well and is not too bad to implement, is to just batch up the packets as data comes in. If I have a whole bunch of users talking to my server, and they're sending me little hundred-millisecond packets of audio, what I can do is sit and listen to all these users, and when I catch a whole bunch of audio packets coming in from different people that start around the same time, I plug those all into my GPU and I process those matrix multiplications together. So instead of multiplying a matrix times only one little audio piece, I get to multiply it by a batch of, say, four audio pieces, and it's much more efficient. And if you actually do this on a live server and you plow a whole bunch of audio streams through it, you could support maybe 10, 20, 30 users in parallel, and as the load on that server goes up, as I have more and more users piling on, what happens is that the GPU will naturally start batching up more and more packets into single matrix multiplications. So as you get more users, you actually get much more efficient as well, and in practice, when you have a whole bunch of users on one machine, you usually don't see matrix multiplications happening with batch sizes of fewer than maybe four. So the summary of all of this is
that deep learning is really making the first steps to building a state-of-the-art speech engine easier than they've ever been. If you want to build a new state-of-the-art speech engine for some new language, all the components that you need are things that we've covered so far, and the performance now is really significantly driven by data and models. And as we were discussing earlier, I think future models from deep learning are going to make that influence of data and computing power even stronger; and of course data and compute are important so that we can try lots and lots of models and keep making progress. I think this technology is now at a stage where it's not just a research system anymore; we're seeing that the end-to-end deep learning technologies are now mature enough that we can get them into production, and I think you guys are going to be seeing deep learning play a bigger and bigger role in the speech engines that are powering all the devices that we use. So thank you very much; I think we're right at the end of time. Sounds good. All right, we had one in the back who's been waiting patiently, go ahead. More than one voice simultaneously? So the question is, how does the engine handle more than one voice simultaneously? Right now there's nothing in this formalism that allows you to account for multiple speakers, and usually when you listen to an audio clip in practice, it's clear that there's one dominant speaker. So this speech engine, of course, learns whatever it was taught from the labels, and it will try to filter out background speakers and just transcribe the dominant one, but if it's really ambiguous, then you get undefined results. Can you customize the transcription to the specific characteristics of a particular speaker? So we're not doing that in these pipelines right now, but of course a lot of different strategies have been developed in the traditional speech literature; there are things like i-vectors that try to quantify someone's voice, and those make
useful features for improving speech engines. You could also imagine taking a lot of the concepts like embeddings, for example, and tossing them in here, so I think a lot of that is left open to future work. I think we have to break for time, but I'll step off stage here and you guys can come to me with your questions. Thank you so much. So we'll reconvene at 2:45 for a presentation by Alex
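As an aside for readers of this transcript, the "speed of light" back-of-the-envelope check described earlier can be sketched in a few lines. All of the numbers below (total training FLOPs, per-GPU peak) are illustrative assumptions, not Baidu's actual figures; the point is just dividing the arithmetic a job needs by what the hardware can sustain.

```python
# "Speed of light": the fastest a training run could possibly go,
# given its total arithmetic and the hardware's peak throughput.
# If measured time is far above this, suspect an unoptimized code path.

def speed_of_light_hours(total_flops, peak_flops_per_sec, efficiency=1.0):
    """Lower bound on wall-clock time (hours) for a job needing `total_flops`
    on hardware sustaining `peak_flops_per_sec * efficiency`."""
    seconds = total_flops / (peak_flops_per_sec * efficiency)
    return seconds / 3600.0

# Assumed workload: ~20 exaFLOPs for a full training run (illustrative).
total_flops = 20e18
# Assumed hardware: 8 GPUs at 5 TFLOP/s peak each.
peak = 8 * 5e12

ideal = speed_of_light_hours(total_flops, peak)           # perfect efficiency
realistic = speed_of_light_hours(total_flops, peak, 0.5)  # ~50% of peak, as in the talk
print(f"speed of light: {ideal:.0f} h, at 50% efficiency: {realistic:.0f} h")
```

With these made-up numbers the ideal time comes out under a week, in line with the "eight Titans in one server" estimate above; the useful part is comparing your measured throughput against this bound, not the bound itself.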
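The padding-waste trick from the computation section, sorting utterances by length so each mini-batch pads to a similar maximum, can also be sketched. The length distribution below is synthetic (a clump of short utterances plus a spread of long ones, roughly as described in the talk); `waste` measures padded frames divided by useful frames, so 1.0 means no wasted computation.

```python
import numpy as np

def padded_frames(batch):
    """Frames computed when a batch is zero-padded to its longest utterance."""
    return max(batch) * len(batch)

def batch_random(lengths, batch_size, seed=0):
    """Baseline: randomly assign utterances to mini-batches."""
    order = np.random.default_rng(seed).permutation(len(lengths))
    return [[lengths[i] for i in order[j:j + batch_size]]
            for j in range(0, len(order), batch_size)]

def batch_sorted(lengths, batch_size):
    """Sort by length first, so each mini-batch holds similar lengths."""
    s = sorted(lengths)
    return [s[j:j + batch_size] for j in range(0, len(s), batch_size)]

def waste(batches):
    """Padded frames / useful frames; 1.0 = no padding waste."""
    useful = sum(sum(b) for b in batches)
    return sum(padded_frames(b) for b in batches) / useful

# Synthetic utterance lengths (in frames): many short, many long.
rng = np.random.default_rng(42)
lengths = np.concatenate([rng.integers(50, 150, 500),
                          rng.integers(500, 1500, 500)]).tolist()

print("random batching waste:", waste(batch_random(lengths, 64)))
print("sorted batching waste:", waste(batch_sorted(lengths, 64)))
```

Random batches mix short and long utterances and pad everything to the longest member, roughly the factor-of-two waste mentioned above; sorted batches pad very little. In a real pipeline you would still shuffle at the batch level to keep SGD well behaved.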
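Finally, the server-side batching idea, collecting audio packets that arrive from different users around the same time and running one matrix multiplication for all of them, can be sketched as below. This is a toy single-step version: `serve` is a hypothetical function name, the feature and layer sizes are made up, and a real server would also handle recurrent state and arrival-time windows.

```python
import numpy as np

def serve(pending_packets, weights):
    """Batch all pending (user_id, feature_vector) packets into one
    matrix-matrix multiply instead of one matrix-vector product per user."""
    if not pending_packets:
        return {}
    users, frames = zip(*pending_packets)
    batch = np.stack(frames)        # shape: (n_users, n_features)
    out = batch @ weights.T         # one GEMM serves every user at once
    return dict(zip(users, out))    # per-user layer activations

# Illustrative sizes: 160-dim input features, 512 hidden units.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 160))

# Three users' 100 ms packets arriving around the same time.
pending = [(uid, rng.standard_normal(160)) for uid in ("u1", "u2", "u3")]
results = serve(pending, W)
print(sorted(results), results["u1"].shape)
```

The batched result for each user is identical to the matrix-vector product that user would have gotten alone; the win is purely throughput, since the GPU runs one large multiply instead of several small ones, and as load rises the batches grow on their own.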