Foundations and Challenges of Deep Learning (Yoshua Bengio)
11rsu_WwZTc • 2016-09-27
Thank you, Sammy. I'll tell you about some very high-level stuff today, no new algorithm. Some of you already know about the book that Ian Goodfellow, Aaron Courville, and I have written; it's now in presale by MIT Press, I think you can find it on Amazon, and the actual shipping is going to be in December, hopefully in time for NIPS.

We've already heard this story from several people here, at least from Andrew, but it's good to ponder a little some of these ingredients that seem to be important for deep learning to succeed, and in general for machine learning to succeed at the really complicated tasks where we want to reach human-level performance. If a machine is going to be intelligent, it's going to need to acquire a lot of information about the world, and the big success of machine learning for AI has been to show that we can provide that information through data, through examples. But really think about it: that machine will need to know a huge amount about the world around us. This is not how we're doing it now, because we're not able to train such big models, but it will come one day, and so we'll need models that are much bigger than the ones we currently have. Of course, that means machine learning algorithms that can represent complicated functions. That's one good thing about neural nets, but there are many other machine learning approaches that in principle allow you to represent very flexible functions, like classical nonparametric methods or SVMs; those are going to be missing point 4, and potentially point 5, depending on the method. Point 3, of course, says that you need enough computing power to train and use these big models, and point 5 just says that it's not enough to be able to train the model: you have to be able to use it in a reasonably efficient way from a computational perspective. This is not always the case with some probabilistic models, where inference, in other words answering questions, having the computer actually do something, can be intractable, and then you need approximations, which may or may not be efficient.

Now, the point I really want to talk about is the fourth one: how do we defeat the curse of dimensionality? In other words, if you don't assume much about the world, it's actually impossible to learn about it, so I'm going to tell you a bit about the assumptions behind a lot of deep learning algorithms which make it possible for them to work as well as we've been seeing in practice in the last few years. (Something wrong, a Microsoft bug, okay.)

So how do we bypass the curse of dimensionality? The curse of dimensionality is about the exponentially large number of configurations of the variables we want to model. The number of joint values that all of the variables we observe can take is going to be exponentially large in general, because of their compositional nature: if each pixel can take two values and you've got a million pixels, then you've got 2 to the one million possible images. The only way to beat an exponential is to use another exponential, so we need to make our models compositional. We need to build our models in such a way that they can represent functions that look very complicated, and yet these models need to have a reasonably small number of parameters, small compared to the number of configurations of the variables. We can achieve that by composing little pieces together: composing layers together, composing units on the same layer together, and that's essentially what's happening with deep learning. You actually have two kinds of composition. There's the composition happening on the same layer; this is the idea of distributed representations, which I'm going to try to explain a bit more, and it's what you get when you learn embeddings for words or for images, representations in general.
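To make that counting concrete, here is a tiny sketch (my own illustration, not from the slides) of how fast the number of configurations blows up with the number of binary variables:

```python
# Number of distinct joint configurations of n binary variables (e.g. pixels).
def n_configs(n_vars, n_values=2):
    return n_values ** n_vars

print(n_configs(4))    # a 2x2 binary image has 16 configurations
print(n_configs(20))   # 20 pixels already give 1048576 configurations
# A million binary pixels would give 2**1000000 images,
# far beyond what any training set could enumerate.
```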
Then there's the idea of having multiple levels of representation; that's the notion of depth, and there another kind of composition takes place. The first one is a kind of parallel composition: I can choose the values of my different units separately, and together they represent an exponentially large number of possible configurations. In the second case there's a sequential composition, where I take the outputs of one level and combine them in new ways to build features for the next level, and so on. So the reason deep learning is working is that the world around us is better modeled by making these assumptions. It's not necessarily true that deep learning is going to work for any machine learning problem. In fact, if we consider the set of all possible distributions we might want to learn from, deep learning is no better than any other method; that's basically what the no-free-lunch theorem says. It's because we are incredibly lucky to live in a world which can be described using composition that these algorithms work so well. It's important to really understand this.

Before I go further into distributed representations, let me say a few words about non-distributed representations. Think about things like clustering, n-grams for language modeling, classical nearest neighbors, SVMs with Gaussian kernels, classical nonparametric models with local kernels, and decision trees. The way these algorithms really work is actually pretty straightforward if you cut the crap, hide the math, and try to understand what is going on: they look at the data in data space, they break that space into regions, and they use different free parameters for each of those regions to figure out what the right answer should be. It doesn't have to be supervised learning; even in unsupervised learning there's a right answer, which might be the density or something like that. You might think that's the only way of solving a problem: we consider all of the cases, we have an answer for each case, and maybe we interpolate between the cases we've seen. The problem with this is that somebody comes up with a new example which isn't in between two of the examples we've seen, something that requires us to extrapolate, non-trivial generalization, and these algorithms just fail. They don't really have a recipe for saying something meaningful away from the training examples.

There's another interesting thing to note here, which I'd like you to keep in mind before I show the next slide: we can do a kind of simple counting to relate the number of free parameters that can be learned and the number of regions in data space that we can distinguish. Here we basically have a linear relationship between these two things. For each region I'm going to need at least something like a center for the region, and if I need to output something, I'll learn an extra set of parameters to tell me what the answer should be in that area. So the number of parameters grows linearly with the number of regions I'm able to distinguish. The good news is I can have any kind of function: I can break up the space any way I want, and for each of those regions I can have any kind of output I need. For decision trees, the regions come from splitting along axes; the picture here is more like nearest neighbors. (Another bug; I have another option, sorry about this.)

Okay, so here's the point of view of distributed representations for solving the same general machine learning problem. We have a data space and we want to break it down, but we're going to break it down in a way that's not general; we're going to break it down in a way that makes assumptions about the data, but that is compositional and allows us to be exponentially more efficient. How are we going to do this? In the picture on the right, what you see is a way to break the input space by intersections of half-planes, which is the kind of thing that happens at the first layer of a neural net. Imagine the input is 2-dimensional, so I can plot it here, and I have three binary hidden units, c1, c2, c3. Because they're binary, you can think of them as little binary classifiers, and because it's a one-layer net, what each one does is a linear classification, so the colored hyperplanes here are their decision surfaces. Now, these three bits can take eight values, corresponding to whether each of them is on or off, and those configurations of bits correspond to seven regions here, because one of the eight is not feasible. So we're defining a number of regions corresponding to all the possible intersections of the corresponding half-planes, and now we can play the game of how many regions we get for how many parameters. What we see is that if we grow the number of features and of inputs, we get an exponentially large number of regions, all of these intersections corresponding to different binary configurations, yet the number of parameters grows linearly with the number of units. On top of that, imagine a linear classifier; that's the one-hidden-layer neural net. So the number of parameters grows just linearly with the number of features.
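The half-plane picture can be sketched in a few lines of numpy. The three units below are my own made-up example (not the ones from the slide): three lines in general position in 2-D carve the plane into 7 regions, one short of the 2^3 = 8 conceivable bit patterns, exactly the situation described above.

```python
import numpy as np

# Three hand-picked "hidden units" acting as linear binary classifiers in 2-D:
# unit 1 fires when x > 0, unit 2 when y > 0, unit 3 when x + y > 1.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.0, 0.0, -1.0])

# Sample the plane densely and record each point's 3-bit activation pattern.
g = np.linspace(-3, 3, 301)
xs = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
codes = (xs @ W.T + b > 0)

# Each distinct bit pattern is one region of the arrangement of half-planes.
# The pattern (off, off, on) would require x < 0, y < 0 and x + y > 1,
# which is impossible, so only 7 of the 8 codes are realizable.
n_regions = len(np.unique(codes, axis=0))
print(n_regions)  # 7
```

The point of the demo is the ratio: 3 units (9 parameters) already distinguish 7 regions, and with n units in d dimensions the number of realizable codes grows much faster than the parameter count.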
But the number of regions to which the network can give a different answer grows exponentially. This is very cool, and the reason it's very cool is that it allows those neural nets to generalize: while we're learning about each of those features, we can generalize to regions we've never seen, because we've learned enough about each of the features separately. I'm going to give you an example of this in a couple of slides; actually, let's do it first. Think about those features where the input is an image of a person: I have a detector that says the person wears glasses, I have another unit detecting whether the person is female or male, another detecting whether the person is a child or not, and you can imagine hundreds or thousands of these things. The good news is you could imagine learning each of these feature detectors, these little classifiers, separately. In fact you could do better than that, you could share intermediate layers between the input and those features, but let's take the worst case and imagine we train them separately, which is the case in the linear model I showed before: we have a separate set of parameters for each detector. If I have n features and each of them needs order of k parameters, then I need order of nk parameters, and I need order of nk examples. One thing you should know from machine learning theory is that if you have order of p parameters, you need order of p examples to do a reasonable job of generalizing. You can get around that by regularizing and effectively having fewer degrees of freedom, but to keep things simple, you need about the same number of examples, maybe ten or a hundred times more, as the number of truly free parameters. So now the relationship between the number of regions I can represent and the number of examples I need is quite nice, because the number of regions is going to be 2 to the number of these binary features. A person could wear glasses or not, be female or male, be a child or not; I could have a hundred of these things, and I could probably recognize reasonably well all 2 to the 100 configurations of people, even though I've obviously not seen all 2 to the 100 configurations. Why am I able to do that? Because the model can learn about each of these binary features more or less independently, in the sense that I don't need to see every possible configuration of the other features to learn about wearing glasses. I can learn about wearing glasses even though I've never seen somebody who was female and a child and chubby and had yellow shoes; if I've seen enough examples of people wearing glasses, I can learn about wearing glasses in general. I don't need to see all configurations of the other features to learn about one feature. This is really why this thing works: we're making assumptions about the data, namely that those features are meaningful by themselves, and you don't need data for each of the exponential number of regions in order to learn the proper way of detecting, or rather discovering, these intermediate features.

Let me add something here. There were some experiments recently showing that this kind of thing is really happening: the features I was talking about not only exist, as I'm assuming, but the optimization methods, the training procedures, discover them; they can learn them. This is an experiment done in 2012 in Antonio Torralba's lab at MIT, where they trained a usual convnet to recognize places, so the outputs of the net are just the types of places: is this a beach scene, an office scene, a street scene, and so on.
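The counting argument above can be written out directly; the numbers here are illustrative, my own choice rather than the talk's:

```python
# Distributed representation: n binary feature detectors,
# each learned with about k parameters of its own.
def distributed_params(n_features, k_per_feature):
    return n_features * k_per_feature

# Every on/off pattern of the n features is (at most) a distinguishable region.
def distinguishable_regions(n_features):
    return 2 ** n_features

n, k = 100, 10
print(distributed_params(n, k))      # 1000 parameters, so ~O(1000) examples
print(distinguishable_regions(n))    # up to 2**100 regions, never all observed
```

A local method in the sense described earlier would instead need parameters (and examples) proportional to the number of regions it distinguishes, which is exactly the exponential gap being claimed.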
But then what they did is ask people to analyze the hidden units and try to figure out what each hidden unit was doing, and they found a large proportion of units for which humans can find a pretty obvious interpretation of what those units like. They see a bunch of units which like different kinds of people, or animals, buildings, ceilings, tables, lighting, and so on. It's as if those neural nets were discovering semantic features, semantic because people can actually give them names, as the intermediate features on the way to the final goal of classifying scenes. And the reason they're generalizing is that you can now combine those features in an exponentially large number of ways: you could have a scene with a table, a different kind of lighting, some people, maybe a pet, and you can say something meaningful about combinations of these things, because the network is able to learn all of these features without having to see all of their possible configurations. I don't know if my explanation makes sense to you, but now is your chance to ask me a question. All clear? Usually it's not.

Yeah, with decision trees? Right, to some extent. If the question is whether we can do the same thing with a set of decision trees: yes, in fact this is one of the reasons why forests, or bagged trees, work better than single trees. Forests or bagged trees are one level deeper than single trees, but they still don't have as much of a distributed aspect as neural nets, and usually they're not trained jointly; boosted trees are, to some extent, in a greedy way. Any other question?

Cases where the data is non-compositional? I don't understand the question; what do you mean, non-compositional? Compositional structure is everywhere around us. I don't think there are examples of neural nets that really work well where the data doesn't have some kind of compositional structure in it, but if you come up with an example, I'd like to hear about it.

Yes, you can think about this issue in graphical-model terms; it can be done, but you have to think not about feature detection, as I've been doing here, but about generating an image or something like that; it's easier to think about it that way. The same kinds of things happen if you think about how I could generate an image. Think about underlying factors like which objects are present, where they are, their identity, their size; these are all independent factors which you compose together in funny ways. If you were to build a graphics engine, you could see exactly what those ways are, and it's much, much easier to represent that joint distribution using this compositional structure than to work directly in pixel space, which is what you would normally do with a classical nonparametric method, and it wouldn't work. If you look at our best generative models for images now, like GANs or VAEs, we're really not there yet, but they're amazingly better than anything people could dream up just a few years ago in machine learning.

Okay, let me move on, because there are other things to talk about. This is all kind of hand-wavy, but some people have done some math around these ideas. For example, there's one result from ICLR two years ago where we studied the single-layer case. We consider a network of rectifiers (ReLUs), and we find that the network of course computes a piecewise-linear function. One way to quantify the richness of the functions it can compute (I was talking about regions before, but you can do the same thing here) is to count how many pieces this network's input-to-output function has, and it turns out that it is exponential in the number of inputs; roughly, it scales like the number of units to the power of the number of inputs. So for distributed representations there is an exponential kicking in. We also studied the depth aspect. What you need to know about depth is that there's a lot of earlier theory saying that a single layer is sufficient to represent any function; however, that theory doesn't specify how many units you might need, and in fact you might need an exponentially large number of units. What several results show is that there are functions that can be represented very efficiently, with few units and so few parameters, if you allow the network to be deep enough. Out of all the functions that exist (again, it's a luckiness thing) there's a very, very small fraction which happen to be very easy to represent with a deep network, and if you try to represent these functions with a shallow network, you're screwed: you're going to need an exponential number of parameters, and so an exponential number of examples to learn them. But again, we're incredibly lucky that the functions we want to learn have this property, and in a sense it's not surprising: we use this kind of compositionality and depth everywhere. When we write a computer program, we don't just have a single main; we have functions that call functions, and so on. And we were able to show similar things to the single-layer case: as you increase depth in these deep ReLU networks, the number of pieces in the piecewise-linear function grows exponentially with the depth. So it's already exponentially large with a single layer, but it gets exponentially larger still with a deeper net. Okay, so this was the topic of representation of functions, of why deep architectures can be very powerful if we're lucky, and we seem to be lucky. The other topic I want to mention, also very much in the foundations, is how it is that we're able to train these neural nets in the first place.
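The depth claim can be illustrated with a classic construction (a standard textbook example, not the specific networks from the paper): a one-hidden-layer ReLU "tent" map has 2 linear pieces, and composing it with itself k times, i.e. stacking k such layers, yields a function with 2^k pieces, exponential in depth, whereas a single layer needs a unit per breakpoint.

```python
import numpy as np

def tent(x):
    # One tiny ReLU layer computing the tent map 2*min(x, 1-x) on [0, 1]:
    # tent(x) = 2*relu(x) - 4*relu(x - 0.5)
    return 2.0 * np.maximum(x, 0.0) - 4.0 * np.maximum(x - 0.5, 0.0)

def pieces_after_depth(k):
    """Count the linear pieces of the k-fold composition of the tent map."""
    # A dyadic grid hits every breakpoint exactly, so slopes within a piece
    # are constant and piece boundaries show up as slope changes.
    x = np.linspace(0.0, 1.0, 2**k * 8 + 1)
    y = x
    for _ in range(k):
        y = tent(y)                   # each extra layer doubles the pieces
    slopes = np.diff(y) / np.diff(x)
    return int((slopes[1:] != slopes[:-1]).sum()) + 1

for k in [1, 2, 3, 4]:
    print(pieces_after_depth(k))      # 2, 4, 8, 16: exponential in depth
```

Representing that depth-k sawtooth with a single hidden layer would require on the order of 2^k units, one per breakpoint, which is the shallow-versus-deep gap described above.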
In the '90s, many people decided to stop doing research on neural nets because there were theoretical results suggesting that there is an exponentially large number of local minima in the training objective of a neural net. In other words, the function we want to optimize has many of these holes, and if we start at a random place, what's the chance we're going to find the best one, the one that corresponds to a good cost? That was one of the motivations for people flocking to a very large area of research in machine learning in the '90s and 2000s based on algorithms that rely on convex optimization to train, because of course if we can do convex optimization, we eliminate this problem: if the objective function is convex in the parameters, we know there's a single global minimum. Let me show you a picture to give you a sense of this. On the top right: if you draw a random smooth function in 1-D or 2-D or 3-D, like this one in 2-D, you see that it's going to have many ups and downs, many local minima. But the good news is that in high dimension it's a totally different story. What are the dimensions here? We're talking about the parameters of the model, and the vertical axis is the cost we're trying to minimize. What happens in high dimension is that instead of encountering a huge number of local minima as we optimize, what we encounter instead is a huge number of saddle points. A saddle point is like the thing on the bottom right, in 2-D: you have two parameters, the y-axis is the cost you want to minimize, and at a saddle point there are directions where the objective function curves up and other directions where it curves down. So a saddle point is a minimum in some directions and a maximum in others. This is interesting because even though saddle points are places where you could get stuck in principle (if you're exactly at the saddle point, you don't move), if you move a little bit away from one, you will go down the saddle.

What our work, and other work from NYU by Anna Choromanska and collaborators of Yann LeCun, showed is that in very high dimension, not only is the issue more about saddle points than local minima, but the local minima are good. Let me try to explain what I mean by this, starting with an experiment from the NYU group. They gradually changed the size of a neural net and looked at what appear to be local minima (they could be saddle points), the lowest ones they could obtain by training, and what you're looking at is the distribution of costs they get from different initializations of their training. What happens is that when the network is small, like the pink here on the right, there's a wide spread of costs you can get depending on where you start, and they're pretty high. If you increase the size of the network, all of the local minima that you find concentrate around a particular cost; you don't get any of the bad local minima you would get with a small network, they're all pretty good. And if you increase the size even more (this is just a single-hidden-layer network, nothing very complicated), this phenomenon increases even more; in other words, they all converge to about the same cost.

Let me try to explain what's going on. Go back to the picture of the saddle point, but instead of being in 2-D, imagine you're in a million dimensions; in fact, people have billion-dimensional networks these days (I'm sure Andrew has even bigger ones, I'm not sure). What happens in this very high-dimensional parameter space is that, if things are not really bad for you, if you imagine a little bit of randomness in the way the problem is set up, which seems to be the case, then in order to have a true local minimum you need the curvature to go up in all of the billion directions. If there's a certain probability of each particular direction curving up, the probability that all of them curve up becomes exponentially small. We tested that experimentally. What you see on the bottom left is a curve showing the training error as a function of what's called the index of the critical point, which is just the fraction of the directions that are curving down: 0% means it's a local minimum, 100% means it's a local maximum, and anything in between is a saddle point. What we find is that as training progresses, we come close to a bunch of saddle points, and none of them are local minima (otherwise we would be stuck); in fact, we never encounter local minima until we reach the lowest cost we were able to get. In addition, there is theory suggesting that the local minima will actually be close in cost to the global minimum: they will be above it, concentrated in a little band above the global minimum, and the larger the dimension, the more this is true. To go back to my analogy: at some point, of course, you will get local minima, even though it's unlikely while you're in the middle; when you get close to the bottom, well, you can't go lower, so the function has to rise in all directions. So that's kind of good news. In spite of this, I don't think the optimization problem of neural nets is solved.
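The "all directions must curve up" argument can be sketched numerically. This is a toy model, not the experiments from the paper: for a random symmetric Hessian, roughly half the eigenvalue directions curve down, so a random critical point in high dimension is almost surely a saddle.

```python
import numpy as np

rng = np.random.default_rng(0)

def index_of_critical_point(hessian):
    """Fraction of directions curving down at a critical point:
    0.0 = local minimum, 1.0 = local maximum, in between = saddle."""
    eigvals = np.linalg.eigvalsh(hessian)
    return float((eigvals < 0).mean())

# A random symmetric matrix standing in for the Hessian at a critical point.
d = 500
A = rng.normal(size=(d, d))
H = (A + A.T) / 2.0
idx = index_of_critical_point(H)
print(idx)          # roughly 0.5: about half the directions curve down

# If each direction independently curved up with probability 1/2, the chance
# of a true local minimum (index 0) would be 2**-d: vanishingly small.
print(2.0 ** -500)
```

The genuinely random-Hessian case is of course a caricature of a real loss surface, but it conveys why minima become rare relative to saddles as the dimension grows.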
There are still many cases where we find ourselves stuck, and we still don't understand what the landscape looks like. There's a set of beautiful experiments by Ian Goodfellow that help us visualize a bit of what's going on, but I think one of the open problems of optimization for neural nets is: what does the landscape actually look like? It's hard to visualize, of course, because it's very high-dimensional. For example, we don't know what those saddle points really look like: when we actually measure the gradient as we approach them, it's not close to zero, so we never get to truly flat places. This may be due to the fact that we're using SGD, which is kind of hovering above things. There might be conditioning issues: even if you are at or near a saddle point, you might be stuck even though it's not a local minimum, because in many directions, maybe 95% of them, the function is still going up, and the remaining directions are hard to reach simply because there's a lot more curvature in some directions than in others; that's the traditional ill-conditioning problem. We don't know exactly what makes it hard to train some networks. Usually convnets are pretty easy to train, but when you go into things like machine translation, or even worse, reasoning tasks with things like neural Turing machines, it gets really, really hard to train these things, and people have to use all kinds of tricks, like curriculum learning, which are essentially optimization tricks to make the optimization easier. So I don't want to tell you that the optimization problem of neural nets is easy and done and we don't need to worry about it, but it's much easier and less of a concern than what people thought in the '90s.

Okay, so machine learning, I mean deep learning, is moving out of pattern recognition and into more complicated tasks, for example including reasoning, and combining deep learning with reinforcement learning, planning, and things like that. You've heard about attention; that's one of the tools that is really, really useful for many of these tasks. We've come up with attention mechanisms not as a way to focus on what's going on in the outside world, which is how we usually think of attention, like attention in visual space, but as internal attention, in the space of representations that have been built. That's what we do in machine translation, and it's been extremely successful, as Quoc said, so I'm not going to show you any of those pictures.

Now I'm getting more into the domain of challenges. A challenge I've been working on since I was a baby researcher, as a PhD student, is long-term dependencies in recurrent nets, and although we've made a lot of progress, this is still something we haven't completely cracked. It's connected to the optimization problem I told you about before, but it's a very particular kind of optimization problem. Some of the ideas we've used to try to make the propagation of information and gradients easier include skip connections over time and multiple time scales; there's some recent work in this direction from my lab and other groups. Even the attention mechanism itself can be thought of as a way to help deal with long-term dependencies. The way to see this is to think of the place on which we're putting attention as part of the state. Imagine a recurrent net that has two kinds of state: it has the usual recurrent-net state, but also the contents of a memory (Quoc told you about memory networks and Neural Turing Machines), and the full state really includes all of these things; the little recurrent net is able to read from or write to that memory. What happens is that there are memory elements which don't change over time; maybe they're written only once.
And so the information that has been stored there can stay for as much time as needed, as long as it's not overwritten. That means that if you consider the gradients back-propagated through those cells, they can go pretty much unhampered, and there's no vanishing-gradient problem. So this view of the problem of long-term dependencies, with memory, I think could be very useful.

All right, in the last part of my presentation I want to tell you about what I think is the biggest challenge ahead of us, which is unsupervised learning. Any question about attention and memory before I move on to unsupervised learning? Okay. So why do we care about unsupervised learning? It's not working well; actually, it's working a lot better than it was, but it's still not something you find in industrial products, at least not in an obvious way. There are less obvious ways in which unsupervised learning is already extremely successful. For example, when you train word embeddings with word2vec or any other model, and you use that to pre-train, like we did for our machine translation systems or other kinds of NLP tasks, you're exploiting unsupervised learning; even when you train a language model that you're going to stick into some other system, or pre-train something with it, you're also doing unsupervised learning. But I think the potential and the importance of unsupervised learning are usually underrated. So why do we care? First of all, the idea of unsupervised learning is that we can learn from large quantities of unlabeled data that humans have not curated, and we have lots of that. Humans are very good at learning from unlabeled data. I have an example I use often that makes this very, very clear: children can learn all kinds of things about the world even though no adult ever tells them anything about it until much later, when it's too late. Physics, for example: a two- or three-year-old understands physics.
A two- or three-year-old understands physics. If she has a ball, she knows what's going to happen when she drops it. She knows how liquids behave. She knows all kinds of things about objects and ordinary Newtonian physics, even though she doesn't have explicit equations or a way to describe any of it with words; but she can predict what's going to happen next. And the parents don't tell the children that force equals mass times acceleration. So this is purely unsupervised, and it's very powerful. We don't even have that right now: we don't have computers that can understand the kinds of physics that children can understand. It looks like a skill that humans have, one that's very important for making sense of the world around us, and one we haven't yet succeeded in putting into machines. Let me tell you about other reasons, connected to this, why unsupervised learning could be useful. When you do supervised learning, essentially the way you train your system is to focus on a particular task: here are the input variables, and here's an output variable that I would like you to predict given the input; you're learning P(Y|X). But if you're doing unsupervised learning, you're essentially learning about all the possible questions that could be asked about the data you observe. It's not that there's X1, X2, X3 and Y; everything is an X, and you can predict any of the X's given any of the other X's. If I give you a picture and I hide a part of it, you can guess what's missing. If I hide the caption, you can generate the caption given the image. If I hide the image and give you the caption, you can guess what the image would be, or draw it, or figure out from examples which one is the most appropriate. You can answer any question about the data once you have captured the joint distribution between all the variables, essentially. So that could be useful.
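As a toy illustration of that "everything is an X" view (my own sketch, with made-up counts): once you have the joint distribution over a few variables, hiding any one of them and predicting it from the others is just conditioning.

```python
from collections import Counter

# A toy empirical joint over three binary variables (made-up counts).
samples = [(0, 0, 0)] * 40 + [(1, 1, 0)] * 30 + [(1, 1, 1)] * 20 + [(0, 1, 1)] * 10
joint = Counter(samples)

def conditional(target_idx, context):
    # P(x_target | observed variables), where `context` maps the
    # indices of the observed variables to their values.
    counts, total = {}, 0
    for x, c in joint.items():
        if all(x[i] == v for i, v in context.items()):
            counts[x[target_idx]] = counts.get(x[target_idx], 0) + c
            total += c
    return {v: c / total for v, c in counts.items()}

# "Hide" variable 2 and guess it from variables 0 and 1:
print(conditional(2, {0: 1, 1: 1}))  # {0: 0.6, 1: 0.4}
# "Hide" variable 0 instead: the same joint answers that question too.
print(conditional(0, {1: 1, 2: 1}))
```

The same table answers every such question; that is the sense in which capturing the joint distribution subsumes all the individual prediction tasks.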
Another practical thing unsupervised learning has been used for, and in fact this is how the whole deep learning story started, is as a regularizer. In addition to telling our model that we want to predict Y given X, we're saying: find representations of X that both predict Y and somehow capture something about the distribution of X, the leading factors, the explanatory factors of X. This again is making an assumption about the data, so we can use it as a regularizer if the assumption is valid. Essentially, the assumption is that the factor Y we're trying to predict is one of the factors that explain X, and that by doing unsupervised learning to discover the factors that explain X, we're going to pick up Y among them, so it becomes much easier to do supervised learning. Of course, this is also the reason why transfer learning works: there are underlying factors that explain the inputs for a bunch of tasks, and maybe one subset of factors is relevant for one task and a different subset is relevant for another task, but if these factors overlap then there's a potential for synergy from multi-task learning. The reason multi-task learning works is the same reason unsupervised learning works: there are representations and factors that explain the data and that can be useful for the supervised learning tasks of interest. The same argument applies to domain adaptation. The other thing people don't talk about as much regarding unsupervised learning, and I think it was part of the initial success we had with stacking autoencoders and RBMs, is that you can actually make the optimization problem of training deep nets easier. If you train a bunch of RBMs or a bunch of autoencoders, and I'm not saying this is the right way of doing it, it captures some of the spirit of what unsupervised
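A minimal sketch of that greedy, layer-local training idea (my own toy linear version, not the actual RBM or autoencoder procedures used in that early work): each layer is a small autoencoder trained purely on its own reconstruction error, and the next layer is trained on the codes of the previous one, so no gradient ever has to cross layer boundaries.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_layer(X, hidden, steps=500, lr=0.01):
    # One linear autoencoder layer, trained locally: encode H = X @ W,
    # decode R = H @ V, and minimize the squared reconstruction error.
    d = X.shape[1]
    W = rng.normal(0, 0.1, (d, hidden))
    V = rng.normal(0, 0.1, (hidden, d))
    for _ in range(steps):
        H = X @ W
        E = H @ V - X                       # reconstruction error
        V -= lr * H.T @ E / len(X)          # local gradient on decoder
        W -= lr * X.T @ E @ V.T / len(X)    # local gradient on encoder
    return W

# Greedy stacking: each layer trains on the previous layer's codes.
X = rng.normal(size=(200, 8))
reps = [X]
for h in (6, 4):
    reps.append(reps[-1] @ train_layer(reps[-1], h))
print([r.shape for r in reps])  # [(200, 8), (200, 6), (200, 4)]
```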
learning does. A lot of the learning can be done locally: you're trying to extract some information, to discover some dependencies, and that's a local thing. Once you have a slightly better representation, you can tweak it again to extract more independent factors, or something like that. So there's a sense in which the optimization problem might be easier if you have a very deep net. Another reason we should care about unsupervised learning, even if our ultimate goal is supervised learning, is that sometimes the output variables are complicated: they're compositional, they have a joint distribution. In machine translation, which we talked about, the output is a sentence, a group of words with a complicated joint distribution given the input in the other language. It turns out that many of the things we discover by exploring unsupervised learning, which is essentially about capturing joint distributions, can often be used to deal with these structured-output problems, where you have many outputs that form a complicated, compositional distribution. There's another reason why I think unsupervised learning is going to be really necessary for AI: model-based reinforcement learning. I think I have a slide just for this. Let's think about self-driving cars, a very popular topic these days. How did I learn that I shouldn't do certain things with the wheel that would kill me while driving? Not because I have experienced the states where I get killed; I simply haven't done it a thousand times in order to learn to avoid it. So supervised learning, or rather traditional reinforcement learning, on-policy learning, actor-critic, things like that, won't work here, because I need to generalize about situations that I'm never going to encounter, because if I did encounter them I would die. These are dangerous states that I need to generalize about.
But I can't have enough data for those states, and I'm sure there are lots of machine learning applications in the same situation. I remember, a couple of decades ago, I got some data from a nuclear plant, and they wanted to predict when it's going to blow up, in order to avoid it. So I asked how many examples they had; the answer is zero, right. So you see, sometimes it's hard to do supervised learning because the data you would like to have you can't have; these are situations that are very rare or must never happen. How can we possibly solve this problem? The only solution I can see is that we learn enough about the world to predict how things would unfold. When I'm driving, I have a kind of mental model of physics and of how cars behave, so I can figure out that if I turn right at this point I'm going to end up in the wall, and it's going to be very bad for me; I don't need to actually experience it to know that it's bad. I can make a mental simulation of what would happen. So I need a kind of generative model of how the world would unfold if I took such and such actions, and unsupervised learning is the ideal thing for building that. Of course it's going to be hard, because we're going to have to train models that capture a lot of aspects of the world in order to generalize properly in those situations, even though they never see data from them. So that's one reason why I think reinforcement learning needs to be worked on more. I have a little thing here: I think people who have been doing deep learning can collaborate with people doing reinforcement learning, and not just by providing a black box that they can use in their usual algorithms. There are things we do in supervised and unsupervised deep learning that can be useful in rethinking reinforcement learning.
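The mental-simulation idea can be sketched in a few lines (entirely my own toy example; the dynamics model is hand-written here, standing in for a model that would be learned from unlabeled experience): candidate actions are evaluated by rolling the model forward and penalizing predicted crashes, so the dangerous states are never actually visited.

```python
def model_step(position, velocity, steer):
    # Stand-in for a learned forward model of the world: predicts the
    # next state given the current state and an action.
    velocity = velocity + steer
    return position + velocity, velocity

def imagined_return(position, velocity, steer, horizon=10):
    # Mental simulation: roll the model forward, reward survival, and
    # penalize a predicted crash into the wall at x = 10, all without
    # ever experiencing a real crash.
    total = 0.0
    for _ in range(horizon):
        position, velocity = model_step(position, velocity, steer)
        if position >= 10:
            return total - 100.0       # predicted disaster
        total += 1.0                   # reward for surviving one step
    return total

# Pick the steering action with the best *imagined* outcome.
actions = [-1.0, 0.0, 1.0]
best = max(actions, key=lambda a: imagined_return(0.0, 1.0, a))
print(best)  # -1.0: steer away from the wall
```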
One thing I really like to think about is credit assignment. In other words, how do different machine learning algorithms figure out what the hidden units are supposed to do, what the intermediate computations or intermediate actions should be? That's what credit assignment is about, and backprop is the best recipe we currently have for doing it. It tells us how the parameters of some intermediate computation should change so that a cost much later, a hundred steps later if it's a recurrent net, is reduced. So we could probably take some inspiration from backprop and how it's used in order to improve reinforcement learning. One such cue: when we do supervised learning with backprop, we don't predict the expected loss and then try to minimize it, where the expectation would be over the different realizations of the correct class. That's not what we do, but it is what people do in RL: they learn a critic, or a Q function, which is the expected value of the future reward, or the future loss, which in our case might be the negative log-probability of the correct answer given the input, and then they backprop through it, or use it to estimate the gradient on the actions. When we do supervised learning, instead, we do credit assignment using the particular observation of the correct class that actually occurred for this X: we have X, we have Y, and we use that Y to figure out how to change our prediction or action. It looks like something similar should be done for RL, and in fact we have a paper on something like this for sequence prediction; it's the kind of work at the intersection of structured outputs, reinforcement learning, and supervised learning. So I think there's a lot of potential benefit in changing the frame of thinking that people in RL have had for many decades.
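To make the contrast concrete, here is a small sketch (my own illustration, not the method from the paper mentioned above) of the two styles of credit assignment for a softmax classifier: the supervised gradient uses the realized correct class directly, while a score-function (REINFORCE-style) estimator samples an action, observes a loss, and only matches the gradient of the expected loss on average, with much higher variance.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def supervised_grad(z, y):
    # Supervised credit assignment: use the *realized* correct class y.
    # For cross-entropy, the gradient on the logits is p - onehot(y).
    g = softmax(z)
    g[y] -= 1.0
    return g

def reinforce_grad(z, y, rng, samples=1):
    # RL-style credit assignment: sample an "action" (a class), observe
    # a 0-1 loss, and weight the score function by it. In expectation
    # this is the gradient of the expected loss, but each sample is a
    # noisy estimate rather than the exact realized-label gradient.
    p = softmax(z)
    g = np.zeros_like(z)
    for _ in range(samples):
        a = rng.choice(len(z), p=p)
        loss = 0.0 if a == y else 1.0
        g += loss * (np.eye(len(z))[a] - p)   # score-function estimator
    return g / samples

z, y = np.array([1.0, 2.0, 0.5]), 1
print(supervised_grad(z, y))
print(reinforce_grad(z, y, np.random.default_rng(0), samples=10000))
```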
People in RL have not been looking at the world with the same eyes as people doing neural nets. They've been thinking about the world in terms of discrete states that can be enumerated, and proving theorems about algorithms that depend on essentially collecting enough data to fill in all the possible configurations of the state and their corresponding effects on the reward. When you start thinking in terms of neural nets and deep learning, the way to approach problems is very different. OK, let me continue about unsupervised learning and why it is so important. If you look at the kinds of mistakes our current machine learning algorithms make, you find that our neural nets are often just cheating: they're using the wrong cues to produce their answers, and sometimes it works, sometimes it doesn't. So how can we make our models smarter, make fewer mistakes? The only real solution is to make sure those models actually understand how the world works, at least at the level of humans, in order to get human-level accuracy, human-level performance. It may not be necessary for every particular problem you're trying to solve: maybe we can get away with doing speech recognition without really understanding the meaning of the words; that's probably going to be OK. But for other tasks, especially those involving language, I think having models that actually understand how the world ticks is going to be very important. So how could we have machines that understand how the world works? One of the ideas I've been talking about a lot in the last decade is disentangling factors of variation. This is related to a very old idea in pattern recognition and computer vision called invariance. The idea of invariance is that we would like to compute, or initially design and nowadays learn, features, say of the image, that are invariant to the things we don't care about. Maybe we want to do object recognition, so we don't care about position or orientation.
So we would like features that are translation-invariant, rotation-invariant, scale-invariant, whatever. That's what invariance is about. But when you're in the business of unsupervised learning, of trying to figure out how the world works, extracting invariant features is not good enough. What we actually want is to extract all of the factors that explain the data. If we're doing speech recognition, we want not only to extract the phonemes but also to figure out what kind of voice this is, maybe who is speaking, what the recording conditions are, what kind of microphone it is, whether we're in a car or outside. All that information, which you would normally try to get rid of, you actually want to learn about, so that you'll be able to generalize even to new tasks: maybe the next day I'm not going to ask you to recognize phonemes but to recognize who's speaking. More generally, if we're able to disentangle the factors that explain how the data varies, everything becomes easy, especially if those factors can then be generated in an independent way to generate the data. For example, we can learn to answer a question that depends only on one or two factors, and basically eliminate all the other ones because we've separated them out. A lot of things become much easier. So that's one notion: disentangling factors. There's another notion, multiple levels of abstraction, which is of course at the heart of what we're trying to do with deep learning. The idea is that we can have representations of the world, representations of the data, as descriptions involving factors or features, and we can do that at multiple levels, some more abstract than others. If I'm looking at a document, there's the level of the pixels, the level of the strokes, the level of the characters, the level of the words, and maybe the level of the meaning of individual words.
We actually have systems that will recognize all of these levels from a scanned document. When we go higher up, we're not sure what the right levels are, but clearly there must be representations of the meaning not just of single words but of sequences of words, of the whole paragraph, of what the story is. Why is it important to represent things in that way? Because higher levels of abstraction are representations from which it is much easier to do things, to answer questions. The more semantic levels basically mean we can very easily act on the information when it's represented that way. Think about the level of words: it's much easier to check whether a particular word is in the document if I have the words extracted than if I have to do it from the pixels. And if I have to answer a complicated question about, say, the intention of the person, working at the level of words is not abstract enough; I need to work at a more abstract level, at which the same notion could be represented with many different choices of words, where many different sentences can express the same meaning, and I want to be able to capture that meaning. The last slide I have is about something I've been working on in the last couple of years, which is connected to unsupervised learning but more generally to the relationship between how we can build intelligent machines and the intelligence of humans or animals. As you may know, this was one of the key motivations for doing neural nets in the first place. The intuition is that we are hoping there are a few simple key principles that explain what allows us to be intelligent, and that if we can discover those principles, we can also build machines that are intelligent. That's why neural nets were inspired by things we know from the brain in the first place. We don't know whether this is true, but if it is, it's great.
This would make it much easier to understand how brains work, as well as to build AI. Right now our best neural nets are very different from what's going on in brains, as far as we can tell by talking to neuroscientists. In particular, backprop, although it's kicking ass from a machine learning point of view, it's not clear at all how something like it would be implemented in brains. So in trying to bridge this gap I've been exploring that, and also trying to see how we could generalize whatever credit-assignment principles come out of it in order to also do unsupervised learning. We've made a little bit of progress. A couple of years ago I came up with an idea called target prop, which is a way of generalizing backprop to propagating targets for each layer; of course this idea has a long history. More recently we've been looking at ways to implement gradient estimation in deep recurrent networks that perform some computation, which turn out to end up with parameter updates corresponding to gradient descent on the prediction error, and that look like something neuroscientists have been observing and don't completely understand, called STDP: spike-timing-dependent plasticity. I don't really have time to go into this, but I think this whole area of reconnecting neuroscience with machine learning and neural nets is something that has been somewhat forgotten by the machine learning community, because we're all so busy building self-driving cars. Over the long term, though, it's a very exciting prospect. Thank you very much. Yes, questions?

To begin with, great talk. My question is regarding the lack of interplay with the results in the study of complex networks, like when they study brain networks. There are a lot of publications that talk about the emergence of hubs, and especially a lot of publications on the degree distribution of the inter-neuron network. But then when you look at the degree distribution of the so-called neurons
in deep nets, you don't get to see the emergence of